Caches, compute, backend, ROPs, and architecture summary

Flexible caches - keeping the speed up
NVIDIA has also designed GF100 to keep as much data on the GPU as possible. Much like a CPU's caches, the on-GPU caches run at a considerably faster rate than GPU-connected memory. To this end, Fermi increases the L2 cache – the light-blue section in the middle of the picture – from 256KB to a unified 768KB, and also introduces a configurable 64KB cache per SM - a feature lacking in previous NVIDIA GPUs.
Interestingly, this 64KB can be split either 16KB/48KB or 48KB/16KB between shared memory and L1 cache. The larger on-chip caches also mean that Fermi is better placed to handle applications such as raytracing, where light rays must be calculated on the fly through constant accesses to on-chip memory, because the next set of computation isn't known beforehand.
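As a back-of-the-envelope illustration, the two possible carve-ups of the per-SM 64KB look like this; the configuration names are invented for illustration here and are not NVIDIA's API:

```python
# Sketch of GF100's per-SM memory split: 64KB shared between
# shared memory and L1 cache, in one of two configurations.
# The config names below are made up for this example.
SM_LOCAL_KB = 64

CONFIGS = {
    "prefer_shared": {"shared_kb": 48, "l1_kb": 16},
    "prefer_l1":     {"shared_kb": 16, "l1_kb": 48},
}

for name, cfg in CONFIGS.items():
    # Each split must account for the full 64KB of per-SM storage.
    assert cfg["shared_kb"] + cfg["l1_kb"] == SM_LOCAL_KB
    print(name, cfg)
```

A cache-heavy workload such as raytracing would lean towards the 48KB L1 split, while kernels with heavy inter-thread communication would prefer the larger shared-memory allocation.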
Compute machine and multitasking - but GTX 4x0 is crippled
As much a general-purpose computer as a GPU, the parallel architecture is also designed with the high-performance computing segment in mind. The enhanced cache structure, detailed above, helps with general computation, and GF100's adherence to the IEEE 754-2008 floating-point standard means that it can run high-accuracy, double-precision work at an increased rate compared to anything NVIDIA has designed before.
Delve a little deeper, handily not mentioned in any briefing, and NVIDIA is limiting the double-precision speed of the desktop GF100 part to one-eighth of single-precision throughput, rather than the one-fifth rate of the Radeon HD 5000-series. We'll have to wait for the Tesla parts before that's restored to the one-half rate the GF100 silicon is capable of.
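To put those ratios into rough numbers, here's a quick sketch; the core count and shader clock are assumptions for illustration, not figures from this article:

```python
# Rough sketch of what the double-precision cap means in practice.
# Core count and shader clock below are assumed figures, chosen
# only to make the ratios concrete.
CORES = 480          # assumed CUDA core count
CLOCK_GHZ = 1.4      # assumed shader clock, in GHz

# Peak single-precision: each core can issue one fused multiply-add
# (two floating-point ops) per clock.
sp_gflops = CORES * 2 * CLOCK_GHZ

dp_capped = sp_gflops / 8    # desktop GF100: one-eighth of SP
dp_native = sp_gflops / 2    # the one-half rate reserved for Tesla

print(f"SP peak:       {sp_gflops:.0f} GFLOPS")
print(f"DP (1/8 cap):  {dp_capped:.0f} GFLOPS")
print(f"DP (1/2 rate): {dp_native:.0f} GFLOPS")
```

Under these assumed clocks the cap costs the desktop part three-quarters of its potential double-precision throughput - a significant product-segmentation lever.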
GF100 also supports a range of programming models, from C- and C++-based CUDA to OpenCL, PhysX, and OptiX raytracing. NVIDIA hopes to encourage a greater number of developers to use its new 'compute machine' for solving complex calculations. The GPU also supports faster switching between contexts - PhysX and rendering, for example.
Back-end, antialiasing, and the memory-controller
As NVIDIA has designed GF100 to keep as much communication on the GPU as possible, the back-end is significantly improved when compared to GeForce GT 200. The ROPs (raster operations units) now number a maximum of 48, grouped in six partitions of eight (dark-blue, just behind the L2 cache), and, NVIDIA claims, deliver improved antialiasing performance thanks to better colour compression, such that applying 8x AA takes only a minor performance hit when compared to traditional 4x AA.
Launched with the GeForce 8-series GPUs, coverage sample antialiasing (CSAA) is NVIDIA's attempt to provide greater AA precision at a modest computational cost. CSAA is improved in GF100 with the ability to run 24 coverage samples per pixel in addition to the eight 'proper' MSAA samples. The new mode therefore allows for 32 samples in total and 33 levels of transparency, enabling 'sharp-edged' objects to better blend in with their surroundings.
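The arithmetic behind those sample and transparency figures can be sketched as follows:

```python
# Sample budget for GF100's 32x CSAA mode, as described above.
msaa_samples = 8        # full colour/depth MSAA samples
coverage_samples = 24   # cheap, coverage-only samples added by CSAA

total_samples = msaa_samples + coverage_samples

# With n coverage samples per pixel there are n + 1 distinct
# coverage states, from fully covered to fully uncovered -
# hence the 33 levels of transparency.
transparency_levels = total_samples + 1

print(total_samples)        # 32
print(transparency_levels)  # 33
```

The key trade-off is that coverage samples store only a coverage bit rather than full colour and depth, which is why 24 of them can be added so cheaply.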
The ROPs then link out to the memory controller, just as in any GPU design, but this time to six partitions that are each 64 bits wide, for a 384-bit aggregate bus. This means GF100 has a narrower memory interface than GeForce GT 200's 512-bit design, but the deficit is offset by the use of high-speed GDDR5 RAM. Clearly, the balance of resources has shifted in favour of keeping as much as possible on the GPU - a common theme.
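A quick sketch shows how the narrower bus can still come out ahead on bandwidth; the per-pin data rates below are assumed figures for illustration, not numbers from this article:

```python
# How a narrower bus can still win on bandwidth: six 64-bit
# partitions of fast GDDR5 versus GT200's wider GDDR3 interface.
# The transfer rates are assumptions chosen for illustration.

def bandwidth_gbs(bus_bits: int, data_rate_gtps: float) -> float:
    """Peak bandwidth in GB/s: bus width in bytes x transfer rate."""
    return (bus_bits / 8) * data_rate_gtps

gf100 = bandwidth_gbs(6 * 64, 3.6)   # 384-bit GDDR5, assumed 3.6 GT/s
gt200 = bandwidth_gbs(512, 2.2)      # 512-bit GDDR3, assumed 2.2 GT/s

print(f"GF100: {gf100:.1f} GB/s")
print(f"GT200: {gt200:.1f} GB/s")
```

Under these assumed rates the 384-bit GDDR5 setup delivers around 173GB/s against roughly 141GB/s for the wider GDDR3 bus - the memory speed more than compensating for the lost width.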
NVIDIA's GPU thinking has changed significantly with the introduction of Fermi. By breaking down the huge GPU into practically self-contained mini-GPUs known as Graphics Processing Clusters, the company is giving us more than an inkling of its future designs. Mini-GPUs enable particular focus on geometry setup, and the greater levels of on-chip cache indicate that Fermi has been designed with GPGPU firmly in mind.
The full-fat GF100 chip is massive; there are no two ways about that. 3bn-plus transistors are a heck of a silicon price to pay for what NVIDIA hopes is Radeon-beating performance.
Based on GF100 (Fermi), NVIDIA is releasing GeForce GTX 480 and GTX 470 for the desktop, and neither will take advantage of the complete architecture, so let's head on over for a summary of the feeds and speeds.