NVIDIA G80 - High-level overviewThe following diagram, courtesy of moi at Beyond3D and part of the deep and dirty hardware analysis done over there, doesn't contain details on caches or other memories, nor much about some D3D10 specific parts of the chip such as constant buffer sampling, GS streamout and the like. Bear that in mind before you take a look at G80, clicking for the full version.
So with that on your screen somewhere, let's explain a few things. Data flow is mostly from items 1 to 16 through the chip, following the arrows that connect functional blocks on the diagram. Input data hits the thread setup units (likely just a triplet of fairly large FIFOs that contain object elements and their attributes that need to be scheduled and processed), before hitting the global scheduler. Pre-shading optimisations are performed and data is sent to the local cluster schedulers for execution on a particular cluster.
The global and local cluster scheduler's implement G80's hardware load balancing effectively, organising data per cluster, the local schedulers sleeping and waking the threads in their control as needed and for free in any given cycle, with per-thread data kept in the per-cluster register files. The L1 store per cluster is connected to the shared L2 store (a combination which allows cross-cluster communications) and finished data from a cluster flows to the ROP for further processing, or back to the global scheduler (via on-board VRAM if needed) for further assignment to other upcoming threads.
Shading takes place on 128 scalar ALUs, grouped in clusters of 16 (4 quads effectively, as a concern when pixel shading), each cluster capable of 16 scalar MADD+MULs per cycle, 16 scalar interpolations or 1/4 of 16 special function (SF) ops, and 8 bilerps of data filtering (4 pixels of address and pixel setup), all in one clock. That shading, interpolation/SF and data fetch and filtering setup is the core of G80. Each SP is basically a slot waiting to be filled in any given cycle, G80 doing away with the fixed-ratio vector processing of all its consumer forebears.
It's therefore in strict contrast to every non-unified, fixed-ratio vector-based GPU ever unleashed into the consumer space. The target is clearly D3D10 (witness the geometry shading thread phase at the front of the chip in the diagram). The chip maintains a large in-flight thread count in order to keep the hardware as busy as possible and hide texturing latency as much as it possibly can, and thread can be swapped onto and off of the clusters for free, new threads always ready on a per cycle basis.
The ROP -- capable of brand new methods of AA and at higher quality and rates than any other single GPU to date -- and memory controller will do their thing before outputting finished data to regions of VRAM, be they the backbuffer or some other intermediary space for further processing elsewhere (back on the CPU, back into the GPU as a texture, whatever the developer asks for as constrained by their API), and then G80's companion IC, NVIO (not shown) will take care of any display on consumer hardware. NVIO-less G80 is of course a proposition to consider, but for a different article.
That's basically it (and a testament to how clean the design is in places). What you see in front of you there is a hardware-level load-balanced, completely scalar and completely unified shading architecture for D3D9 and D3D10. The first for the PC in 3D history, too. The diagram contains some unit counts and rates that are relevant to talking about the chip in terms of numbers, so we'll sum those up for you on the next page. This is the exciting bit for most folks, so let's be cute and compare it not only to NVIDIA's last generation high-end GPU, but ATI's as well.
If you want the deep nitty gritty of the G80 architecture and NVIO then we've got the analysis for you waiting at Beyond3D.