
Review: NVIDIA's GeForce 6600 GT

by Ryszard Sommefeldt on 7 September 2004, 00:00

Tags: NVIDIA (NASDAQ:NVDA)

Quick Link: HEXUS.net/qa2v


Balanced Output

Returning to a full NV40 for the specifics (either Ultra or GT, as long as it's a four quad part), we see that the basic building blocks of NV40 are the sixteen pixel render pipes, grouped into quads (blocks of four, each working on a 2x2 block of pixels) for efficiency's sake. Before the pixel render pipes sits the geometry setup engine. That's a block of six MIMD (multiple instruction, multiple data) vertex shaders which create geometry, either by emulating the fixed-function transform and lighting hardware of old or by running a vertex program described by one of the new breed of high-level shading languages. Output geometry is fed to the triangle setup engine, which splits the geometry data and assigns it to a quad pipeline using the GPU's scheduler. This part of the render pipe is where vertices, having been transformed into pixel data by the setup engine, can be discarded or otherwise optimised by the Z-buffer optimisation scheme the GPU implements.
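To make the vertex stage a touch more concrete, here's a minimal Python sketch (nothing NV40-specific) of the core job a vertex shader does when emulating the old fixed-function transform and lighting: multiplying each incoming vertex by a combined model-view-projection matrix. The matrix and vertex values are made-up illustrative numbers.

```python
# Toy illustration of the heart of fixed-function T&L as a vertex program:
# transform an object-space vertex into clip space with a 4x4 matrix.
# All values here are made up for the example.

def transform(vertex, matrix):
    """Multiply a 4-component vertex by a 4x4 row-major matrix."""
    return [sum(matrix[row][col] * vertex[col] for col in range(4))
            for row in range(4)]

# A simple translation matrix standing in for a full model-view-projection;
# a real game would fold world, view and projection transforms into it.
mvp = [[1.0, 0.0, 0.0, 5.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]

vertex = [1.0, 2.0, 3.0, 1.0]     # x, y, z, w in object space
print(transform(vertex, mvp))      # position handed on to triangle setup
```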

Pixel fragments are dispatched into the fragment shaders, sixteen of which are present in a full NV40. Each fragment shader has two arithmetic units, each with slightly different capabilities, and it's here that the basic grunt work of the rendering process is done. Fragment program instructions are run and the pipeline is kept busy with fragments, with each arithmetic unit in the fragment shader working, hopefully on multiple chunks of data in parallel, allowing the fragment pipeline to stay busy and achieve peak output. For example, one of the units in the fragment shader is able to do single-cycle normalisation of data inputs. The shader compiler wants to issue those instructions while the other unit is, say, doing a texture lookup or one of the other instructions available. That's the basis of the co-issue and dual-issue ability of NV40, where it's able to run multiple instructions in parallel. Each fragment shader has a texture sampler, able to do a bilinear texture sample in a single cycle. Since you can texture normally and now through the fragment shader too, each shader unit having its own sampler is a basic tenet of NV40's performance.
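As a loose illustration of the dual-issue idea, here's a small Python sketch that greedily pairs independent instructions onto two per-cycle slots. The split of ops between the two units is an assumption made purely for the example, and instruction dependencies are ignored; it just shows why pairing a normalise with, say, a multiply-add halves the cycle count versus issuing one instruction at a time.

```python
# Toy dual-issue packer. The op-to-unit split below is an assumption for
# illustration, and instruction dependencies are ignored entirely; the real
# NV40 shader compiler has far stricter pairing rules.

UNIT0_OPS = {"nrm", "tex"}         # e.g. single-cycle normalise, texture lookup
UNIT1_OPS = {"mad", "mul", "add"}  # general arithmetic

def schedule(instructions):
    """Greedily pack instructions into cycles of (unit0, unit1) slots."""
    assert all(op in UNIT0_OPS | UNIT1_OPS for op in instructions)
    cycles = []
    pending = list(instructions)
    while pending:
        slot0 = slot1 = None
        remaining = []
        for op in pending:
            if slot0 is None and op in UNIT0_OPS:
                slot0 = op
            elif slot1 is None and op in UNIT1_OPS:
                slot1 = op
            else:
                remaining.append(op)
        cycles.append((slot0, slot1))
        pending = remaining
    return cycles

# Four instructions retire in two cycles instead of four when each cycle
# can co-issue one op to each unit.
print(schedule(["nrm", "mad", "tex", "mul"]))  # [('nrm', 'mad'), ('tex', 'mul')]
```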

Output fragments are then sent to the fragment crossbar. The crossbar, a sixteen-entry device in a full NV40, splits the fragment data into bitesize chunks for the ROPs to chew on. The ROPs (render output units) are the pixel engines, where pixels are blended, combined, colour compressed and sampled for anti-aliasing. Each ROP on NV40 comprises a Z ROP and a C (combined Z and colour) ROP, each capable of a Z/stencil buffer write, or a pixel sample for anti-aliasing, in a single cycle. That's where NV40 gets its 32 Z/stencil pixels per cycle output, useful in accelerating rendering where there's a dedicated depth pass before full shading. Think Doom 3. With each ROP able to do two pixel samples per cycle, and a maximum of only two loops through a ROP per pixel, that's a maximum of four pixel samples per ROP. That's where NV40's four-sample multi-sample anti-aliasing limit comes from. While ATI's comparable architectures are happy to let you loop through the ROP three times, that's not the case for NVIDIA. Also, with the ROP fixed in its sample grids, there's no ability for what ATI calls temporal anti-aliasing in NV40.
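The numbers fall straight out of those figures; a quick Python back-of-the-envelope using only the per-ROP rates quoted above:

```python
# Arithmetic taken directly from the figures above.

rops                = 16   # full NV40
samples_per_cycle   = 2    # each ROP: two pixel samples per cycle
max_loops_per_pixel = 2    # at most two trips through a ROP per pixel

print("Max MSAA samples per pixel:", samples_per_cycle * max_loops_per_pixel)  # 4

# With the Z ROP and the C ROP both doing Z/stencil writes (a depth-only
# pass, Doom 3 style), each ROP manages two Z/stencil pixels per cycle.
z_writes_per_rop = 2
print("Z/stencil pixels per cycle:", rops * z_writes_per_rop)                  # 32
```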

The ROPs then send fully processed pixels, after combining whatever buffers they reside in, to the back buffer, from which they're swapped to the (less memory-intensive, since it doesn't need to store depth information) front buffer, once per frame synchronised to the display's vertical retrace (if Vsync is enabled), or as soon as each frame completes, as many times per display refresh as the GPU can manage (if Vsync is off).
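For the curious, a minimal Python sketch of that double-buffered swap; the buffers and the retrace wait are stand-ins, not any real graphics API:

```python
import itertools
import time

VSYNC = True
REFRESH_INTERVAL = 1.0 / 60          # assume a 60Hz display for the example

front, back = {"frame": None}, {"frame": None}

for frame in itertools.count(1):
    back["frame"] = frame            # "draw" the new frame into the back buffer
    if VSYNC:
        time.sleep(REFRESH_INTERVAL) # stand-in for waiting on vertical retrace
    front, back = back, front        # swap: the finished frame becomes visible
    print("displaying frame", front["frame"])
    if frame == 3:                   # stop the toy loop after a few frames
        break
```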

Et voilà, a (basic) description of a full NV40's rendering ability. Now think about it this way: sixteen pixel pipes, fed by six vertex shaders, outputting pixels to sixteen ROPs. NV40 can thus output, thanks to the ROP count, sixteen fully textured, blended and lit pixels per cycle. In a tenuous car analogy, that's sixteen columns of cars racing down a sixteen-lane highway. At peak efficiency, that's a bucketload of pixels per cycle. However, and this is the thing to focus on when thinking about NV43's design, peak efficiency is only going to happen in a limited number of scenarios.

For example, turn on trilinear filtering, where you need two texture samples (and thus a loop back through the pipeline per pixel), and peak pixel fillrate is cut in half. That's straight away, before any costly pixel shading is done. Add in the tendency of current titles to spend a lot of time in their pixel shader programs during rendering and it becomes clear that only in the limited case of single texturing with bilinear filtering are you going to get sixteen pixels per cycle from the ROPs.
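A quick bit of Python arithmetic makes the point; the 400MHz core clock is an assumption for illustration (it's the reference 6800 Ultra figure), and the interesting part is the ratio rather than the absolute numbers:

```python
# Rough fillrate arithmetic for the cases described above.

core_clock_hz = 400_000_000   # assumed NV40 (6800 Ultra) core clock
rops          = 16

peak = core_clock_hz * rops   # one pixel per ROP per cycle
print(f"Single-texture, bilinear: {peak / 1e9:.1f} Gpixels/s")

# Trilinear needs two texture samples per pixel, i.e. an extra loop back
# through the pipeline, so effective fillrate halves before any shader cost.
print(f"Trilinear:                {peak / 2 / 1e9:.1f} Gpixels/s")
```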

Mid-range die space considerations

So given that you're never really going to see peak output in today's render setups, as presented by games and other applications, what's a chip maker to do? The mid-range parts are all about volume. You want to sell as many as possible. To do that, at the basic level, you want as many GPU dies per wafer as you possibly can. The foundry can help you here, producing clean wafers with a high rate of fully working GPUs. But you can also help them out by cutting complexity out of your GPU.

So, realising that peak pixel output is rarely hit in today's games, NVIDIA have chopped the ROPs again, beyond the straight halving that NV43's quad pipe count would already imply.

I let you know in my early 6600-series coverage that NV43 was basically a chopped-in-half full NV40. For NV40's four quads, that's two on NV43. NV40 has six vertex shaders, NV43 has three. So far so good. But where NV40 has sixteen ROPs, NV43 has only four. Think back to the car analogy a few paragraphs back. That's now eight columns of cars screaming down the highway but, oops, there are only four lanes at the end and they all have to merge in. At full speed, that's a problem. You're backing up at the end of the highway and things get messy. You're buffering the cars (pixels) as they enter the new highway (the ROPs), four at a time.

But hey, think of NV43 as a highway in London. Cars don't go that quick. They loop round the side roads (think texture samplers or looped fragment shader output) a bunch of times, seemingly lost, before arriving at their destination. It's no problem for the eight-lane highway to merge into a four-lane one, since traffic down the eight lanes is rarely at its peak.

NV43 can save die space by chopping ROPs, safe in the knowledge that it's (hopefully) not going to impact performance. So while NV43 is eight fragment pipes internally, it outputs only four pixels per clock. The basic ROP functionality is unchanged. You can still do two samples per ROP with a single loopback, blend 64-bit pixel output from large render targets and buffers, and all the other good stuff that NV40 lets you do. There's just half the ROP count per pipe, compared to its big brother.
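A back-of-the-envelope Python sketch of why the chop rarely bites; the shader cycles-per-pixel figures are illustrative assumptions, not measurements:

```python
# Pixel output is capped by the slower of the fragment pipes and the ROPs.

fragment_pipes = 8   # NV43
rops           = 4   # NV43

def pixels_per_clock(shader_cycles_per_pixel):
    shader_rate = fragment_pipes / shader_cycles_per_pixel
    return min(shader_rate, rops)

for cycles in (1, 2, 4, 8):
    print(f"{cycles} shader cycle(s) per pixel -> "
          f"{pixels_per_clock(cycles):.1f} pixels per clock")

# Only the one-cycle case (plain bilinear single texturing) is held back by
# the four ROPs; once trilinear or any real shader work takes a pixel past
# two cycles, the ROPs keep pace with the eight pipes anyway.
```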

But, and it's a big but, the die space saved by the ROP chop is enough for decent transistor savings, power savings (NV43M is in the works) and cost savings, letting NVIDIA hit the mid-range segment running, or so they hope.

Finally, while playing the mid-range save-die-space-whenever-possible game, NVIDIA have given NV43 a native 128-bit memory bus interface with GDDR3 and DDR memory-type support, as opposed to the 256-bit controller on NV40. NV40 isn't a bandwidth-starved architecture (unlike NV3x), so that bus width shouldn't hinder performance too much (which sounds counter-intuitive if you're used to thinking about deeply pipelined CPUs!).
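For a feel for the raw numbers, here's a quick Python comparison; the memory clocks are assumptions based on the reference 6600 GT and 6800 Ultra figures, not anything stated above:

```python
# Peak theoretical memory bandwidth: bus width (in bytes) x effective clock.

def bandwidth_gb_s(bus_width_bits, effective_clock_hz):
    return bus_width_bits / 8 * effective_clock_hz / 1e9

nv43 = bandwidth_gb_s(128, 1_000_000_000)  # 128-bit, 500MHz GDDR3 (1GHz effective)
nv40 = bandwidth_gb_s(256, 1_100_000_000)  # 256-bit, 550MHz GDDR3 (1.1GHz effective)

print(f"NV43 (6600 GT):    {nv43:.1f} GB/s")   # ~16.0 GB/s
print(f"NV40 (6800 Ultra): {nv40:.1f} GB/s")   # ~35.2 GB/s
```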

It's seemingly a balanced architecture, in terms of performance against transistor count and GPU complexity.

Anyway, I've covered a lot of ground here, so let's gather up all the info and slot it into an easy-to-manage, easier-to-understand table format: NV43 vs full NV40.