90nm NVIDIA G7x Architecture

Largely inheriting from G70, the new 90nm G71 and G73 chips share a slightly evolved architecture. Here's the technical overview. Please skip if it's not your thing and we'll pick up the story after the innards of the chips are discussed.
Vertex processors

Starting from the beginning, the vertex hardware is unchanged. 8 MIMD (multiple instruction, multiple data) vertex processors are arranged in SIMD (single instruction, multiple data) fashion, each capable of a 5D vector MADD instruction per clock. Those fully FP32 vec4 + scalar units allow a full four-component vertex transformation per clock cycle, per unit.
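As a quick back-of-the-envelope sketch, the per-clock arithmetic throughput of that vertex array falls straight out of the figures above (counting a MADD as one multiply and one add per component):

```python
# Peak per-clock MADD throughput for the 8-unit vertex array described above.
VERTEX_UNITS = 8
COMPONENTS_PER_ISSUE = 5   # vec4 + scalar, i.e. a 5D MADD issue per unit
FLOPS_PER_MADD = 2         # one multiply plus one add per component

vertex_flops_per_clock = VERTEX_UNITS * COMPONENTS_PER_ISSUE * FLOPS_PER_MADD
print(vertex_flops_per_clock)  # 80 FLOPs per clock across the array
```

Multiply by the vertex clock of a given SKU for an absolute peak figure.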
Each processor supports the DirectX9 Vertex Shader 3.0 specification, including support for texturing. A 32-bit texture address unit and texture sampler combine to allow that functionality. The sampler supports simple point sampling without any filtering. All supported shader instructions are single cycle and the units feed into a high-speed rasteriser.
Rasteriser

G71 shares G70's raster pattern and operation, generating pixel fragments to be shaded from input geometry data. The raster hardware runs at the same clock as the (pixel) fragment shaders, and is able to reject 256 pixels per clock. Fragments output into a FIFO (first-in, first-out) buffer ready for the fragment hardware to process. The FIFO matches the batch size of the FP units at just over 1000 fragments (256 quads).
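The "just over 1000 fragments" figure is simply the quad batch size multiplied out:

```python
# The raster FIFO depth matches the fragment hardware's batch size.
QUAD_SIZE = 4        # fragments per quad (a 2x2 block of pixels)
BATCH_QUADS = 256    # batch size quoted above

batch_fragments = BATCH_QUADS * QUAD_SIZE
print(batch_fragments)  # 1024, the "just over 1000 fragments" figure
```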
Fragment processors

The fragment hardware is 4D-wide, offering two vec3 + scalar MADDs per cycle, per fragment, operating in quads. The units support the DirectX9 Pixel Shader 3.0 specification, including support for dynamic branching. Each unit is made up of two sub ALUs (arithmetic and logic units), each capable of the aforementioned 4D MADD instruction issue, along with issue of a variety of other instructions dedicated to fulfilling the PS 3.0 spec.
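A sketch of the peak per-pipe arithmetic rate implied by that dual sub-ALU design, before any texturing stalls are accounted for:

```python
# Per-fragment-pipe MADD throughput for the dual sub-ALU design described above.
SUB_ALUS = 2         # sub-unit one and sub-unit two
COMPONENTS = 4       # 4D-wide (vec3 + scalar) issue per sub-ALU
FLOPS_PER_MADD = 2   # one multiply plus one add per component

flops_per_pipe_per_clock = SUB_ALUS * COMPONENTS * FLOPS_PER_MADD
print(flops_per_pipe_per_clock)  # 16 FLOPs per pipe per clock, peak
```

Multiply by the pipe count and core clock of a given SKU for a chip-wide peak.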
Sub-unit one is also the texture address processor, so while it's computing a texture address it can't issue a shader instruction. Sub-unit two can still shade while sub-unit one is texturing, though, unless the texture sampler is filtering anisotropically, in which case the entire FP pipeline is stalled pending the result.
The sampler hardware is capable of single-cycle bilinear filtering, taking four texel samples per cycle. Additionally, trilinear and anisotropic filtering are supported, the hardware offering 16x aniso (128-tap) maximum, filtering heaviest at 45 degree angle multiples. More on that later.
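The "128-tap" figure for 16x aniso follows from the sample counts just given: each anisotropic step is a trilinear lookup, itself two bilinear lookups of four texels each.

```python
# Where the 128-tap figure for 16x anisotropic filtering comes from.
BILINEAR_TAPS = 4        # four texel samples per bilinear lookup
TRILINEAR_LOOKUPS = 2    # two mip levels blended per trilinear sample
MAX_ANISO = 16           # maximum anisotropy supported

max_taps = MAX_ANISO * TRILINEAR_LOOKUPS * BILINEAR_TAPS
print(max_taps)  # 128, matching the "128-tap" maximum above
```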
FP quads operate in instruction lock-step, each processing the same pixelshader or texture instruction per cycle. Each quad possesses a local 'L1' texel cache and they all share an 'L2' cache. Texture cache associativity remains the same as NVIDIA's 110nm NV4x and G70 products, too. G70's improvements over NV4x when fetching larger texel formats from cache are present and correct as well.
Each quad of units is able to output four processed fragments per cycle. NVIDIA gamble on that not being the case in the majority of situations, supplying a lesser number of individual ROPs able to work on waiting fragments for their 90nm G7x endeavours.
The imbalance of FP units to ROPs relies on the use of pixelshader programs that, through the use of texturing and a large number of shader instructions, can't realistically output their block of four fragments per cycle anyway.
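That imbalance argument can be sketched numerically. The unit counts below (24 fragment pipes, 16 ROPs) are illustrative assumptions for a G71-class part, not figures from the text; the point is that once the average shader takes enough cycles per fragment, a smaller ROP bank keeps up:

```python
# Illustrative G71-class unit counts; both figures are assumptions.
FRAGMENT_PIPES = 24
ROPS = 16

def rops_sufficient(shader_cycles_per_fragment: float) -> bool:
    """ROPs keep up when the fragment output rate <= ROP consumption rate."""
    output_rate = FRAGMENT_PIPES / shader_cycles_per_fragment  # fragments/clock
    return output_rate <= ROPS

print(rops_sufficient(1.0))  # False: single-cycle shaders would swamp the ROPs
print(rops_sufficient(1.5))  # True: 24/1.5 = 16 fragments/clock, exactly matched
```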
So after processing with those resources, finished fragments are fed into a small waiting FIFO (likely less than 100 fragments deep), which when full will stall the FP pipeline, ready for the ROP hardware to chew on.
Render Output Units (ROPs)

The collection of ROPs are what process finished fragments ready for output into the framebuffer. The ROP bank has access to the rasteriser, able to tell that hardware to discard geometry that won't make it to the ROP, saving fragment rendering power and bandwidth.
With the Z and colour buffers the largest consumers of on-board memory bandwidth, they each have their own caches that the ROP can read from, and the ROP supports lossless compression of both colour and Z. Each ROP is actually made up of two sub units, each able to do Z and colour (C) ops.
This is contrary to NVIDIA's published documentation on the ROP hardware, which states that the two sub-units have Z/C and Z-only ability respectively. Each singular ROP pipe is therefore capable of a double Z or double colour write rate to the framebuffer, when the other buffer is masked off in the shader renderstate.
The concept is largely explained by each ROP having an internal bandwidth for operations, balanced with the need to consume on-board memory bandwidth while working. If there's available internal bandwidth and sufficient output bandwidth to the memory modules, with both needed for sustaining a double rate for Z or C writes, it can do the work.
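A minimal sketch of that double-rate behaviour, assuming (per the description above) that each of the two sub-units can handle either a Z or a colour op each cycle:

```python
def rop_rates(colour_masked: bool, z_masked: bool) -> dict:
    """Per-ROP, per-clock write rates for the two-sub-unit ROP described above.

    Each sub-unit can perform either a Z or a colour (C) op per cycle, so
    masking one buffer frees both sub-units for the other.
    """
    sub_units = 2
    if colour_masked and not z_masked:
        return {"z": sub_units, "c": 0}   # double Z rate (e.g. a Z-only pass)
    if z_masked and not colour_masked:
        return {"z": 0, "c": sub_units}   # double colour rate
    return {"z": 1, "c": 1}               # one sub-unit on each buffer

print(rop_rates(colour_masked=True, z_masked=False))   # {'z': 2, 'c': 0}
print(rop_rates(colour_masked=False, z_masked=False))  # {'z': 1, 'c': 1}
```

The double rate is of course only sustainable when the internal and on-board memory bandwidth conditions described above are met.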
Each ROP is responsible for antialiasing, too, with the units sharing their G70 brethren's ability for up to 4x multisampling (taking 2 cycles), transparency antialiasing (multisampling and supersampling) for AA of fragment interiors, and gamma correction of samples before pixel resolve. Each ROP contains a blend unit, too, which has increased performance compared to its G70 brethren, especially when blending pixels in FP16 rendertargets.
Pixel resolve finishes off rendering and outputs a colour value to the framebuffer for showing on your display device. Output is done via a set of 64-bit memory channels making up a 256-bit internal memory bus. Et voilà, rendering is complete. That entire pipeline, with entire tomes possible on the nuances and specifics of the described internal architecture, is what powers 90nm G7x.
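The channel count falls straight out of the widths above, and an illustrative bandwidth figure follows if a memory clock is assumed (the 800MHz DDR figure below is an assumption for a GDDR3-equipped SKU, not from the text):

```python
# Memory channel and bandwidth sketch for the 256-bit bus described above.
BUS_WIDTH_BITS = 256
CHANNEL_WIDTH_BITS = 64
channels = BUS_WIDTH_BITS // CHANNEL_WIDTH_BITS   # 4 independent channels

MEM_CLOCK_HZ = 800e6   # assumed GDDR3 clock; varies per SKU
DDR_FACTOR = 2         # double data rate: two transfers per clock

bandwidth_gbs = BUS_WIDTH_BITS / 8 * MEM_CLOCK_HZ * DDR_FACTOR / 1e9
print(channels, f"{bandwidth_gbs:.1f} GB/s")  # 4 channels, 51.2 GB/s
```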
SLI

NVIDIA's Scalable Link Interface technology is supported in the new 90nm G7x chips with two SLI interfaces. Each bi-directional interface is clocked at 1GHz and is 8-bits wide, able to transfer a single byte per cycle. Each interface is slightly reworked in terms of memory connection, with the end result that the interfaces can transfer antialiasing sample data. SLI AA performance should therefore be usefully up compared to G70-based boards.
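The per-link transfer rate follows directly from those figures: one byte per cycle at 1GHz is 1GB/s per interface, per direction.

```python
# SLI link bandwidth from the figures above.
INTERFACES = 2
WIDTH_BITS = 8       # one byte per cycle
CLOCK_HZ = 1e9       # 1GHz per interface

per_interface_gbs = WIDTH_BITS / 8 * CLOCK_HZ / 1e9  # per direction
total_gbs = per_interface_gbs * INTERFACES
print(per_interface_gbs, total_gbs)  # 1.0 GB/s each, 2.0 GB/s combined
```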
Summary

So it's the same basic architecture as G70, at least in terms of vertex and fragment processing. Rendering can be 16-bit floating point per channel throughout the entire render pipe if needed; the hardware supports a large range of flexible surface formats for rendertargets and other texturing; texture compression is supported via a variety of 1- and 2-way formats; and the entire chip is set up to be predictable and programmer friendly.
SLI performance should be up, FP16 blending performance should be up, and the dual MADD issue per cycle in the fragment hardware (when not texturing or filtering) is largely what made GeForce 7800 GTX so great on introduction, and 7800 GT and 7800 GTX 512 good value and high performance respectively when they were introduced. G71 doesn't throw any of that away.
The entirety of Shader Model 3.0 is supported in usable, working form with no workarounds, and really the only downsides to the chip are in terms of image quality.
The hardware can't natively multisample floating point surfaces with per-channel precision greater than 8 bits. Those are the very formats that enable larger dynamic ranges in rendering, so 'HDR' rendering techniques can't be combined with multisample antialiasing without compromise or extra work.
The hardware is also lacking, when compared to the competition at least, in terms of multisample quality (only four subpixel samples per ROP with fixed positions) and texture filtering (large tap counts possible but only on fixed, limited angles).
It seems a slightly unbalanced architecture on the surface of it, with a surfeit of texture sampling ability, possibly without the larger memory bandwidth to fully service it when fully stressed (depending on clocks and SKU). But with a high performance texel cache and the rest of the chip well setup for fast FP processing, G7x realistically isn't deficient architecturally in any area other than final IQ.
Let's have a look at G71 and G73 to see how they implement the basic building blocks.