Item: NVIDIA GeForce 7800 GTX P
Author: Ryszard Sommefeldt

G70 vs NV40

	*NVIDIA G70*	NVIDIA NV40/NV45
Manufacturing Process	110nm @ TSMC	130nm @ TSMC
Transistor Count	302M	222M
Fragment Processor Spec	Pixel Shader 3.0
Fragment Processor Count	24	16
Fragment Processor Setup	2 sub shader ALUs, each with mini-ALU, fog ALU, texture ability, per unit	2 sub shader ALUs, each with mini-ALU, fog ALU, texture ability, per unit
Fragment Processor Precisions	FP32, FP16
Vertex Processor Spec	Vertex Shader 3.0
Vertex Processor Count	8	6
Vertex Processor Setup	Texture unit (no filtering), FP32 Vec4 ALU, FP32 Scalar ALU
Texture Units	1 per fragment ALU, single-cycle bilinear, trilinear, 16X anisotropic, FP16 filtering, L1 cache. Shared L2 texture cache (16KiB estimated)
ROPs	16, each with Z ROP and compress, colour ROP and compress, multi-sample anti-aliasing (2 loops), FP blend unit	16, each with Z ROP and compress, colour ROP and compress, multi-sample anti-aliasing (2 loops), FP blend unit
Anti-aliasing abilities	Gamma-correct 4X RGMS, 2X OGSS, combination	4X RGMS, 2X OGSS, combination
Bus interconnect	PCI Express	AGP8X (NV40), PCI Express (NV45)
Core speed	430MHz (7800 GTX)	425MHz (NV45-based 6800 Ultra)
Memory speed and type	600MHz DDR GDDR3 (7800 GTX)	550MHz DDR GDDR3 (NV45-based 6800 Ultra)
Memory Bandwidth	38.4GB/sec	35.2GB/sec
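
The bandwidth row follows from the memory clocks in the row above it. As a rough sketch of that arithmetic, assuming a 256-bit memory bus on both boards (the table doesn't list the bus width, so treat that figure as an assumption) and two transfers per clock for DDR; the function name is made up for illustration:

```python
def memory_bandwidth_gb_s(clock_mhz: float,
                          bus_width_bits: int = 256,
                          transfers_per_clock: int = 2) -> float:
    """Peak memory bandwidth in GB/sec (decimal gigabytes).

    bus_width_bits=256 is an assumption, not a figure from the table.
    transfers_per_clock=2 models DDR signalling.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9


g70_bandwidth = memory_bandwidth_gb_s(600)   # GeForce 7800 GTX
nv45_bandwidth = memory_bandwidth_gb_s(550)  # NV45-based 6800 Ultra

print(f"7800 GTX:   {g70_bandwidth:.1f} GB/sec")
print(f"6800 Ultra: {nv45_bandwidth:.1f} GB/sec")
```

Under those assumptions the two results line up with the 38.4GB/sec and 35.2GB/sec figures quoted in the table.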

Fragment processor differences

Need to brush up on how a modern immediate-mode GPU is put together? Give this page a look. The fragment processor (usually called a pixel processor) processes fragments output by the GPU's rasteriser, which in turn creates rasterised fragments from the geometry spat out by the vertex hardware. So vertex hardware is first in the render chain, but since G70's main differences compared to NV40 are in the fragment units, I'll cover those first.

NV40 and G70's fragment units are each made up of a pair of sub-units. Sub-unit one in NV40 can texture (feed 'texture' data, which doesn't have to be a coloured image, into the fragment program being run by the fragment units), and issue a MUL vector instruction or use its mini-ALU to issue a non-vector instruction like RSQ (reciprocal square root). Sub-unit two can issue a MADD vector instruction (single-cycle MUL and ADD combined) or use its own mini-ALU, which has the same capability as the mini-ALU attached to sub-unit one.

G70 differs in sub-unit one, which can now issue a MADD as well. Everything else is the same in terms of ALU ability (all mini-ALU instructions are still single-cycle). So G70 widens internally with the power to run two MADDs on a pair of vec4 vectors, in SIMD. That's twice the SIMD MADD power of NV40, per cycle. NVIDIA's reasoning (which flies in the face of the reasoning they gave for not allowing sub-unit one to issue a MADD for NV40!) is that the majority of complex fragment shader programs being run today in modern and upcoming game titles will make heavy use of the MADD instruction, which can be used for calculating vector dot products (indeed, the single-cycle vec4 MADD is equivalent to a single-cycle DP4 instruction).

Calculating vector dot products is an integral part of many of the fragment shader effects that it's desirable to run on a 3D GPU. NV35 could issue two MADDs per cycle, per fragment ALU, and now G70 regains that processing ability.
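
To make the MADD and dot product relationship concrete, here's a minimal sketch of the arithmetic only. It says nothing about how the hardware schedules these operations, and the names `madd` and `dp4` are simply borrowed from the instruction mnemonics for illustration:

```python
from typing import Tuple

Vec4 = Tuple[float, float, float, float]


def madd(a: Vec4, b: Vec4, c: Vec4) -> Vec4:
    """Component-wise multiply-add: a * b + c on each of the four lanes."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


def dp4(a: Vec4, b: Vec4) -> float:
    """Four-component dot product: the sum of the per-lane products."""
    return sum(x * y for x, y in zip(a, b))


# A diffuse lighting term is a typical use: the dot product of a surface
# normal and a light direction (both unit length, w lane unused here).
normal = (0.0, 0.0, 1.0, 0.0)
light_dir = (0.0, 0.6, 0.8, 0.0)
n_dot_l = dp4(normal, light_dir)

# The same four products fall out of a MADD with a zero addend; summing
# its lanes gives the dot product again.
zero = (0.0, 0.0, 0.0, 0.0)
lane_products = madd(normal, light_dir, zero)
```

Both instructions do four multiplies on a vec4 pair, which is the sense in which the two are comparable work for the ALU.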

With 24 of those new ALUs inside a full G70, each able to issue two MADDs, compared to 16 single-MADD ALUs in NV40, there's 3 times the MADD horsepower per cycle in a full G70, compared to a full NV40. That ALU throughput is key to G70's new performance in fragment programs. Again, the units are dual-issue, so the sub-units can combine to issue two independent instructions on a 4-component input vector, with either a 3:1 or 2:2 instruction split. For example, the hardware can issue a MADD across 3 components of the vector and square the fourth, in a single cycle.
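
The 3x figure and the 3:1 split can both be sketched in a few lines. This is illustrative arithmetic under the unit counts given above, not a model of the real scheduler; the helper names are hypothetical:

```python
def madds_per_clock(fragment_units: int, madd_sub_units: int) -> int:
    """vec4 MADDs a chip can issue per clock across all fragment units."""
    return fragment_units * madd_sub_units


nv40_madds = madds_per_clock(fragment_units=16, madd_sub_units=1)
g70_madds = madds_per_clock(fragment_units=24, madd_sub_units=2)
madd_ratio = g70_madds / nv40_madds


def co_issue_3_1(a, b, c):
    """One possible 3:1 split: MADD on x, y, z and a square on w."""
    xyz = tuple(a[i] * b[i] + c[i] for i in range(3))  # 3-component MADD
    w = a[3] * a[3]                                    # 4th lane squared
    return xyz + (w,)


result = co_issue_3_1((1.0, 2.0, 3.0, 4.0),
                      (2.0, 2.0, 2.0, 2.0),
                      (1.0, 1.0, 1.0, 1.0))
```

Here `madd_ratio` works out to 3.0 (48 MADDs per clock against 16), and `result` holds the MADD output in its first three lanes with the squared input in the fourth.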

Finally, the register file for G70 didn't get any bigger, compared to NV40. So FP32 operations might still experience register space pressure and not run at full theoretical speed.

Texture processor differences

In terms of features, the texture samplers attached to the fragment units are no different in G70, compared to NV40. However, NVIDIA tell me that when fetching large textures in preparation for filtering, G70's samplers have less latency pulling those textures out of memory. So while fragment programs always seek to hide texturing latency, since a texture fetch always takes much longer than a single cycle, there's less latency to hide with G70. The samplers still perform single-cycle bilinear filtering, two-cycle trilinear and up to 16X anisotropic (128-tap) filtering, just like NV40 does.
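
The 128-tap figure for 16X anisotropic filtering is easy to reconstruct. A small sketch, assuming each anisotropic probe is a full trilinear sample (my reading of where the number comes from, not something NVIDIA spell out):

```python
BILINEAR_TAPS = 4  # 2x2 texel footprint from a single mip level


def trilinear_taps() -> int:
    """Trilinear blends bilinear samples from two adjacent mip levels."""
    return 2 * BILINEAR_TAPS


def anisotropic_taps(degree: int) -> int:
    """Texel taps for N-degree anisotropic filtering.

    Assumption: every probe along the line of anisotropy is a
    trilinear sample.
    """
    return degree * trilinear_taps()


taps_16x = anisotropic_taps(16)
print(f"16X anisotropic: {taps_16x} taps")
```

That also squares with the cycle counts above: trilinear needs twice the taps of bilinear, and takes two cycles instead of one.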

I speculate that the L2 texture cache has increased in G70 (to 32KiB, from 16KiB in NV40) to go with the increase in fragment units. I'd also guess it's been tweaked for better cache reuse with larger textures, decompressing those larger textures into L1 faster and possibly offering more granularity in cache access by the GPU, which would reduce texture bandwidth and speed up rendering.

NVIDIA have added support for ATI's 3Dc compression format via the driver, which converts the 3Dc 2-channel format into the V8U8 2-channel, 16-bit texture format. The GeForce 6 series of products, based on NV4x GPUs, gets the same functionality. Finally, the texture filtering hardware still supports accelerated DST rendering, as made famous by 3DMark05.
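
Two-channel formats like these are typically used for normal maps, where only X and Y are stored and Z is rebuilt in the shader from the unit-length constraint. A minimal sketch of that reconstruction follows. The signed-byte decode is an assumed convention for V8U8 (signed 8-bit per channel, with -128 clamped to -1.0), and nothing here describes what NVIDIA's driver actually does internally:

```python
import math


def v8u8_to_unit(value: int) -> float:
    """Decode a signed 8-bit channel to the range -1.0..1.0.

    Assumption: value / 127 with -128 clamped to -1.0.
    """
    return max(value / 127.0, -1.0)


def reconstruct_normal(x: float, y: float):
    """Rebuild Z for a unit-length normal from its stored X and Y."""
    z_squared = 1.0 - x * x - y * y
    z = math.sqrt(max(0.0, z_squared))  # guard against rounding below zero
    return (x, y, z)


# A texel pointing straight out of the surface, and one tilted along X.
flat = reconstruct_normal(v8u8_to_unit(0), v8u8_to_unit(0))
tilted = reconstruct_normal(0.6, 0.0)
```

The rebuild costs a few ALU instructions per fragment, which is the kind of shader math the extra MADD capability helps with.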

Vertex processor differences

There are two more units in a full G70 (as you get in GeForce 7800 GTX), each with FP32 scalar and vector ALUs, and a texture sampler (which does no FP filtering, but is needed to comply with the VS3.0 spec). Since they're VS3.0-capable vertex processors, they can do dynamic branching for the vertex programs. Each vertex unit can issue one vec4 MADD per clock, and scalar performance from the FP32 scalar hardware is apparently up by 20%.

So that's 33% more vertex units, each with more performance per clock, presumably with a tweaked vertex fetch unit to go with them (although NVIDIA won't confirm either way). The triangle setup and rasteriser, the hardware that feeds the fragment units, is apparently optimised via the use of a new raster pattern, but NVIDIA don't go into details.

Raster output differences

The ROP hardware gets a tweak for G70, too. It can now do gamma-correct multi-sampling entirely in hardware, whereas I think NV40 did some pre-processing for gamma correction in the fragment hardware for GeForce 6 and FX 4400. It can also anti-alias textures with an alpha component used for transparency effects. NVIDIA's favourite example is the chain-link fence: the fence isn't made up of geometry, rather it's a texture with some see-through parts (using the alpha component of the texture surface). G70 can sample inside a pixel for alpha, anti-aliasing internally to the pixel if it detects alpha. Doing so lessens performance slightly, per clock, as memory bandwidth is consumed.
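
For anyone wondering what gamma-correct multi-sampling buys you, here's a rough sketch of the difference when resolving the samples of one pixel. It assumes a simple power-law display gamma of 2.2, which is a stand-in I've picked for illustration and not the curve the hardware uses:

```python
GAMMA = 2.2  # assumption: simple power-law display gamma


def to_linear(value: float) -> float:
    """Convert a display-space intensity (0..1) to linear light."""
    return value ** GAMMA


def to_display(value: float) -> float:
    """Convert a linear-light intensity (0..1) back to display space."""
    return value ** (1.0 / GAMMA)


def resolve_naive(samples):
    """Average the stored sample values directly."""
    return sum(samples) / len(samples)


def resolve_gamma_correct(samples):
    """Average in linear light, then convert back for display."""
    linear_mean = sum(to_linear(s) for s in samples) / len(samples)
    return to_display(linear_mean)


# A 4X multi-sampled pixel on a black/white edge: two samples each side.
edge_pixel = [0.0, 0.0, 1.0, 1.0]
naive = resolve_naive(edge_pixel)
corrected = resolve_gamma_correct(edge_pixel)
print(f"naive: {naive:.3f}, gamma-correct: {corrected:.3f}")
```

The naive average gives 0.5, which a display shows as darker than half brightness. The gamma-correct resolve stores a higher value so the edge pixel actually looks half covered.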

The ROPs still have a two-loop limit for Z sampling, with up to two Z samples per cycle. So the hardware limit of 4X multi-sample anti-aliasing remains (2 samples per cycle, two passes through the Z ROP). NVIDIA are loath to mention if their Z and colour compression schemes have had any tweaks, although if they have I assume they're fairly mild. The ROPs, which run at core clock in G70, as they do in NV40, can still blend floating point rendertargets and off-screen buffers, but those targets can't be multi-sampled, NVIDIA choosing not to spend transistor budget in the same way that ATI appear to be doing for their next-gen PC hardware. Alpha-to-mask MSAA is a decent trade-off compared to full MSAA of float targets, though. Any increase in IQ when anti-aliasing on NVIDIA hardware is welcome, given their stiff competition in that respect.

Like NV43 and similar, G70's ROP count is less than the fragment unit count. While G70 can theoretically output 24 pixels from its fragment hardware, per cycle, there are only 16 input buckets (the ROPs) to dump those pixels into, for further processing. NVIDIA's reasoning is that when you're shader bound, pixels output per cycle rarely hits that peak. A ROP count of 16 lets NVIDIA save transistors. Should more than 16 pixels be ready in a cycle, they're buffered until the raster output hardware is ready to work on them. Peak pixel fillrate is therefore nearly identical to that of the 5MHz-slower 6800 Ultra, which shares the ROP count of G70.
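
The fillrate point comes down to whichever stage is narrower. A quick sketch using the unit counts and clocks from the table, with a hypothetical helper name:

```python
def peak_pixel_fillrate_mpixels(fragment_units: int,
                                rops: int,
                                core_mhz: int) -> int:
    """Peak pixels written per second, in Mpixels/sec.

    The narrower of the two stages sets the per-clock ceiling.
    """
    pixels_per_clock = min(fragment_units, rops)
    return pixels_per_clock * core_mhz


g70_fillrate = peak_pixel_fillrate_mpixels(24, 16, 430)   # 7800 GTX
nv45_fillrate = peak_pixel_fillrate_mpixels(16, 16, 425)  # 6800 Ultra

print(f"7800 GTX:   {g70_fillrate} Mpixels/sec")
print(f"6800 Ultra: {nv45_fillrate} Mpixels/sec")
```

That's 6880 against 6800 Mpixels/sec, a gap of just over 1% that comes entirely from the 5MHz clock difference.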

In English, guv?!

I can understand why you wouldn't want to digest all that, so certainly! The basic differences are as follows: Compared to a full NV40, a full G70 can do more MADD-based FP vector math (up to 3 times) per clock. It can fetch larger textures faster (and the L2 texture cache is twice as big) and filter FP textures quicker. It has 33% more vertex units, each with more performance per clock. And it can now anti-alias alpha textures, all gamma correct.

TSMC's 110nm process is used to cram in 80 million or so new transistors, mostly spent on more, and improved, fragment hardware, another pair of vertex units and the tweaked ROPs. With the not-the-top-end GeForce 7800 GTX clocked faster than the 6800 Ultra in both core and memory, all that new FP vector ALU power is clocked even faster than it would have been before. And that's where the majority of G70's new performance lies. MADDness (and I bet I'm not the only one to make that joke today!).

And if you're a user who looks at high-end board performance as a means of predicting what you'll get in a low-end board, the G7x architecture supports TurboCache just as NV4x does.

Fancy a peek at a board or two? Yes, that's right, lucky buggers like me got a pair.
