Item: ATI Radeon X1800 XT P
Author: Ryszard Sommefeldt

ATI R520 Overview

As mentioned, this article's coverage of the technology will be light. And when I say light coverage, I mean pretty heavy, but hopefully easy enough to understand. Such is the hardware kung-fu I have available for you.

However, to understand where R520 gets its performance from you should have a brief idea of how it works. Compared to previous 3D processors in recent times, say R300 onwards and including anything NVIDIA have produced since then, R520 uses the same basic building blocks that define a 3D processor in silicon, but joins them and feeds them differently. So there's still separate vertex and pixel shader units, a memory controller and memory bus, and picture output hardware to draw the image on your screen.

They're put together a little differently, though. Outwardly, R520 is essentially, and ATI will hate me for saying so, little more than a highly clocked R480. The basic pixel processing hardware is pretty much the same in terms of the arithmetic work it can do on a given pixel, per clock. There are two more vertex units than before, but all share the same basic vertex processing capabilities as R480's vertex shader hardware.

A new pixel thread scheduler

The big differences start with the front end to the pixel processors, which decides how, what, in what order and when pixels are to be processed by each of the pixel shader processors. The hardware implements a thread scheduler to process batches of pixels in very small groups. Batches of pixel quads (2x2 pixel groups), in groups of 4 (one for each of the four quad pixel units on the hardware, giving 16 pixels in total), are processed in a four-deep fashion. The hardware queues up these 4x4x4 pixel groups to send to the hardware in very small threads of 16 pixels each, queued 4 deep in the scheduler.

This is in stark contrast to previous generations of hardware, where the batch size for sent pixels was on the order of thousands rather than in double digits. As with any scheduled processor, if you have to switch application context threads you pay a cost to do so in terms of processor cycles, and in a 3D processor, flushing out thousands of pixels will take many thousands of cycles to complete. With R520, the small batch size lets a full app context switch happen quicker, along with single cycle thread switching for pixel threads.

ATI argue that in a Shader Model 3.0 GPU, where one of the main new things that SM3.0 brings is dynamic branching in the pixel shader, a small pixel batch size is key to offering good branching performance for that feature. The scheduler in a full R520 manages 512 of those small pixel batches, as threads, at any one time. Efficiency, if done right, can go up in that respect, both in terms of branching in your shader program and overall when keeping the pixel shader units busy.
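To see why batch size matters for branching, here's a toy model in Python. It is a sketch under loud assumptions, not a description of ATI's scheduler: every pixel in a batch runs in lockstep, so a batch whose pixels disagree on a branch pays for both sides. The function name, the one-dimensional pixel layout and the cycle costs are all invented for illustration.

```python
def branch_cost(takes_branch, batch_size, cost_if=10, cost_else=2):
    """Pixel-cycles spent shading a row of pixels under a lockstep toy model.

    takes_branch: list of bools, one per pixel (True = expensive path).
    cost_if / cost_else: invented per-pixel cycle costs for each path.
    """
    total = 0
    for start in range(0, len(takes_branch), batch_size):
        batch = takes_branch[start:start + batch_size]
        if any(batch):       # at least one pixel needs the expensive path,
            total += len(batch) * cost_if    # so the whole batch runs it
        if not all(batch):   # at least one pixel needs the cheap path,
            total += len(batch) * cost_else  # so the whole batch runs that too
    return total


# 4096 pixels, of which only the first 64 need the expensive path.
pixels = [i < 64 for i in range(4096)]
small = branch_cost(pixels, batch_size=16)    # R520-style 16 pixel threads
large = branch_cost(pixels, batch_size=4096)  # one batch in the thousands
```

With these made-up numbers the 16 pixel batches cost 8,704 pixel-cycles, which is exactly what a perfect per-pixel branch would cost, while the single large batch costs 49,152 because every pixel pays for both paths.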

New capability for the pixel shaders

When actually processing the pixel threads, the pixel shader units have grown in processing precision. Whereas R480 used 24-bit internal precision, R520 and friends are 32-bit across the board, in all processing modes. On something like NV40, you'll see the pixel hardware has many possible precision modes it can operate in when processing pixels. R520's pixel shader core was designed from the ground up to not need any partial or reduced precision modes at all, sticking to FP32, or 32-bit floating point, throughout the entire pixel unit.

In terms of hardware instructions the pixel shader units are able to issue, the pixel hardware is still resolutely 'vec3 MAD and scalar, plus vec3 ADD and scalar'. In layman's terms, the pixel units each have the ability to process a MAD and ADD instruction in parallel with each other, on vectors that have three components, along with two scalar instructions. Those '4D' processors (3 parts vector, 1 part scalar), of which there are two per pixel unit, are what operate on pixels in the full FP32 precision mentioned above.
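If the 'vec3 plus scalar' wording is opaque, a tiny Python sketch might help. It only illustrates the shape of the work described above: a three-component MAD (multiply-add) with a scalar op next to it, plus a three-component ADD with another scalar op. Which scalar ops the hardware can really pair up isn't stated here, so the scalar MAD and ADD below are assumptions, as are all the operand names and values. Integers keep the sums exact; the real hardware works in FP32.

```python
def mad(a, b, c):
    """Multiply-add on a single component: a * b + c."""
    return a * b + c


def add(a, b):
    """Add on a single component."""
    return a + b


def vec3(op, *operands):
    """Apply a per-component op across three-component vectors."""
    return tuple(op(*parts) for parts in zip(*operands))


# Invented operands, purely for illustration.
colour, light, ambient = (1, 2, 3), (2, 2, 2), (1, 1, 1)
fog, haze = (4, 5, 6), (1, 1, 1)

# What one pixel unit could issue in a single clock under this model:
mad_rgb = vec3(mad, colour, light, ambient)  # vec3 MAD
mad_a = mad(5, 2, 1)                         # scalar op issued alongside it
add_rgb = vec3(add, fog, haze)               # vec3 ADD
add_a = add(3, 4)                            # scalar op issued alongside it
```

That's four results per clock from one pixel unit: two vector, two scalar.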

Don't worry if you don't understand what a vector or scalar is in terms of a graphics processor. If you want to learn that, I'll explain it all in the second article should you fancy some further reading. The upshot of it all is that those capabilities determine the shader rate of the new chip. In years gone by, 3D hardware was usefully ranked in terms of texture rate. It's now shader rate that matters. Here's why.

Decoupled texture processors

In most architectures, even recent ones, texturing - the ability to address a texture you want to use to paint a pixel, and sample from it to apply the paint - is closely coupled to the majority of the 3D processing hardware in a 3D chip. In something like ATI R480, or NVIDIA NV40, the silicon that can texture is part of the pixel shader hardware. A texture address processor and texture sampler are combined to form a texture unit that each pixel unit can access on its own, in private. No texture unit from one pixel unit can be used by another.

In R520 that's different. The chip maintains the same count of texture address and texture sample units, but removes their tie to the pixel hardware. Pixel threads that need to texture can do so independently. In addition, texturing operations can be scheduled independently of pixel math ops (the vector and scalar processing discussed above). In a 3D chip, texture operations are often something that takes many more cycles than math ops.

So to hide that latency in waiting for a result from the texture op, you can schedule a bunch of arithmetic operations alongside it in parallel to hide the texture sampler latency. Do some more work while you're waiting for other work to finish, so you're not idle, right? While a traditional pixel unit can do so, it can only do so in its own little space. R520 can schedule texture sampling independently of the math ops being generated by the pixel threads. So there's an extra layer of abstraction in order to more effectively schedule and hide sampler time.
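The latency hiding idea can be sketched with a toy cycle count in Python. This is a simplified model under stated assumptions rather than how R520 really schedules work: each thread issues one texture fetch and then needs a block of math that depends on the result, a new fetch can be issued every cycle, and there's a single ALU. The function names and the latency figures are invented.

```python
def serial_cycles(threads, tex_latency, alu_ops):
    """No overlap: each thread waits out its fetch, then does its math."""
    return threads * (tex_latency + alu_ops)


def interleaved_cycles(threads, tex_latency, alu_ops):
    """Fetches are issued back to back; the ALU works on whichever
    thread's texture result has already arrived."""
    alu_free = 0  # cycle at which the ALU next becomes idle
    for i in range(threads):
        ready = i + tex_latency       # fetch i was issued at cycle i
        start = max(ready, alu_free)  # wait for the data and for the ALU
        alu_free = start + alu_ops
    return alu_free


serial = serial_cycles(threads=8, tex_latency=100, alu_ops=20)
overlapped = interleaved_cycles(threads=8, tex_latency=100, alu_ops=20)
```

With these invented figures the serial case takes 960 cycles and the overlapped one 260, since only the first fetch's latency is left exposed.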

Easy stuff, I hope! Do more useful stuff while other stuff is waiting to finish, in your pixel hardware. In terms of quality, the hardware supports better anisotropic filtering quality which I'll explain in more detail in the technology discussion.

Pixel output hardware changes

R520 when compared to R480 also has extra abilities when outputting pixels to your display. The differences mainly lie in how the chip can make the image look better. There's improved antialiasing (although not in all ways, which I'll explain in the other article) and resolved pixels are pushed out to your screen via the Avivo display and video engine I took a look at for HEXUS in this article not too long ago.

Feeding everything with a new memory controller

Along with the thread scheduler, the cornerstone of the R520 in terms of performance is its brand new memory controller. The memory controller operates in a ring fashion with two 256-bit wide rings in two directions. Eight 32-bit wide memory channels connect the memory controller to the memory chips on the board via four 'ring stops', basically points on the bus connected to both rings, two channels to a ring stop. There's a fifth ring stop that offers memory access to things like the PCI Express bus.

Anything that wants to connect to the bus is called a client, with the memory controller offering 8 client interfaces (each with one read path, one write path) for memory access. Depending on where on the ring the client request for memory access comes in, be that read or write, the request is passed round the ring in the shortest path to the memory channel that contains the data location to read or write from. So the longest path from any client to any memory channel will only be half of the ring, with half of the ring to return the result to the client.
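The shortest path rule is easy to sketch in Python. This is an abstract model, not the real controller: it treats the ring as a loop of numbered stops, assumes a request enters at one stop and leaves at another, and picks whichever of the two directions needs fewer hops. The function name and the stop numbering are made up.

```python
def ring_hops(src, dst, stops):
    """Fewest hops between two stops on a bidirectional ring."""
    one_way = (dst - src) % stops          # hops going round one direction
    return min(one_way, stops - one_way)   # or take the other ring if shorter


STOPS = 5  # four memory ring stops plus the fifth one mentioned above

# The worst case over every pair of stops never exceeds half the ring.
worst = max(ring_hops(s, d, STOPS) for s in range(STOPS) for d in range(STOPS))
```

With five stops the worst case is two hops, which is the 'only half of the ring' property in miniature.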

There's a bit more to it than that, including cache requests and layouts, clock scaling and a bunch of other stuff that I'll outline in the technical analysis. In short, the ring bus allows almost the full bandwidth available from the memory configuration when streaming data from the memory chips, due to its design.

R520 in numbers

For most, the easiest way to look at a GPU is in terms of numbers. Here are some useful metrics to help you compare R520 to R480 and NVIDIA G70. I'll use the top retail SKUs for each to do so, so Radeon X1800 XT, Radeon X850 XT and GeForce 7800 GTX. I've emboldened the 'best' value, where applicable.

Metric / GPU SKU	ATI Radeon X1800 XT	ATI Radeon X850 XT	NVIDIA GeForce 7800 GTX
Process	90nm @ TSMC	130nm @ TSMC	110nm @ TSMC
Core clock	625MHz	500MHz	430MHz
Memory interfaces	2 x 256-bit bi-ring, 256-bit 8-channel external	256-bit crossbar, 256-bit 4-channel external	256-bit crossbar, 256-bit 4-channel external
Memory frequency	750MHz GDDR3	500MHz GDDR3	600MHz GDDR3
Pixel units	16	16	24
Vertex units	8	6	8
ROPs	16	16	16
Theoretical Shader Rate (main ALUs only)	20G inst./sec	16G inst./sec	20.6G inst./sec
Theoretical Texel Rate	10Gtexels/sec	8Gtexels/sec	10.3Gtexels/sec
External Memory Bandwidth	48GB/sec	32GB/sec	38.4GB/sec
Board Memories	512MiB (8 x 512Mb GDDR3)	256MiB (8 x 256Mb GDDR3)	256MiB (8 x 256Mb GDDR3)
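The theoretical rates in the table fall straight out of the clocks and unit counts, as this Python sketch shows. It assumes two main ALU instructions per pixel unit per clock (the MAD and ADD pairing described earlier), one texel per pixel unit per clock, and double data rate GDDR3 on a 256-bit external bus. The function names are mine, and the bandwidth comes out in decimal gigabytes per second (10^9 bytes).

```python
def shader_rate(core_mhz, pixel_units, alus_per_unit=2):
    """Main ALU instructions per second, in billions."""
    return core_mhz * pixel_units * alus_per_unit / 1000


def texel_rate(core_mhz, texture_units):
    """Texels per second, in billions."""
    return core_mhz * texture_units / 1000


def memory_bandwidth(mem_mhz, bus_bits=256, transfers_per_clock=2):
    """External memory bandwidth in decimal gigabytes per second."""
    return mem_mhz * transfers_per_clock * bus_bits / 8 / 1000


# (core MHz, pixel units, memory MHz), taken from the table above.
skus = {
    "Radeon X1800 XT": (625, 16, 750),
    "Radeon X850 XT": (500, 16, 500),
    "GeForce 7800 GTX": (430, 24, 600),
}

rates = {
    name: (shader_rate(core, units), texel_rate(core, units), memory_bandwidth(mem))
    for name, (core, units, mem) in skus.items()
}
```

For Radeon X1800 XT that gives 20.0, 10.0 and 48.0, and for GeForce 7800 GTX 20.64, 10.32 and 38.4, matching the table once rounded.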

Hardware Summary

If there's one word to define R520 from a performance point of view, it'd be efficiency. The new memory controller and small batch scheduler for pixel threads are designed to keep the hardware as busy as it possibly can. Work done per cycle is always maximised wherever possible, branching performance is improved for Shader Model 3.0 and keeping the hardware more busy, more often, is the mantra.

You'll notice that Radeon X1800 XT gives up peak shader rate to GeForce 7800 GTX but dominates it in terms of memory bandwidth. X850 XT is outclassed totally due to clock rates.

Then there's the image quality, something to go over in the technical evaluation. R520 and friends bring the possibility for significantly enhanced 2D and 3D image quality compared to anything else available. Teasingly, I'll save most of that for the tech eval, though. The biggest thing we'll discuss there is the ability for R5-series GPUs to multisample floating point rendertargets. High dynamic range rendering and antialiasing can coexist if the application author is suitably smart.

Enough chatter in text, fancy feasting your eyes on some pics? Of course you do.