RV515
If you take and understand R520 as the base architecture, thinking about RV515 is really quite easy. Pretty much one quarter of an R520 in terms of functional units and processing resources, RV515 starts with two vertex shader units.
Vertex and fragment processing
Each vertex shader unit can issue one vec4 instruction and one scalar instruction per cycle, making the unit '5D' wide.
The dual-issue ALU can issue an FP32 MADD across the vector portion of the hardware, giving it the ability to perform a full 4x4 vertex transformation in a single cycle, as is common with all programmable VS hardware. Being ShaderModel 3.0 compliant, it supports dynamic flow control and more temporary space than a VS2.0-compliant unit. However, ATI don't attach texture address and sampler ability, including register space and other logic, to the VS hardware in R5-series hardware, meaning the hardware doesn't support what's commonly called vertex texture fetch.
Moving on to the fragment hardware, RV515 has one 'pixel shader', made up of four fully FP32 PS3.0 ALUs, more commonly called a fragment quad. These shader ALUs are 4D wide (vec3 + scalar) and made up of two sub units. Sub unit one can issue a vec3 ADD and a scalar instruction per cycle, in tandem with sub unit two, which can itself issue a vec3 MADD and a scalar instruction in the same cycle, dependent on the input from sub unit one.
Being PS3.0 compliant, the fragment processors also support flow control and looping via a dedicated branch unit. Texture processing wise, RV515 has four texture address units paired with four texture sample units, to create four complete texture processors, each capable of a single-cycle bilinear texture filter. Instructions to the texture hardware can be issued separately from any instruction processing taking place in the fragment hardware. While there's a one-to-one mapping of texture processors to fragment processors, the disconnect in processing affords ATI the ability to hide texturing latency via use of an efficient resource scheduler.
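For readers who want the bilinear filter spelled out, here is a minimal software sketch of what one texture processor computes for a single sample. It's an illustration only: the function name `bilinear_sample`, the list-of-rows texture layout and the clamp-to-edge addressing are all assumptions, and nothing here models how the hardware actually does it in a cycle.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between a and b by weight t in [0, 1]."""
    return a + (b - a) * t

def bilinear_sample(texture, u, v):
    """Bilinearly filter a single-channel texture at texel coordinates (u, v).

    texture is a list of rows of floats (hypothetical layout).
    Coordinates are clamped to the texture edge (an assumed address mode).
    """
    height, width = len(texture), len(texture[0])
    u = min(max(u, 0.0), width - 1)
    v = min(max(v, 0.0), height - 1)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # Two horizontal lerps followed by one vertical lerp over the 2x2 footprint.
    top = lerp(texture[y0][x0], texture[y0][x1], fx)
    bottom = lerp(texture[y1][x0], texture[y1][x1], fx)
    return lerp(top, bottom, fy)

tex = [[0.0, 1.0],
       [2.0, 3.0]]
print(bilinear_sample(tex, 0.5, 0.5))
```

The 2x2 footprint and three lerps are the whole job; the point of the scheduler is that the fragment ALUs can be doing something else while those four texels are fetched.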
Ultra-threading in RV515
We've yet to go really in-depth on the 'Ultra-Threaded' nature of R520, RV515 and RV530, but suffice to say the new scheduling and allocation hardware has the ability to more effectively use the execution resources of the chip.
RV515 maintains the same fragments-in-flight count as R520, per quad. Each of its 128 possible threads controls a quad of fragments at a time, four deep, for a total of 16 fragments per thread.
Branching penalty is minimised in the R5-series of GPUs by using the branch controller to schedule the correct instruction group for the thread being processed. If needed, both branches are computed for all fragments, with a write mask ensuring that only the correct results are written to the GPU's register array. Processing fragments in such small batches means the compute time wasted processing two branches is minimised. On other hardware the actual branch penalty is also low, in the order of a few cycles, but the batch size is not.
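The write-mask idea and the batch-size argument can be sketched in a few lines of software. This is a toy model, not the hardware's scheduler: `run_branch` and `fragments_running_both_paths` are hypothetical names, the batch of 16 follows the quad-times-four-deep figure above, and the batch of 64 used for comparison is a made-up number rather than a claim about any particular competing chip.

```python
def run_branch(batch, predicate, then_fn, else_fn):
    """Evaluate an if/else over a batch that shares one instruction stream.

    If the batch diverges, both paths run for every fragment and a
    per-fragment write mask picks which result is kept.
    Returns (results, number_of_paths_executed).
    """
    mask = [predicate(x) for x in batch]
    passes = 0
    then_out = else_out = None
    if any(mask):                      # at least one fragment takes 'then'
        then_out = [then_fn(x) for x in batch]
        passes += 1
    if not all(mask):                  # at least one fragment takes 'else'
        else_out = [else_fn(x) for x in batch]
        passes += 1
    results = [then_out[i] if m else else_out[i] for i, m in enumerate(mask)]
    return results, passes

def fragments_running_both_paths(mask, batch_size):
    """Count fragments that sit in a divergent batch and so pay for both paths."""
    total = 0
    for start in range(0, len(mask), batch_size):
        chunk = mask[start:start + batch_size]
        if any(chunk) and not all(chunk):
            total += len(chunk)
    return total

# 64 fragments, only the last one takes the other side of the branch.
mask = [True] * 63 + [False]
print(fragments_running_both_paths(mask, 16))  # small batches
print(fragments_running_both_paths(mask, 64))  # hypothetical large batch
```

With one stray fragment, only its own batch pays for both paths, so the smaller the batch, the fewer fragments do wasted work.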
However, maintaining such a large pool of threads per fragment quad means a large register file to hold thread state. RV515, like R520, maintains space for two registers per fragment, per thread. Maintaining a quarter of R520's threads therefore means that space is a quarter of the size, at 64KiB.
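The 64KiB figure can be reproduced with some quick arithmetic, under two assumptions that aren't spelled out above: that each register is a four-component FP32 vector (16 bytes), and that the 16 fragments are counted per thread. R520's 512 threads is simply four times RV515's 128, following the quarter-size relationship. The names below are for illustration only.

```python
FRAGMENTS_PER_THREAD = 16        # a quad of fragments, four deep (assumed per thread)
REGISTERS_PER_FRAGMENT = 2       # from the text
BYTES_PER_REGISTER = 4 * 4       # assumption: vec4 of 32-bit floats

def register_file_bytes(threads):
    """Register space needed to hold state for the given number of threads."""
    return (threads * FRAGMENTS_PER_THREAD
            * REGISTERS_PER_FRAGMENT * BYTES_PER_REGISTER)

rv515_kib = register_file_bytes(128) // 1024
r520_kib = register_file_bytes(4 * 128) // 1024
print(rv515_kib, "KiB for RV515;", r520_kib, "KiB for R520")
```

Under those assumptions the numbers line up: 128 x 16 x 2 x 16 bytes is exactly 64KiB, and four times the threads gives 256KiB.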
The register space also has access to the lerp hardware the rasteriser uses to generate fragments, to interpolate between registers when fragment processing. In the case of register file pressure, the hardware will reduce active thread count until the pressure backs off, so that register space is always usable.
So to sum up, there's an increase in die space because of the new way to maintain threads and maintain switching performance, but a negligible branch and management penalty results, should a thread need to be swapped on to or off of the single quad.
Don't worry if none of that makes sense; just be aware that R5-series hardware maintains tiny pixel batches, mainly to help branching, and that thread scheduling is used to hide texturing latency.
Drawing pixels
RV515 has four R520-class ROPs, each capable of 6x anti-aliasing (six subpixel samples, two per cycle), single Z-only rate and output of two complete pixels per cycle. Additionally, the ROP hardware can multisample from 16-bit floating point surfaces maintained and supported by the chip. Multisampling of FP surfaces allows antialiasing when the final rendering surface is floating point, something commonly used to make programming for higher dynamic colour ranges easier.
Memory bus
While R520 has a dual ring bus memory controller, 256-bits wide per ring, RV515 uses a conventional crossbar controller, 32-bits wide per crossbar partition and 128-bits wide in total. The ring-bus controller is some 30M transistors on R520, so a complete transplant into the 100M transistor RV515 (and 'into' is correct, since the controller transistors are largely in the die centre) would increase its size by at least 20%.
Summary and die shot
RV515 also features, for the first time this hack can remember on the lowest end ATI part, their hierarchical-Z buffer. As far as texture cache size goes, the smart money is on 8KiB.
Lastly, the hardware implements a method to bilinearly fetch texels from single channel surfaces and pack the fetch into a four channel result, something not implemented in R520!
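One plausible reading of that last feature, sketched in software: rather than blending the 2x2 footprint a bilinear fetch would touch, the four single-channel texels are returned unfiltered, one per channel of the result. The function name `fetch_four`, the clamp-to-edge addressing and the order the texels land in the channels are all assumptions for illustration; the article doesn't specify them.

```python
def fetch_four(texture, x, y):
    """Return the 2x2 texel footprint at integer texel (x, y), packed as a 4-tuple.

    texture is a single-channel surface stored as a list of rows (hypothetical
    layout). Channel order here is (top-left, top-right, bottom-left,
    bottom-right), which is an assumed ordering, not the hardware's.
    """
    height, width = len(texture), len(texture[0])
    x0 = min(max(x, 0), width - 1)
    y0 = min(max(y, 0), height - 1)
    x1 = min(x0 + 1, width - 1)      # clamp to edge (assumed address mode)
    y1 = min(y0 + 1, height - 1)
    return (texture[y0][x0], texture[y0][x1],
            texture[y1][x0], texture[y1][x1])

depth = [[1.0, 2.0],
         [3.0, 4.0]]
print(fetch_four(depth, 0, 0))
```

Getting all four neighbours back from one fetch lets the shader apply its own weighting or comparison to them, instead of issuing four separate point-sampled fetches.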
|ATI RV515 GPU Properties|
|Process and Fabricator||TSMC 90nm|
|Transistor Count||100 million|
|DirectX Shader Model||ShaderModel 3.0|
|Basic Configuration (VP/FP/ROP)||2/4/4|
|Vertex Shader Info||VS3.0 (no texld); 5D FP32, co-issue MADD, branch; single cycle trig functions|
|Fragment Processor Info||PS3.0; 4D FP32, dual-issue ADD+MADD, branch; single cycle trig functions|
|ROP Info||6x FP MSAA (2 subsamples/cycle); 1x Z-only rate|
|Texture processing||4 FP32 address units, 4 samplers; bilinear filter for integer samples|
|Memory Interface||128-bit, 4 x 32-bit partitions, crossbar|
|Display output||2x dual-link DVI TMDS transmitters|