R580: Memory Controller, Caches and ROPs
Memory Controller
R580 shares the same full 512-bit internal, 256-bit external ring bus memory controller as R520. Here's a short overview of how it works.
The ring bus controller allows multiple client interfaces to issue memory requests. Writes to the DRAMs go via a crossbar switch, which arbitrates write access to the correct DRAM device.
Read requests traverse the ring intelligently, at least as far as the memory controller can govern them through its programmable interface. Because the memory controller 'knows' where each broad block of data is and stores addresses for those blocks, it sends requests round the shortest path to each of the five ring stops present on the bi-directional ring.
Four ring stops are for the DRAM devices themselves, which connect to their stop in pairs. The fifth is for general I/O to things like the PCI Express bus and by extension ATI HyperMemory, allowing the memory controller to address those resources properly.
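To make the shortest-path idea concrete, here is a minimal sketch of picking a direction on a five-stop bi-directional ring. The stop numbering (0 to 4), the function name and the tie-break towards clockwise are all assumptions made for illustration; the article doesn't describe how the real controller encodes its routing.

```python
from typing import Tuple

# Hypothetical model: five ring stops numbered 0..4 around the ring
# (four DRAM stops plus one I/O stop, in an assumed order).
RING_STOPS = 5

def shortest_route(src: int, dst: int, stops: int = RING_STOPS) -> Tuple[str, int]:
    """Return (direction, hop_count) for the shorter way round the ring."""
    clockwise = (dst - src) % stops       # hops when travelling 0 -> 1 -> 2 ...
    anticlockwise = (src - dst) % stops   # hops when travelling 0 -> 4 -> 3 ...
    if clockwise <= anticlockwise:        # assumed tie-break: prefer clockwise
        return ("clockwise", clockwise)
    return ("anticlockwise", anticlockwise)
```

With five stops on a bi-directional ring, no request in this model ever travels more than two hops, whichever stop it starts from.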
We'll go into more depth in a separate piece, but suffice to say the memory controller was designed for flexibility, reduced latency and scalability in terms of clock rate. Wire density is reduced because of the controller's layout, and only cost holds back the external interface, with ATI showing the full double-wide variant in R520 and thus R580.
Caches
One thing we haven't touched on in our evaluation of R5-series hardware to date is the hardware's new cache design and layout properties. Texture cache on R5-series chips is fully-associative, giving a complete map of the memory subsystem, via the memory controller, for that cache to map into. Texture cache misses are therefore lessened, with texture data able to come from anywhere in card or mapped memory. The L1 texture cache in R520 and R580 (there is no secondary retirement cache in R5-series hardware) is likely 32KiB in size.
Further, depth and stencil data is cached in the same fully-associative way, and it seems that colour buffer access is also cached directly by a fully-associative cache of its own.
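As a rough illustration of what fully-associative means in practice, the sketch below models a cache in which any memory line can occupy any slot. The 64-byte line size and the LRU replacement policy are assumptions chosen for the example, since neither is stated for R5-series hardware. The 32KiB default simply mirrors the likely size quoted above.

```python
from collections import OrderedDict

class FullyAssociativeCache:
    """Toy fully-associative cache: any line may live in any slot."""

    def __init__(self, capacity_bytes: int = 32 * 1024, line_bytes: int = 64):
        self.line_bytes = line_bytes                 # assumed line size
        self.num_lines = capacity_bytes // line_bytes
        self.lines = OrderedDict()                   # tag -> True, least recent first
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Touch an address; return True on a hit, False on a miss."""
        tag = address // self.line_bytes             # the whole line address is the tag
        if tag in self.lines:
            self.lines.move_to_end(tag)              # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)           # evict least recently used (assumed policy)
        self.lines[tag] = True
        return False
```

Because there is no set index, two addresses that sit far apart in memory never evict each other simply for mapping to the same slot. For example, in a two-line instance of this model, touching address 0, then address 1 << 20, then address 0 again ends in a hit, where a two-line direct-mapped cache would miss.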
ROPs
The raster output hardware is the final piece of the 3D puzzle for R580. Each ROP, of which there are sixteen, can do two Z-writes per cycle (sustainable with MSAA on, to boot) or combined Z and colour. They also have the same programmable pixel sub-sampling for multisample antialiasing (MSAA, up to 6x) as R4-series GPUs, but can also pair that with sub-sampling of alpha textures to effectively antialias those surfaces, too.
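The peak Z rate implied by those figures is simple arithmetic, sketched below. The core clock is deliberately left as a parameter because no clock is given here, and the 500MHz used in the comment is a placeholder rather than the clock of any real SKU.

```python
ROPS = 16                        # ROP count quoted for R580
Z_WRITES_PER_ROP_PER_CLOCK = 2   # Z-only rate quoted per ROP

def z_writes_per_clock() -> int:
    """Peak Z-only writes across all ROPs in a single clock."""
    return ROPS * Z_WRITES_PER_ROP_PER_CLOCK

def peak_z_rate_mwrites(core_clock_mhz: float) -> float:
    """Peak Z-only writes in millions per second for a given core clock (MHz)."""
    return z_writes_per_clock() * core_clock_mhz

# Placeholder clock, not a real SKU figure: 500 MHz gives 32 * 500 million writes/s.
example = peak_z_rate_mwrites(500.0)
```

This is a theoretical ceiling only, as real throughput also depends on memory bandwidth and how well the Z data compresses.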
Additionally, and to create one of the defining features of the R5 generation of hardware, the ROP can also multisample from 'HDR' surface formats. 64-bit per pixel surfaces - integer (FX16) or floating-point (FP16) - can be sampled from, along with 32-bit FX10 (10bpc RGB, 2-bit alpha), L16 (16-bit integer, luminance only) and combinations of those depending on what the developer needs.
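To show how the FX10 format fits three 10-bit colour channels and a 2-bit alpha into 32 bits, here is a small packing sketch. The bit order (alpha in the top two bits, then blue, green and red) is an assumption for illustration, and the function names are hypothetical. The hardware's actual layout isn't described here.

```python
from typing import Tuple

def pack_fx10(r: int, g: int, b: int, a: int) -> int:
    """Pack 10-bit R, G, B and 2-bit alpha into one 32-bit word (assumed layout)."""
    for name, value, limit in (("r", r, 1023), ("g", g, 1023), ("b", b, 1023), ("a", a, 3)):
        if not 0 <= value <= limit:
            raise ValueError(f"{name}={value} out of range 0..{limit}")
    # Assumed bit order: [31:30] alpha, [29:20] blue, [19:10] green, [9:0] red.
    return (a << 30) | (b << 20) | (g << 10) | r

def unpack_fx10(word: int) -> Tuple[int, int, int, int]:
    """Inverse of pack_fx10: return (r, g, b, a)."""
    r = word & 0x3FF
    g = (word >> 10) & 0x3FF
    b = (word >> 20) & 0x3FF
    a = (word >> 30) & 0x3
    return (r, g, b, a)
```

Each colour channel gets 1024 levels against the 256 of an 8bpc format, at the same 32 bits per pixel, with alpha cut to four levels to pay for it.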
The 32-bit FX10 format is the full-speed format for 'HDR' MSAA, and the surfaces can be sampled while compressed. It's a feature not present on any competing hardware from NVIDIA and gives ATI a large competitive advantage overall in the image quality stakes. Let's talk about image quality quickly, before moving on to the physical implementation of the first R580-based SKUs.