vacancies advertise contact news tip The Vault
facebook rss twitter

Intel lifts wraps on next-generation Larrabee: should NVIDIA and ATI be worried?

by Tarinder Sandhu on 4 August 2008, 07:58

Tags: Intel (NASDAQ:INTC)

Quick Link: HEXUS.net/qaonf

Add to My Vault: x

The architecture

Overview

Please bear in mind that this is a high-level overview, barely scratching Larrabee's surface. We'll investigate further when Intel spills a few more details.

Larrabee is designed to compete against NVIDIA and ATI (AMD) graphics cards, primarily at the high-end, and is scheduled to be released next year. Intel hopes it will take a large slice of the high-end gaming and GPGPU pie.

Physically looking like any pretty much any other card out there, it will hook-up via a PCIe interface and support a dedicated, high-speed framebuffer. The devil, of course, is in the details.

Essentially, Intel's Larrabee attempts to solve the problem of massive throughput needed for next-generation VPUs by using the good ol' x86 microprocessor in a many-core configuration, along with dedicated logic - co-processors, if you will - where it's deemed to be more efficient at completing set tasks.

Importantly, the Larrabee design includes a flexible, math-intensive 16-wide vector processing unit to increase the number of instructions processed per clock. The unit is special insofar as it can be loaded one lane (instruction) at a time, without the need to fill all 16 slots before execution can take place. 

On first glance, it kind of reminds us a little bit of Sony's Cell processor; insofar it's a one-fits-all architecture, seemingly.



The guts

Specifically, each Larrabee core, and no hard-and-fast numbers were put forward by Sieler, is based on the old in-order Pentium architecture. Intel, though, has added multi-threading, 64-bit support, improved prefetching, and speculative loading to improve per-core performance - and it's these x86 facets that differentiate it from the stream processors in modern GPUs by NVIDIA and ATI.

All cores have access to their own 256KB of L2 cache that, unlike Sony's Cell, are coherent, and grouped together in the pool you see in the block diagram, above.

Inter-CPU communication takes place over a two-way 512-bit ringbus setup, which, on the face of it, is somewhat also similar to what ATI had with its Radeon HD 2900 XT

The fixed-function logic, which kind of goes against the grain of modern programmable GPUs, is used to process parts of the rendering setup that aren't best-run on the cores - texturing being an example highlighted by Sieler. They're not 'attached' to the cores, and thus can be bypassed without compromising the efficiency of the rendering if they're simply not needed.