
GPGPU: far more important than you think. Think 10x a CPU's performance

by Tarinder Sandhu on 13 February 2008, 11:45


Quick Link: HEXUS.net/qalon


ATI and NVIDIA's incessant one-upmanship and desire for market domination have been the main catalysts for the astounding increase in pure power of cutting-edge GPUs (Graphics Processing Units).

Set to a backdrop of industry-standard APIs such as Microsoft's DirectX, GPUs have been transformed from fixed-function compute units to increasingly programmable models, where developers can tap into the ultra-wide architectures as they best see fit.

Modern GPUs leverage the fact that rasterisation - the process by which pixels are painted on your screen - is inherently parallelisable: you chuck data streams down a number of pipes - and the more, the better - wait a while, and then post-process.

This make-it-as-wide-as-possible approach, together with a flexible, programmable architecture, makes NVIDIA's G80/9x and ATI's R(V)6xx thoroughbred GPUs with masses of potential throughput, hiding latency via sheer concurrent computation.

Consider throughput as defined by the basic building block that is the multiply-add rate - in other words, floating-point performance. On that measure, ATI's Radeon HD 3870 X2, released last month, can hit around 1TFLOPS (one trillion floating-point operations per second). Its two GPUs' 640 stream processors (320 apiece) combine for massive parallel computation. Further, relatively massive bandwidth keeps the stream processors chugging along with data, too.

In contrast, though, a high-end quad-core CPU can push out around 60GFLOPS, or one-sixteenth the amount of floating-point power.
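As a rough sanity check on those numbers, the arithmetic can be sketched in a few lines. The per-stream-processor rate (one multiply-add, i.e. two FLOPs, per cycle) and the ~825MHz core clock are assumptions for illustration, not figures from the article:

```python
def peak_gflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Theoretical peak throughput = units x FLOPs/unit/cycle x clock (GHz)."""
    return units * flops_per_unit_per_cycle * clock_ghz

# Radeon HD 3870 X2: two GPUs x 320 stream processors, each assumed to
# retire one multiply-add (2 FLOPs) per cycle at an assumed ~825MHz clock.
gpu = peak_gflops(2 * 320, 2, 0.825)
print(gpu)              # 1056.0 GFLOPS - roughly the 1TFLOPS quoted

# Against the ~60 GFLOPS quoted for a high-end quad-core CPU:
print(round(gpu / 60))  # ~18x - the one-sixteenth-ish ballpark
```

The exact ratio moves with clock speeds and instruction mix, but the order-of-magnitude gap is the point.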

Now, pure floating-point power is only one metric in evaluating the overall performance of a particular kind of processor, but as a finger-in-the-air gauge of just how good a particular SKU will be with parallelisable - mainly scientific - workloads, the FLOPS rating suffices.

These GPU FLOPS monsters are powerful enough to run more than just gaming code at high resolutions and high degrees of image quality. Given the right kind of inherently parallelisable workload, they can be programmed such that performance is significantly better than, say, running the same dataset on the CPU.
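What 'inherently parallelisable' means in practice can be shown with the textbook stream-computing example, SAXPY (y = a*x + y). This is a plain Python sketch, not GPU code, but it has the key property a GPU exploits - every iteration is independent of every other:

```python
def saxpy(a, x, y):
    # Each output element depends only on the same index of the inputs,
    # so on a GPU every iteration could be its own thread running the
    # same multiply-add on its own element - no coordination needed.
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
# -> [12.0, 24.0, 36.0]
```

A CPU walks that loop serially; a wide GPU can, in principle, compute all the elements at once.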

Game developers cite examples such as complex physics calculations and cinema-quality effects, but there is now a nascent industry which seeks to leverage floating-point power for non-graphical purposes. You may know this as GPGPU.

GPGPU (General Purpose (computation) on Graphics Processing Units) looks at a myriad of non-graphic-related tasks and evaluates how they can, if possible, be run faster on GPUs.

Big Business exploits GPU power by utilising them in complex calculations for oil and gas exploration; medical imaging; weather analysis; general computationally-heavy research; and complex engineering modelling, to name but a few.

Writing efficient, tight code for high-performance discrete GPUs - which, remember, are from either NVIDIA or ATI - requires access to the inner workings of the architecture and, really, an easy-to-program language and an abstraction layer that translates code into GPU-talk. Big Business can afford to commission its own libraries and third-party tools, though. Big Business can also afford to buy the FireStream 9170 Stream Processor, pictured below. Note that there's no reference to its being based almost exclusively on the Radeon HD 3870 graphics card, albeit with a larger frame-buffer and 10x the price.

AMD disclosed that its Stream Computing initiative - the term for programming non-gaming-related code on GPUs - will make it easier for Joe Bloggs (with a big brain) to get to the heart of its GPUs and code for GPGPU use by providing a range of open-source compilers (Brook+) and libraries - gratis. These fall under the FireStream SDK (Software Development Kit), which gives the developer low-level access to the workings of the GPU, allowing for granular control, and the tools with which to program. If you know what you're doing, and plenty of bright sparks at scientific institutions and universities do, you can write your own code for ATI cards.

Granted, the research community may well take advantage of the floating-point power on offer now that writing code has been made easier by access 'closer to the metal' - but how does it help you, the average consumer?

As it happens, many media-rich applications' code is wonderfully parallelisable, meaning that, if coded for, additional plug-ins could take the load off the CPU, place it on the broad, broad shoulders of the GPU, and execute in a more timely fashion. As an example, during the CTO conference in Amsterdam, Holland, a high-definition clip was transcoded to MPEG2 HD. Running on a Radeon HD 3870 X2 with the Adobe Premiere plug-in activated, the GPU ran at 4x a quad-core CPU's speed.

Now, it's clear that extensive developer-relations support needs to be fostered and promoted before large software houses bother to take advantage of the GPU's massive horsepower, but we see GPGPU as more than just a fad. Rather, the burgeoning industry needs to be helped along via close collaboration with NVIDIA and AMD.

Transcoding HD material or delving into areas such as voice-recognition is still a time-consuming affair, and any attempt to leverage the GPU for such tasks is nothing but sensible practice. It opens up the door for AMD to sell FireStream-branded cards at a wonderful premium, too.

Talking to Mike Houston, senior software architect for AMD's Stream Computing initiative, we learned that ATI has consciously designed present and upcoming GPUs with GPGPU usage firmly in mind. The Radeon HD 3870 is outfitted with double-precision floating-point support - a feature that's not required when executing gaming code. Rather, it's useful in mission-critical instances where computational accuracy outweighs pure speed.
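The difference double precision makes can be sketched in a few lines. Here Python's struct module is used to round a value to IEEE-754 single precision, as a single-precision-only GPU would store it:

```python
import struct

def to_single(x):
    # Round-trip through a 32-bit float to emulate single precision.
    return struct.unpack('f', struct.pack('f', x))[0]

third = 1.0 / 3.0        # held as a 64-bit double by Python
print(third)             # 0.3333333333333333
print(to_single(third))  # 0.3333333432674408 - error from the 8th digit
```

For a game's pixel colours that error is invisible; accumulated over millions of iterations of a scientific simulation, it can be anything but.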

So the next time you look at the Radeon HD 3870 or NVIDIA GeForce 8800 GT in your machine, remember that it's more than just a graphics card; it's a floating-point monster that will increasingly be used for non-graphical tasks. We will see media-rich applications harness the TFLOPS on offer, so any initiative made to expedite and facilitate that process can only be seen as a good thing.

TeraFLOPs, who needs 'em? The answer is everyone. Bear that in mind when transcoding a two-hour HD feed or running Folding@Home.

HEXUS Forums :: 17 Comments

Done right is the key. There are a whole bunch of problems with GPGPU, not least of which is the low accuracy of numbers (32-bit floats only, with no guarantees how things will be rounded).
Only 10x?

The R580 can do operations 40-50x faster than a CPU provided the problem is something that a shader can solve.
It's interesting because only a couple of years back we were told what an amazing difference these physics co-processors would make. Now the problem comes when you can't express the problem in a way that makes it easy to leverage these co-processors.

It's hard enough most of the time to make something that can be parallelised across two processors, let alone 16.

Also, as hex said, rounding is odd with GPUs. Take something we all should know - 0.1 + 0.2 != 0.3 - but with GPUs, it could be a slightly different rounding.
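(The classic demonstration of that point, runnable in any IEEE-754 double-precision environment such as Python:)

```python
# 0.1 and 0.2 have no exact binary representation, so even in standard
# double precision their sum is not exactly 0.3.
print(0.1 + 0.2)            # 0.30000000000000004
print(0.1 + 0.2 == 0.3)     # False

# Which is why floating-point comparisons are usually done within a
# tolerance rather than with ==:
print(abs((0.1 + 0.2) - 0.3) < 1e-9)   # True
```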

Take a look at:
Programming in the Age of Concurrency: The Accelerator Project
This allows you to use the pixel shader, via DirectX, from .NET code.

Then try and think how you could use that for the average development task.
Very interesting - I could see some applications for the tech in the work I do. I did pick up a book on shader programming, but I must admit I'd much rather have a tool that scaled the code onto the hardware automagically.
I'm currently in the process of optimizing OpenFOAM with CUDA, it really does require some thought and understanding of the architecture and application domain.

Some algorithms happily lend themselves to the parallel architecture of the GPU and are fairly good to execute there. Others clearly don't, given the limited actual buffer space despite the huge transfer speeds of the ring-bus memory system. Also, with the G9x series coming out in the near future (GeForce 9xxx), 64-bit precision won't be a huge issue either.

I'm finding a lot of algorithms don't even need to fully occupy the ALUs (a warp, as it's known in the CUDA world) to benefit entirely from the GPU - a lot of operations in CFD applications are bandwidth-restricted rather than processing-power-restricted. For example, benchmarking OpenFOAM on a quad-core system that shares the bus (e.g. current Intel ones) shows a nonlinear increase in performance at 3 cores, as the cores fight for bus access.

Here's an example of which algorithms clearly perform well on the GPU architecture to date: NVIDIA CUDA SDK Code Samples

As for 64-bit on current hardware (G82 etc), it's so damn powerful that, should you manage to spawn off enough threads per second to occupy the ALUs, it would easily overcome the extra cycles required to calculate at 64-bit precision.

As for Microsoft Accelerator, you'll want to skip that: it optimizes too generically for the architecture due to DX limitations. Interfaces such as CUDA (NVIDIA) / Close To Metal (ATI research) allow far better access to the chip, with fewer performance penalties per thread.

Finally, there are papers on IEEE 754 rounding on GPGPU architectures which can help mitigate the total effect on your application; have a look within the ACM for these.