After a year of less-than-stellar CPU launches (Bulldozer) and the fallout from massive company-wide layoffs and manufacturing delays, AMD and its fans finally have something to cheer about this holiday season. Built on the radical 28nm 'Graphics Core Next' architecture and numerous other innovations, the US$549 Radeon HD 7970 flexes its considerable muscles in the high-end enthusiast GPU market. In the first of a three-part review, supercomputing expert and in-house Serbian Nebojsa Novakovic gives his take on what the new architecture will bring.
When AMD unveiled GCN (Graphics Core Next), its new-generation graphics processor architecture, as part of FSA (Fusion System Architecture) at last June's Fusion Developer Summit in Seattle, it came as quite a surprise: the next-in-line GPU was to be transformed into a general-purpose compute coprocessor, with far more flexibility, aimed at a much wider range of applications than graphics alone and, not to forget, with easier memory space sharing with the CPU. Half a year later, the first implementation of this new architecture is already seeing the light of day.
The Radeon HD 7970 is the fastest single-GPU card today in most 3-D graphics tests, but its compute power impresses too: 3.8 TFLOPS peak single-precision FP and, more importantly, 947 GFLOPS double-precision FP at the default 925 MHz clock. Since the card easily overclocks by another 10% or so without any voltage changes, you can, for the first time, easily have a true (peak) Teraflop DP FP engine in the PC. The 384-bit GDDR5 memory path with 3 GB of RAM gives some 260 GB/s of bandwidth to feed all that performance, with enough local memory for larger datasets than before. And, in those moments when the GPU does nothing, idle power drops to below 3 Watts, which is not bad at all.
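As a rough sanity check on the figures above, the short Python sketch below works out the memory bandwidth and the overclocked double-precision peak. The bus width, the 947 GFLOPS figure and the 10% overclock come from the text; the 5.5 Gbps per-pin GDDR5 data rate is an assumption that is not stated in this article.

```python
# Back-of-envelope check of the HD 7970 figures quoted in the text.
BUS_WIDTH_BITS = 384      # from the article
DATA_RATE_GBPS = 5.5      # ASSUMPTION: effective GDDR5 rate per pin, not given in the article
DP_PEAK_GFLOPS = 947.0    # from the article, at the default 925 MHz
OVERCLOCK = 0.10          # "another 10% or so", from the article


def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbps


def overclocked_peak(peak_gflops, overclock):
    """Peak throughput scales linearly with the core clock."""
    return peak_gflops * (1 + overclock)


bandwidth = memory_bandwidth_gbs(BUS_WIDTH_BITS, DATA_RATE_GBPS)
dp_oc = overclocked_peak(DP_PEAK_GFLOPS, OVERCLOCK)

print(f"Memory bandwidth: {bandwidth:.1f} GB/s")
print(f"DP peak at +10% clock: {dp_oc:.1f} GFLOPS")
```

Under that assumed data rate the bandwidth comes out in line with the "some 260 GB/s" quoted, and a 10% overclock is enough to push the peak DP figure past the 1 TFLOPS mark.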
How did it achieve all this performance? The first implementation of GCN is based on the same 'Compute Units' that we described after AMD's summit in June, with 32 of them in this chip. Each compute unit, able to execute code from multiple kernels at once, is a processor in itself, in a sense a full-fledged 'core' the way we describe cores in general-purpose processors. Each compute unit has a vector processor and a standard scalar processor core, plus a texture block with filtering and fetch units, all backed by local registers, 64 KB of data share memory and a 16 KB L1 cache with 64 bytes/clock of bandwidth. The branching and scheduling units complete the picture.
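The peak FP figures follow almost directly from the compute unit count. The sketch below is a minimal illustration under three assumptions that this article does not spell out: that each compute unit's vector processor is 64 lanes wide, that each lane completes one fused multiply-add (two FLOPs) per clock, and that double precision runs at a quarter of the single-precision rate.

```python
# Peak FLOPS estimate from the compute unit count.
COMPUTE_UNITS = 32             # from the article
CLOCK_GHZ = 0.925              # default clock, from the article
LANES_PER_CU = 64              # ASSUMPTION: vector lanes per compute unit
FLOPS_PER_LANE_PER_CLOCK = 2   # ASSUMPTION: one fused multiply-add per lane per clock
DP_RATE = 0.25                 # ASSUMPTION: DP throughput relative to SP


def peak_gflops(cus, lanes, flops_per_lane, clock_ghz):
    """Peak throughput in GFLOPS = units * lanes * FLOPs per lane per clock * GHz."""
    return cus * lanes * flops_per_lane * clock_ghz


sp = peak_gflops(COMPUTE_UNITS, LANES_PER_CU, FLOPS_PER_LANE_PER_CLOCK, CLOCK_GHZ)
dp = sp * DP_RATE

print(f"Single precision: {sp:.1f} GFLOPS")
print(f"Double precision: {dp:.1f} GFLOPS")
```

With these assumptions the results land on the 3.8 TFLOPS and 947 GFLOPS peaks AMD quotes for the card.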
The cache hierarchy is intriguing here: each set of four compute units shares an extra 32 KB scalar data cache plus a 16 KB instruction cache, backed by a common L2 cache, which totals 768 KB in the HD 7970 chip across all compute units. This L2 then interfaces to the memory controller. A Global Data Share unit enables synchronisation among all compute units at the L1 cache level. This is a far more complex cache architecture than what you'll see in today's general-purpose CPUs, by the way.
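To get a feel for how much on-chip memory this adds up to, the sketch below simply totals the per-unit sizes quoted above across the 32 compute units. All inputs come from the text; the variable names and the grouping into a dictionary are only for illustration.

```python
# Tally of on-chip memory, using only sizes quoted in the article.
COMPUTE_UNITS = 32
CUS_PER_GROUP = 4          # each set of four CUs shares scalar and instruction caches
CLOCK_GHZ = 0.925

L1_KB_PER_CU = 16
LDS_KB_PER_CU = 64         # the "data share memory" in each compute unit
SCALAR_KB_PER_GROUP = 32
INSTR_KB_PER_GROUP = 16
L2_KB_TOTAL = 768
L1_BYTES_PER_CLOCK = 64    # per compute unit

groups = COMPUTE_UNITS // CUS_PER_GROUP

totals_kb = {
    "L1 data cache": COMPUTE_UNITS * L1_KB_PER_CU,
    "local data share": COMPUTE_UNITS * LDS_KB_PER_CU,
    "scalar data cache": groups * SCALAR_KB_PER_GROUP,
    "instruction cache": groups * INSTR_KB_PER_GROUP,
    "L2 cache": L2_KB_TOTAL,
}

# Aggregate L1 bandwidth if every compute unit moves 64 bytes each clock.
l1_bandwidth_gbs = COMPUTE_UNITS * L1_BYTES_PER_CLOCK * CLOCK_GHZ

for name, size_kb in totals_kb.items():
    print(f"{name}: {size_kb} KB")
print(f"Aggregate L1 bandwidth: {l1_bandwidth_gbs:.1f} GB/s")
```

The aggregate L1 bandwidth is a theoretical peak at the default clock; it is several times the DRAM bandwidth, which is the point of having the cache hierarchy in the first place.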
In addition to the compute units, there are still dedicated processing elements for 3-D graphics. Dual geometry units, eight render back ends and the usual video acceleration hardware complete the chip. And yes, the system interface is, finally, PCIe 3.0 x16.
Now, how does GCN in the HD 7970 achieve, as AMD claims, much better actual use of the underlying resources than the previous generation? First, you'll notice that the chip architecture looks far more symmetric and simpler to understand than the GPUs of the past, which should translate into programs being able to use it more efficiently. Also, the separate dual asynchronous compute engines help schedule multiple tasks in parallel with the graphics command processor, while dual direct memory access (DMA) engines help in fetching and sending data fast, able to saturate the 16 GB/s PCIe 3.0 link to the rest of the system.
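The 16 GB/s figure is the rounded per-direction rate of a 16-lane PCIe 3.0 link. The sketch below derives it and estimates how long it would take to fill the card's 3 GB of memory at that rate. The 8 GT/s per-lane signalling rate and 128b/130b encoding are properties of PCIe 3.0 rather than figures from this article, and the estimate ignores protocol overhead, so treat it as a best case.

```python
# Best-case PCIe 3.0 x16 throughput and time to fill the card's memory.
LANES = 16
TRANSFER_RATE_GT_S = 8.0        # PCIe 3.0 signalling rate per lane (not from the article)
ENCODING_EFFICIENCY = 128 / 130 # 128b/130b line encoding (not from the article)
CARD_MEMORY_GB = 3.0            # from the article


def pcie_bandwidth_gbs(lanes, rate_gt_s, efficiency):
    """Per-direction bandwidth in GB/s: one bit per transfer per lane, 8 bits per byte."""
    return lanes * rate_gt_s * efficiency / 8


pcie_gbs = pcie_bandwidth_gbs(LANES, TRANSFER_RATE_GT_S, ENCODING_EFFICIENCY)
fill_time_s = CARD_MEMORY_GB / pcie_gbs

print(f"PCIe 3.0 x16: {pcie_gbs:.2f} GB/s per direction")
print(f"Time to fill 3 GB: {fill_time_s * 1000:.0f} ms")
```

Even in the best case, the link is more than an order of magnitude slower than the card's local memory, which is why keeping the DMA engines busy in the background matters for compute workloads.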
In some tests, like AES-256, the resulting improvement in actual performance is more than fourfold, and many others also show speed jumps higher than the theoretical FP speed boost from the 6970 to the 7970. A doubling is a regular occurrence in many GPU compute benchmarks here. And, critically for compute use, the FP here is fully IEEE 754 compliant, while those aiming for workstation or compute server use get ECC protection all the way, for both DRAM and SRAM.
The new architecture should finally enable easy GPU multitasking: not just many processes on one GPU, but also one task spread across multiple GPUs. That is something HPC users would very much welcome, and high-end PC users as well, since, say, four GPUs in Quad CrossFire could then be used for more than just gaming FPS.
In summary, GCN in the AMD Radeon HD 7970 goes a step further than Nvidia did with its current 'Fermi' GeForce generation in making the GPU a more versatile system compute coprocessor. There are still further steps to take in bringing the GPU even closer to the CPU, including memory sharing and the interconnect. However, the improvements seen in this brand new chip should serve as a good pointer to the way forward for both Nvidia and Intel, the latter with its Knights Corner accelerators.