The massive jumps in both peak and actual usable GPU compute FP performance, especially across multiple GPUs, show that both the GPU and OpenCL have matured enough to make graphics compute offload a viable proposition right down to the mainstream.
Being among the first to get hold of the new GPU – and multiples of them at that, in this case the AMD Radeon HD 7970 'Tahiti' – has its advantages, one of them being the chance to run a lot of tests for the public to see for the first time. In our case, benchmonkey Lennard ran quite a few benchmarks on one, two, three and, of course, four cards in parallel. You'll have noticed that the compute results scaled just as well as, if not better than, the 3-D graphics ones, which did a wonderful scaling job in their own right – notice the 3DMark and Heaven scaling gains between one and four cards!
Putting aside the numbers, which you can see on our site and widely across the Net now as well, the impact is interesting. AMD's new GCN GPU architecture, which basically looks like a vector FP processor surrounded by graphics acceleration hardware, delivers a much higher usable-to-peak FP ratio, especially for the double-precision FP critical to mainstream PC and HPC applications. On top of that, the OpenCL programming model has now matured enough to handle single and multiple tasks with many threads, well balanced and spread across multiple GPUs, as the benchmark gains on up to four GPUs have shown.
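As a rough illustration of the balancing act the host side of an OpenCL program performs, here is a minimal sketch – plain Python, no actual OpenCL calls, with a hypothetical device count and workload size chosen only for illustration – of splitting one large 1-D workload into near-equal contiguous chunks across several GPUs:

```python
# Sketch: split a 1-D global work range across N GPUs in near-equal
# contiguous chunks, the way a multi-GPU OpenCL host program typically
# partitions a data-parallel kernel before enqueueing one sub-range
# per device. Device count and work size here are made up.

def partition(total_items, num_devices):
    """Return (offset, count) pairs, one per device, covering the range."""
    base, extra = divmod(total_items, num_devices)
    chunks = []
    offset = 0
    for d in range(num_devices):
        count = base + (1 if d < extra else 0)  # spread the remainder
        chunks.append((offset, count))
        offset += count
    return chunks

# e.g. 1,000,000 work-items over four hypothetical HD 7970s:
print(partition(1_000_000, 4))  # four equal 250,000-item chunks
```

In a real host program each `(offset, count)` pair would become the global work offset and size of a kernel enqueue on that device's own command queue; the near-linear scaling seen in the benchmarks suggests the runtimes handle exactly this kind of split well.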
One important thing here is that GPU compute is not limited by the four-GPU barrier of CrossFire – or SLI, on Nvidia. If your application, or set of tasks, can handle it, and the underlying board has enough PCIe slots to support it, there's nothing to stop you from having, say, eight or more GPUs in a single system, all running GPU compute and/or graphics at the same time. For instance, the upcoming Xeon E5-4600 quad-socket LGA 2011 platform, slated for release mid-year, will have a whopping 160 PCIe v3 lanes available direct from the CPUs, enabling eight or more GPU cards in the system. If each of these is a, say, 1.1 GHz pre-overclocked HD 7970 with 6 GB of RAM, that would mean 9 TFLOPs of DP FP capability in a single box, with 48 GB of dedicated RAM on the GPUs for large local dataset processing – no need to cross the PCIe link, which is over an order of magnitude slower. Keep this in mind, as frequent slow data movement over PCIe has been one of the key limitations holding back wider GPGPU adoption.
In those apps where you can tolerate the even higher PCIe latency induced by the PCIe bridges on dual-GPU cards, in return for higher total performance, that same quad-Xeon E5 box could take eight dual-GPU cards – say, 975 MHz pre-overclocked AMD HD 7990s – for 16 TFLOPs of peak DP FP performance in a single box. Whether 2 x 6 GB of RAM can fit on a single dual-GPU card right now remains to be seen, but it would help keep all that data from having to shuffle over the comparatively slower bridged PCIe.
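A quick back-of-the-envelope check of both figures above. This assumes Tahiti's publicly quoted specs – 2048 ALUs, 2 FLOPs per cycle (fused multiply-add), double precision at 1/4 the single-precision rate – which come from AMD's disclosures rather than this article:

```python
# Peak double-precision FLOPS for a Tahiti GPU:
#   2048 ALUs x 2 FLOPs/cycle (FMA) x clock, at 1/4 rate for DP.
ALUS, FLOPS_PER_CYCLE, DP_RATE = 2048, 2, 1 / 4

def dp_tflops(clock_ghz, num_gpus=1):
    """Peak DP FP throughput in TFLOPs for num_gpus Tahiti GPUs."""
    return ALUS * FLOPS_PER_CYCLE * DP_RATE * clock_ghz * num_gpus / 1000

# Eight single-GPU HD 7970s pre-overclocked to 1.1 GHz:
print(round(dp_tflops(1.1, 8), 2))     # 9.01 -> the ~9 TFLOPs quoted

# Eight dual-GPU HD 7990s at 975 MHz, i.e. 16 GPUs:
print(round(dp_tflops(0.975, 16), 2))  # 15.97 -> the ~16 TFLOPs quoted
```

Both configurations land almost exactly on the round numbers quoted, which is presumably why those particular pre-overclocked frequencies were picked.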
Why do I mention pre-overclocked here? Well, it seems AMD was really conservative in clocking the new parts, as so many of them run comfortably at 1.1 GHz without any voltage changes or the like. The yields seem to be good, and there's nothing wrong with producing higher-binned parts, combined with more memory, to breach the 1 TFLOP peak DP FP barrier – which for these cards sits at a 980 MHz GPU clock, or let's just round it up to 1 GHz. The extra performance gets you a few more FPS in gaming, but it means real, monetizable compute performance in supercomputing, workstation and multimedia use.
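Under the same assumed Tahiti specs as before (2048 ALUs, 2 FLOPs per cycle, 1/4-rate double precision), the clock needed to hit 1 TFLOP of peak DP FP works out to roughly the figure quoted:

```python
# Clock needed for 1 TFLOP peak DP FP on an assumed Tahiti GPU:
#   clock = 1e12 / (2048 ALUs x 2 FLOPs/cycle x 1/4 DP rate)
dp_flops_per_cycle = 2048 * 2 * (1 / 4)   # 1024 DP FLOPs per cycle
clock_mhz = 1e12 / dp_flops_per_cycle / 1e6
print(round(clock_mhz))   # 977 -> the ~980 MHz / ~1 GHz mark
```

So a part binned just shy of 1 GHz crosses the 1 TFLOP DP line with a little headroom to spare.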
Let's see what Nvidia answers with in its GK100 chips over the next few months, followed by Intel's 'Knights Corner', the compute incarnation of the Larrabee project. Either way, with the new 'Tahiti' chips, GPU-accelerated computing looks more efficient and more meaningful than ever.