We examine AMD's next-generation GPU architecture, which looks more like a general-purpose vector processor than ever before. With ever increasing software support for GPU computing, are future GPUs set to become the new math coprocessors of yesteryear?
Many of our avid readers may have experimented with more than just 3-D gaming or overclocking on their GPUs. These days, an increasing number of non-3-D applications use one aspect of GPU acceleration or another – whether integer or FP processing, or simply the immense local memory bandwidth for specific search operations – in both the high-performance and desktop computing realms. And, with both Linux and Windows compilers now supporting this well – see the upcoming Microsoft C++ AMP here – in-line GPU code support may become common in many more programs.
However, will it change the direction of GPU evolution towards a kind of more versatile fast maths co-processor or accelerator, much as the early 80×86 processors had 80×87 FP co-processors in the eighties and nineties (and the same went for the Motorola 680×0 CPUs in Macs of that era)? It seems that Intel wasn't the only one thinking that way with Larrabee, although a tad too early. AMD's next-generation GPU architecture, shown at their Fusion Summit a month ago and expected to materialise initially in the Radeon HD7000 series on a 28 nm process before yearend, goes there firmly.
If you look at Microsoft's own expectations in this diagram, the GPU will move further into CPU territory, with a more generic and easier-to-program approach, and with its features accessible to a wider range of applications too. So, the 'generic co-processor' route may make sense in the end.
After observing the new architecture and comparing it with the previous designs from AMD and Nvidia alike, I concluded that the 'next-gen' AMD GPU is even more of a 'graphics-enabled vector processor' than the Nvidia Fermi in the current GeForce line-up attempted to be. Basically, as you see here, we're talking about a mini Cray supercomputer on a chip, with X86-compatible 64-bit addressing and memory management, essentially able to share both virtual and physical memory with the X86 main processors in the system. And, if somehow connected to the CPU via HyperTransport or QuickPath (the latter doubtful due to Intel licensing issues), it could literally be a very tightly coupled co-processor with its own memory on the side, yet able to address all the main memory at near-CPU speed, without PCIe bottlenecks.
Side note: yes, some will point out that PCIe v3 can use tunnelling or other protocol enhancements to allow reasonably quick inter-process communication between CPU and GPU, and even addressing of both memory areas by a single thread, but nothing beats a superfast, low-latency, cache-coherent direct interconnect like HTX or QPI.
You may have seen varied coverage of this next-gen AMD GPU architecture all over the Net; instead, let's look at the points that should be critical here, and at where AMD's own claims stand:
- It's basically a vector processor, with a scalar co-processor and graphics fixed-function hardware around it
- Each Compute Unit comprises a 4-wide MIMD combination of 64-op FMAD vector processing units with 40-way SMT capability
- Each Compute Unit has a private 16KB L1 cache, and an attached 64KB L2 cache that can be accessed by all the other CUs as well as the CPU
- It has X86-compatible addressing, pointers and even page faults, with memory and L2 cache coherence between CPU and GPU
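To make the 'vector processor' framing concrete: a 64-wide FMAD instruction computes d[i] = a[i] × b[i] + c[i] across all 64 lanes at once. Here is a toy Python sketch of one such lane-parallel operation – a pure software stand-in for illustration, not AMD's actual ISA:

```python
VECTOR_WIDTH = 64  # lanes per vector unit, per the list above


def fmad(a, b, c):
    """One 64-wide fused multiply-add 'instruction':
    returns d where d[i] = a[i] * b[i] + c[i] for every lane."""
    assert len(a) == len(b) == len(c) == VECTOR_WIDTH
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]
```

A shader or compute kernel maps naturally onto this model: each lane is one work-item, and the 40-way SMT simply keeps 40 such vector operations in flight to hide memory latency.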
In graphics work, these vector units are treated as a unified shader array, controlled and assisted by 3-D-specific fixed-function hardware for tessellation, geometry, textures and scan conversion, as well as side hardware for HD video acceleration, as usual. In maths, though, you're basically looking at a co-processor. The only thing missing is a set of 'X86 instruction set extensions' to program the thing directly in assembly code, which I hope never happens – X86 has to be retired at some point, being the crudest instruction set architecture in history, yet one that overwhelmed all its more elegant competitors by brute force.
Let's look at the maths capability of the new chip. Assume for a second a default vector Compute Unit frequency of 1 GHz, and that it performs 64 of those 64-bit FMAD – fused multiply-add – operations per cycle, as its vector is 64 words wide (it may still be 64 x 32-bit values in the first implementation, but let's assume 64-bit first). Each FMAD counts as two FP ops when calculating FLOPs ratings, by the way. So, at 1 GHz, each Compute Unit could theoretically provide 128 GFLOPs of double-precision throughput (or single-precision, if the vector only holds 64 x 32-bit values). To match the existing double-precision throughput of the HD6970 card, you'd need just seven of these Compute Units, assuming, again, that the new units are 64-bit based. A chip with 16 units would give you 2 TFLOPs, and so on as you scale up.
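The back-of-the-envelope arithmetic above is easy to verify; a quick sketch (the 1 GHz clock and 64-wide 64-bit vector are this article's assumptions, not announced specifications):

```python
def cu_gflops(clock_ghz=1.0, vector_width=64, flops_per_fmad=2):
    """Theoretical per-Compute-Unit throughput in GFLOPs:
    one full-width vector FMAD per cycle, each FMAD counted as 2 FP ops."""
    return clock_ghz * vector_width * flops_per_fmad


print(cu_gflops())       # 128.0 GFLOPs per Compute Unit at 1 GHz
print(16 * cu_gflops())  # 2048.0 GFLOPs, i.e. roughly 2 TFLOPs for 16 units
```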
Of course, AMD will have to increase memory bandwidth further on the top-end chippery to feed all these units, so I am quite sure that 384-bit, and likely even 512-bit, GDDR5-and-beyond local memory subsystems will appear here. The wider buses not only increase bandwidth but also raise the maximum possible capacity: 4 GB of local RAM with 2 Gbit GDDR5 chips in a single load, or 8 GB with a double load. That helps keep as much as possible of those compute tasks in local high-bandwidth, low-latency memory, without going to the CPU – yet, via shared memory management, it still enables the CPU to address the GPU's memory directly as well.
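The capacity figures follow directly from the bus width: a GDDR5 chip presents a 32-bit interface, so a 512-bit bus takes 16 chips per load. A quick check of the numbers above (the 512-bit bus and 2 Gbit densities are the assumed configuration from the text):

```python
def local_ram_gb(bus_width_bits=512, chip_gbit=2, loads=1):
    """Local memory capacity in GB for a GDDR5 subsystem:
    each 32-bit-wide chip contributes chip_gbit gigabits."""
    chips = (bus_width_bits // 32) * loads
    return chips * chip_gbit / 8  # gigabits -> gigabytes


print(local_ram_gb())         # 4.0 GB: 16 chips x 2 Gbit, single load
print(local_ram_gb(loads=2))  # 8.0 GB with a double load
```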
How will this impact Nvidia's and Intel's efforts? Well, Intel's 'Knights' family, with its first product expected on a 22 nm process a year or so from now, should have a fairly easy time acting as a co-processor, or simply as another processor alongside the CPU, thanks to its X86 front-end. If connected via QPI to its Xeon E5 Sandy Bridge / Ivy Bridge brethren, it could access all of their memory reasonably fast, in parallel with its own memory system. However, after the initial Larrabee graphics performance failure, Intel – perhaps wisely – decided to focus these multi-core vector chips on the profitable HPC supercomputing niche. If anything, that model now looks quite in line with what the new AMD GPUs intend to do in the compute market too.
As for Nvidia, they could fairly easily follow an approach similar to AMD's – after all, Fermi already went some way towards the 'general-purpose GPGPU' goal a year before. However, integrating their GPUs with Intel and AMD CPUs, from the programming and even the physical interconnect aspect, may not be easy. Nvidia doesn't seem to have a QPI licence from Intel as we speak, and holding just a HyperTransport licence leaves it with the option to either stick with PCIe GPUs and work out a way to create a tighter connection with any CPU, or create HyperTransport-attached GPUs that would fight with similar AMD GPUs for a comparatively small AMD CPU market at the high end.
In summary, if AMD does use this brand-new GPU architecture in the 28 nm Radeon HD7000 series, it will be the most revolutionary GPU change since the advent of programmable shaders half a decade ago. We hope it will not affect core graphics driver performance or stability, of course, but once the initial hurdles are overcome, this will make for a very interesting new system – and application – architecture. Yes, mixing a few wide cores and hundreds of narrow cores in one application at the same time may sound like a challenge, but the performance gains may more than justify it.