Earlier, we looked at the background of China's high-end microprocessor efforts, as well as the most widely known of them, the Loongson MIPS family. In this second part, we cover Alpha.
Alpha was, for a long time around the turn of the century, the Formula 1 of microprocessors, with a very simple, elegant yet extremely scalable RISC architecture focused on raw speed, and pure 64-bitness without any 32-bit modes or compatibility baggage. Between 1993 and 2001, the time of its untimely murder, it owned the majority of performance records, especially where raw processor performance was concerned. DEC (Digital Equipment Corp) system designers were sometimes too stingy with the memory and I/O systems, allowing other vendors to occasionally win the accolades in tests that leaned on those. The best known of the Alpha cores, and the one with the highest comparative performance advantage over the competition, was the 21164, a.k.a. the EV5 family, which spanned three semiconductor process generations: 0.50, 0.35 and 0.25 microns.
The highest-volume part was the 0.35 micron 21164A of 1996-7, reaching up to 667 MHz and beating the contemporary 266 MHz Pentium II by more than a factor of two in most benchmark tests of the time. The 21164 core, a simple four-issue in-order design optimised for very high clocks, with two FP ops per clock, was also the most power-efficient of all Alphas, taking some 25 Watts at 667 MHz vs 75 Watts for the 600 MHz Pentium III 'Katmai' which followed a few years later, still at lesser performance. The subsequent Alpha cores, such as the 21264 EV6, brought up to double the performance per clock, but at three times the power consumption per clock, a point that is very important when looking at the choices made later in this story.
The 21264 out-of-order core was also scaled across three processes, including derivatives made by Samsung, the major Alpha architecture licensee. It, and its successor the 21364 EV7, carried the performance torch until 2002 or so, well after Alpha's further public development was stopped. Do note the memory and I/O interconnect revolution that came with the EV7. While the core was basically the same EV6 type, the chip around it was a revolution for year 2000 computing: an on-chip 1.75 MB L2 cache; a 10-channel integrated Rambus memory controller with humongous memory bandwidth, basically matching that of the L2 cache and enabling that cache to act as a low latency buffer for the memory system; and four parallel 6.4 GB/s coherent interconnect links to four other processors, scaling up to 512 sockets with directory support. Such things were only seen in PCs five years later, first with HyperTransport from AMD, later followed by QPI from Intel. Both of these interconnects are derived from the Alpha EV7, a design now over a decade old.
And there was more to come. The 21464 EV8, aimed for release in 2002 if things had continued as originally planned, was to be the first processor with an eight-issue wide superscalar out-of-order core with simultaneous multithreading, and we mean four threads out of each core here. The 'EV9' 21564 design was expected to add multi-core and a huge, wide vector unit (up to 1 kilobyte wide) to the mix, enabling well over 100 GFLOPS of DP floating point performance per core in the 2004 timeframe. Remember, we are only now reaching such capabilities in late 2011, and need 6 to 8 cores for that. Anyway, the multithreading and vector enhancements designed into the EV8 and EV9 well ahead of their time never saw the light of day in the open market.
In the late nineties, China saw the value and capability of Alpha, and built a number of Alpha systems, some of them very large for the time. It also fully licenced the Digital / Tru64 UNIX and related software stack, including the full source code, from Compaq after the latter bought DEC, giving China the critical software control part. At the same time, having seen the business instabilities linked to the Digital-Compaq-HP transition, China seems to have been working on its own Alpha flavour.
After over a decade of work and three generations of CPUs, Jiangnan Research Lab has shown the ShenWei (Sunway) SW-3 processor, the Chinese flavour of Alpha, not in a small workstation, not in a server, but in no less than a huge petaflop-class supercomputer in Jinan, Shandong: the Sunway BlueLight MPP, unveiled this past October. The CPU itself has been running for over a year in a variety of systems, but showing it powering a petaflop machine was probably the best PR one could get, especially with foreign supercomputing dignitaries such as Jack Dongarra, the man behind the TOP500 list and the Linpack FP benchmark, taking note.
The SW3, aka SW1600, is a 16-core, 64-bit RISC processor, with each core looking a lot like an improved version of the 21164A EV56 Alpha core, plus a vector FP unit extension added to each core. While the initial speed range was 1 to 1.2 GHz in the 65 nm process, the standard speed grade is a 1.1 GHz chip with 141 GFLOPS DP FP performance. The speed set for the BlueLight petaflop machine's TOP500 run was 975 MHz, though. The quad-channel 128-bit DDR3 on-chip memory controller offers 68 GB/s bandwidth, which is, yes, equivalent to 8 channels of DDR3-1066 server RAM.
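The bandwidth equivalence quoted above can be checked with a little arithmetic. The sketch below assumes the SW1600 controller runs its four 128-bit channels at the DDR3-1066 data rate (about 1066.67 MT/s), which is inferred from the article's comparison rather than from any official specification, and that a standard server DDR3 channel is 64 bits wide. The function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the memory bandwidth comparison.
# Assumption: the SW1600 channels run at the DDR3-1066 data rate;
# this is inferred from the article, not from official documentation.

DDR3_1066_MT_S = 1066.67  # mega-transfers per second for DDR3-1066


def ddr_bandwidth_gb_s(channels, width_bits, mega_transfers_per_s):
    """Peak bandwidth in GB/s: channels x bytes per transfer x transfer rate."""
    bytes_per_transfer = width_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000


# SW1600: four channels, each 128 bits wide.
sw1600_bw = ddr_bandwidth_gb_s(4, 128, DDR3_1066_MT_S)

# Reference point: eight standard 64-bit DDR3-1066 server channels.
server_bw = ddr_bandwidth_gb_s(8, 64, DDR3_1066_MT_S)

print(f"SW1600 (4 x 128-bit): {sw1600_bw:.1f} GB/s")
print(f"Server (8 x 64-bit):  {server_bw:.1f} GB/s")
```

Both configurations move 64 bytes per transfer in total, which is why they land on the same figure of roughly 68 GB/s.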
The L1 and L2 caches are still rather minuscule for a modern CPU, being kept at the original 21164 sizes of 2 x 8 KB L1 and 96 KB L2. However, this has enabled both very small cores and very, very low cache latencies, down to two clock cycles for L1. You can see the CPU block diagram here.
As mentioned before, the 21164 core was the most power-efficient of all Alphas, and also one of the best 64-bit high-end CPU cores of all time in performance per watt, excluding mainstream, entry-level or embedded processors. So, the Chinese choice to stay with that core for all these years seems to have been proven right at this point, although, as the Loongson case shows, they obviously had plenty of resources to improve the EV6 or even EV8 cores if they wanted to. Remember Intel's Knights Corner, or the AMD GCN GPU architecture for compute?
Knights Corner, being a compute version of the abandoned Larrabee project, uses a core even simpler, and slower, than the Alpha 21164: basically a 64-bit version of the old Pentium, enhanced with much higher bandwidth, to act as a feeder to a vector unit behind it that provides very fast FP. Stick 50-odd of those on one chip, with the right cache and interconnect in between, and you've got a good accelerator. The Compute Units in the AMD Radeon HD 7970 aren't that much different, although they are based on a native optimised architecture rather than cumbersome X86.
So, in the ShenWei SW3, you have a simple, well proven four-issue superscalar in-order core (still double the per-cycle issue width of Pentium or Atom) with a very small die footprint on today's processes, yet improved and given enhanced bandwidth to feed a simple, AVX-like throughput vector unit. What's the vector unit's speed then? If you normalise the clock to 1 GHz, it gives you 8 GFLOPS DP per core, or 8 flops per cycle. Not bad at all for a 2010 chip using an enhanced 1995 core! All that comes at very low power consumption, below 40 watts per socket (official figures are not available), despite the old 65 nm process.
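The per-core figure ties back to the per-chip numbers quoted earlier in the article. As a rough sketch, peak DP throughput is simply cores times flops per cycle times clock; the helper name below is hypothetical, and the inputs are the article's own figures (16 cores, 8 flops per cycle, 1.1 GHz standard grade, 975 MHz in the TOP500 run).

```python
# Peak double-precision throughput from the article's figures.
# peak_gflops is an illustrative helper, not part of any vendor tool.

def peak_gflops(cores, flops_per_cycle, clock_ghz):
    """Theoretical peak in GFLOPS: cores x flops per cycle x clock in GHz."""
    return cores * flops_per_cycle * clock_ghz


per_core_at_1ghz = peak_gflops(1, 8, 1.0)       # one core, normalised to 1 GHz
chip_standard = peak_gflops(16, 8, 1.1)         # standard 1.1 GHz speed grade
chip_top500_clock = peak_gflops(16, 8, 0.975)   # clock used for the TOP500 run

print(f"Per core at 1 GHz:   {per_core_at_1ghz:.1f} GFLOPS")
print(f"Per chip at 1.1 GHz: {chip_standard:.1f} GFLOPS")
print(f"Per chip at 975 MHz: {chip_top500_clock:.1f} GFLOPS")
```

At 1.1 GHz this works out to about 141 GFLOPS per chip, matching the quoted standard speed grade, and to roughly 125 GFLOPS per chip at the 975 MHz used in the benchmark run.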
And the sustained performance and power consumption of the Sunway BlueLight petaflop system were the proof of the pudding. The water-cooled 9-rack machine has 8,704 ShenWei SW1600 processors (only 8,575 of them ran the TOP500 benchmark, at 975 MHz each), organized as 34 Super Nodes of 256 compute nodes each, with 150 TB of main memory and 2 PB of external storage. Peak performance is 1.07 PFLOPS and sustained performance 796 TFLOPS, for an efficiency of 74.37%, at a total power consumption of 1,074 kW. These figures compare very well against competitive US supercomputer systems such as the X86-based Jaguar.
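These system-level figures hang together arithmetically, as the short sketch below shows. It uses only numbers quoted in the article, plus the 8 flops per cycle per core derived earlier; the variable names are illustrative. One caveat: the article does not say whether the 1,074 kW figure covers the whole machine or only the processors used in the run, so the flops-per-watt result should be read as approximate.

```python
# Consistency check of the Sunway BlueLight figures quoted in the article.

SUPER_NODES = 34
NODES_PER_SUPER_NODE = 256
PROCESSORS_IN_RUN = 8575     # processors that took part in the benchmark
CORES_PER_CHIP = 16
FLOPS_PER_CYCLE = 8          # DP flops per core per cycle
CLOCK_GHZ = 0.975            # clock used for the benchmark run
SUSTAINED_TFLOPS = 796.0
POWER_KW = 1074.0            # scope of this figure is an assumption

total_processors = SUPER_NODES * NODES_PER_SUPER_NODE

# Peak of the processors that actually ran the benchmark, in TFLOPS.
peak_tflops = (PROCESSORS_IN_RUN * CORES_PER_CHIP
               * FLOPS_PER_CYCLE * CLOCK_GHZ) / 1000

efficiency = SUSTAINED_TFLOPS / peak_tflops

# TFLOPS -> MFLOPS is x 1e6, kW -> W is x 1e3.
mflops_per_watt = (SUSTAINED_TFLOPS * 1e6) / (POWER_KW * 1e3)

print(f"Total processors: {total_processors}")
print(f"Peak:             {peak_tflops:.2f} TFLOPS")
print(f"Efficiency:       {efficiency:.1%}")
print(f"Sustained:        {mflops_per_watt:.0f} MFLOPS per watt")
```

The peak comes out at about 1,070 TFLOPS, i.e. the quoted 1.07 PFLOPS, and the sustained-to-peak ratio at roughly 74.4%, in line with the quoted 74.37% (the small difference comes from rounding the sustained figure to 796 TFLOPS).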
What does the future hold for ShenWei? Well, it can either continue where Alpha was stopped, moving to 8-issue cores (even an in-order architecture can do it these days, since compilers and scheduling have evolved a lot over the past decade) and much faster FP per core, with fresh cache and memory architectures, or just tweak the current core and pack more of them onto a single die at higher clock speeds, with wider vector units and more memory bandwidth to feed all that: a bit like a RISC cousin of Knights Corner, but a true CPU here, instead of just an accelerator. Either path can lead to a teraflop on a chip soon, and either will require a rapid jump in the semiconductor process used, down to the 32 nm or 28 nm nodes, just like Loongson is expected to do this coming year.
Keep in mind that Alpha left behind a strong software library, not forgetting the Alpha-based Cray T3 system series here as well. This includes one of the best UNIXes ever, as well as great compilers, optimised libraries, and much more. Coupled with its own software base, China has sufficient resources to continue developing ShenWei on its own, with a sufficient internal market. However, when it decides to go fully commercial with the effort, there will be plenty of interested partners worldwide ready to embrace the old-new Formula 1 of microprocessors yet again, this time with a far more stable supplier, business-wise, than DECompaq was.
Part 3 will look at the ARM and native CPUs of China.
Photo Credits: it168.com