Intel's Xeon Phi, even weeks before its official announcement, already has an impressive deployment site in Texas. Should Nvidia be worried about the newcomer?
Intel showed us the first major Xeon Phi deployment site, at TACC in Austin, Texas – just a stone's throw from IBM's Power engineering den, by the way. In its fully deployed first phase, the 6,400-node supercomputer will pair each dual Xeon E5 node with one 1.08 DP TFLOPs Xeon Phi 'special edition' KNC card, for a formidable near-7 PFLOPs of added horsepower on top of the roughly 2 PFLOPs from the node CPUs themselves. The combined 9 PFLOPs may sound impressive by regional standards, but it is still minute compared to the over 100 PFLOPs expected from an even larger Xeon Phi monster at the Guangzhou Supercomputer Centre, due online well before this date next year.
So, despite the dismal failure of the Larrabee project as a graphics engine, its core, repackaged as an HPC accelerator, seems to have had a fairly good initial reception in the supercomputing community. Yet it isn't really faster than the AMD FirePro W9000 or the upcoming Nvidia Tesla K20. What's the reason for the warm welcome?
First, the programming approach helps: for most existing x86-oriented code, a simple recompile is often enough, compared to the still messy CUDA or OpenCL ports. A day versus a month of code-porting time can mean a lot to a researcher hard pressed to deliver results.
Second, cutting the now-unwanted graphics-related parts out of the old Larrabee to create a pure HPC accelerator has freed die space for other features – for instance, a wider 512-bit GDDR5 memory interface (compared to 384 bits on the current AMD and Nvidia GPGPUs), which in turn enables more memory: 8 GB versus 6 GB for the competitors. When you must keep data in local memory for maximum performance, every gigabyte counts, and here there are two extra.
Third, Intel can offer these in a package together with the base Xeon CPUs, something very important when budgeting for large machines. Like it or not, Nvidia can't match that.
Last but not least, the slight disappointment in the Tesla K20 specs – the GK110 reportedly having to set aside a gigabyte of its RAM for ECC, according to the grapevine, and an overall reduction in the expected DP FP performance from 1.5 to less than 1.2 TFLOPs – coupled with the CUDA issues, has more users looking at the alternatives.
Will Xeon Phi cripple Nvidia's most profitable GPGPU line, the Tesla HPC cards, the same way Intel brought down AMD's CPU market share? Not so fast, I think – Nvidia is far more aggressive than AMD in protecting its turf, using various means of protection that by themselves deserve a separate story (or an action movie, maybe). Marshalling the media and channels in its favour was one of its known strengths, until of course it started making more enemies than it could handle in those same circles.
On the other hand, Intel has far heavier weaponry on its side, including full control of its semiconductor process – something fabless Nvidia, and now AMD, cannot enjoy – as well as the possibility of far tighter future integration of Xeon and Xeon Phi, whether by using the QPI bus for shared memory and cache coherency, or by making both bootable CPUs in mixed-mode clusters. Intel has watched Nvidia make serious money selling Tesla cards these past few years, and now it wants that same market, with little except Nvidia standing in its way.
Intel told us it is committed to keeping a substantial DP FLOPs advantage for Xeon Phi over its own CPUs, which will not be easy: the 2014 Haswell-EP Xeon E5 v3 will already reach three quarters of a teraflop per socket, and its successor, Broadwell-EP, will likely touch a teraflop of DP FP per socket (or 2 TFLOPs in a typical dual-socket workstation) without having to rely on clumsy heterogeneous accelerators. This means a 2015 Xeon Phi should deliver at least 3 TFLOPs DP FP per card to justify being used at all.
Now, this push might leave the Nvidia Teslas by the wayside if they don't keep pace. Will Maxwell and Newton bring some good news for Nvidia? They had better, as multi-instruction-set handling still falls second to pure performance in HPC purchasing decisions.