Our Theo covered a bit of the current stage of Xeon Phi (a.k.a. MIC) sampling. Is there more meat left on the table?
Possibly, yes. The B0 sample list is topped by a 1+ GHz, 61-core part, enough to reach a peak of roughly 1 TFLOPS in double precision (DP). That is akin to AMD's just-released, top-end FirePro W9000 card, except that the Intel offering has more RAM: 8 GB on a 512-bit bus versus 6 GB on a 384-bit bus for AMD. Of course, the AMD part supports PCIe v3 for faster host data transfers, while the current Phi is limited to PCIe v2. And oh yes, Xeon Phi is a sort of mutated x86, with less of the post-Pentium stuff but plenty of SIMD resources superior even to AVX: each core has a 512-bit wide unit with FMA, compared to 256-bit wide without FMA on the current Ivy Bridge (Haswell will add FMA, though only in 2014 for workstations and servers).
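The "roughly 1 TFLOPS" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes each core issues one 512-bit FMA per cycle (8 double-precision lanes, 2 operations per lane) and that all 61 cores run at exactly 1.0 GHz. The function name and the issue-rate assumption are illustrative, not Intel specifications.

```python
def peak_dp_gflops(cores, clock_ghz, simd_bits=512, fma=True):
    """Theoretical peak double-precision GFLOPS for a SIMD many-core chip.

    Assumes one SIMD instruction issued per core per cycle
    (an assumption for illustration, not a published spec).
    """
    lanes = simd_bits // 64            # number of 64-bit doubles per SIMD register
    flops_per_lane = 2 if fma else 1   # an FMA counts as a multiply plus an add
    return cores * clock_ghz * lanes * flops_per_lane


# Top B0 sample as described: 61 cores at (at least) 1 GHz, 512-bit FMA unit.
phi_b0 = peak_dp_gflops(cores=61, clock_ghz=1.0)
print(f"61 cores @ 1.0 GHz: {phi_b0:.0f} GFLOPS DP peak")
```

Under these assumptions the result is 976 GFLOPS, which is where "roughly 1 TFLOPS" comes from; any clock above 1 GHz pushes it past the 1 TFLOPS mark.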
The real competitor that Intel aims Xeon Phi at is, of course, the infamous Nvidia. Its first generation of Kepler chips had near non-existent double precision FP performance, and not exactly great single precision FP either. The upcoming (year-end?) GK110 chip should solve that problem: it is claimed to have a DP peak of nearly 1.5 TFLOPS. It is supposed to show up first in expensive Tesla K20 cards, then in the workstation Quadros, before any that are left (if any) reach GeForce gaming cards. Now, Nvidia is rushing to show off at least some of these in time for the SC supercomputer conference in Salt Lake City this November, but maybe at the cost of reduced performance on those initial samples.
Either way, Intel's real target was to have a full 62-core (yes, that many are on the die itself) Xeon Phi run at about 1.3 GHz, or even more for selected speed bins, to match the claimed speed of GK110 before the latter even appears! What seems to stand in the way are the yields, in terms of both core count and delivered frequency. What I heard this weekend is that Intel is feverishly working on fixing both, so that it can spoil Nvidia's chances at the SC show as well. Remember, this Larrabee follow-on is, after all, made on a 22 nm process, giving Intel a half-node edge over Nvidia and AMD.
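To put the 62-core, 1.3 GHz target next to the claimed GK110 figure, here is a rough sketch. It assumes 16 DP FLOPs per core per cycle (a 512-bit FMA unit: 8 lanes times 2 operations) and takes the "near 1.5 TFLOPS" claim at face value as exactly 1.5. Both helper names are hypothetical.

```python
FLOPS_PER_CORE_PER_CYCLE = 16  # assumed: 8 DP lanes x 2 ops (FMA) per cycle


def peak_dp_tflops(cores, clock_ghz):
    """Theoretical DP peak in TFLOPS under the per-cycle assumption above."""
    return cores * clock_ghz * FLOPS_PER_CORE_PER_CYCLE / 1000.0


def clock_needed_ghz(cores, target_tflops):
    """Clock (GHz) needed for `cores` cores to reach `target_tflops`."""
    return target_tflops * 1000.0 / (cores * FLOPS_PER_CORE_PER_CYCLE)


full_die = peak_dp_tflops(cores=62, clock_ghz=1.3)
needed = clock_needed_ghz(cores=62, target_tflops=1.5)
print(f"62 cores @ 1.3 GHz: {full_die:.2f} TFLOPS DP peak")
print(f"Clock needed for 1.5 TFLOPS on 62 cores: {needed:.2f} GHz")
```

On these assumptions, 1.3 GHz lands at about 1.29 TFLOPS, and fully matching a 1.5 TFLOPS claim would take a speed bin around 1.5 GHz, which is why the "or even more" part matters.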
Another aspect is, of course, porting effort: the simple recompile that is usually sufficient to make codes work on Phi beats CUDA porting time by a factor of ten, according to expert users in Singapore and China I spoke to. This is a big advantage, but it would be even bigger if Intel put Xeon Phi straight on the QPI interconnect to share system memory with 'standard' Xeons or, well, let Linux boot off a future Xeon Phi itself, making it an FP-intensive CPU. Will that happen in the next generation? Watch this space.