Intel Lindenhurst Xeon DP Platform Discussion
HEXUS have an article coming that evaluates the latest Intel Xeon DP platform, codenamed Lindenhurst. As you'll likely know, (current) Xeon is Intel's workstation and server processor based on many of the same technologies that define Pentium 4 in the desktop space. Lindenhurst (at its most basic definition) is the combination of the new Paxville Xeon processor in DP (dual processor) form (there's a multi processor version hosted by Truland), along with Intel E7520 core logic.
The Paxville generation of Xeon is dual-core and uses the latest generation of the Netburst microarchitecture, making the DP version ostensibly a clone of the Pentium D 820, but with the ability to also turn on HyperThreading for both cores and with double the L2 cache. The DP version of Paxville, at $1080 in volume, is only available in 2.8GHz form for the time being, with the MP variant available at up to 3GHz. It supports everything the dual-core Pentium D does, including SSE3 instructions, and rides the same 200MHz system bus (800MHz effective).
E7520 provides a single dual-channel DDR2-400 memory controller, and a shared bus for the CPUs to reach that memory controller. Other features like PCI Express, support for the Xeon CPU's execute disable bit and support for PCI-X via a mandatory 6700PXH segment bridge (2 PCI-X segments) mean that superficially it's a forward-thinking, modern workstation and server platform.
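To put rough numbers on the bus and memory figures quoted above, here's a back-of-the-envelope sketch. It assumes the front-side bus and each DDR2 channel are 64 bits (8 bytes) wide, which the article doesn't state, and the function and variable names are made up purely for illustration. These are theoretical peaks, not measured results.

```python
# Back-of-the-envelope peak bandwidth for the bus and memory speeds quoted above.
# Assumption (not stated in the article): the front-side bus and each DDR2
# channel both have a 64-bit (8-byte) data path.

BYTES_PER_TRANSFER = 8  # assumed 64-bit data path


def peak_bandwidth_gb_s(mega_transfers_per_sec: int, channels: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_sec = mega_transfers_per_sec * 1_000_000 * BYTES_PER_TRANSFER * channels
    return bytes_per_sec / 1e9


# 200MHz bus, 800MHz effective: 800 million transfers per second, one shared bus
fsb = peak_bandwidth_gb_s(800)

# DDR2-400: 400 million transfers per second per channel, two channels
memory = peak_bandwidth_gb_s(400, channels=2)

print(f"Shared front-side bus: {fsb:.1f} GB/s")
print(f"Dual-channel DDR2-400: {memory:.1f} GB/s")
```

Under those assumptions the two peaks come out the same, so on paper the memory controller can only just feed one fully loaded bus, and that bus is serving both sockets.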
However, in advance of the Lindenhurst test platform arriving for evaluation, I've caught myself wondering just how it's supposed to work with any kind of real performance outside of a couple of scenarios. It's an issue of too many consumers sharing limited resources, mainly at the CPU and memory controller levels.
Not much food to go round
We've evaluated HyperThreading-capable processors many times since the technology launched with the 3.06GHz Pentium 4, and while there's opportunity for performance improvements with a single HyperThreaded processor, performance rarely doubles because HyperThreading works by sharing one CPU's execution resources between its two logical threads.
In an SMP scenario with Xeon, you've then got CPUs sharing a memory controller. When that memory controller only supports fairly slow DDR2-400, likely at higher latency and with a performance penalty compared to DDR-400 (even without ECC in the mix, which is almost mandatory for Xeon given the places it's deployed), there's a performance issue. When the CPU-to-memory bus is also shared between the two CPUs, so that only one can use it at a time and accesses have to be interleaved, performance can be limited by a CPU-to-memory bottleneck.
Add in dual-core and you've now got four cores sharing one memory controller over one bus link. See where I'm going with this? Add in HyperThreading and eight logical processors in two sockets have to share that one memory resource, on one bus.
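Here's a minimal sketch of that sharing argument as arithmetic. It assumes a theoretical 6.4 GB/s peak for the one shared bus (800 million transfers per second over an assumed 8-byte data path) and the worst case where every thread is memory-bound and the bus is split evenly. Real arbitration and sustained bandwidth won't be this tidy, and the names are hypothetical.

```python
# Even-split share of one shared bus when every thread is memory-bound.
# SHARED_BUS_GB_S is an assumed theoretical peak (800MT/s x 8 bytes);
# sustained bandwidth on real hardware will be lower.

SHARED_BUS_GB_S = 6.4

SOCKETS = 2
CORES_PER_SOCKET = 2   # Paxville is dual-core
THREADS_PER_CORE = 2   # HyperThreading turned on


def share_per_consumer(total_gb_s: float, consumers: int) -> float:
    """Bandwidth each consumer gets if the bus is divided evenly."""
    return total_gb_s / consumers


cores = SOCKETS * CORES_PER_SOCKET
logical_cpus = cores * THREADS_PER_CORE

for label, count in [("socket", SOCKETS), ("core", cores), ("logical CPU", logical_cpus)]:
    share = share_per_consumer(SHARED_BUS_GB_S, count)
    print(f"{count} x {label}: {share:.1f} GB/s each")
```

With eight logical processors all pulling on memory at once, each one is left with an eighth of whatever the bus can actually sustain.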
The historical lack of dedicated CPU bus connections to the memory controller on SMP Intel systems is one of the reasons why Athlon MP was able to do fairly well on introduction against SMP Pentium IIIs, CPUs which still shared the bus back then. Each Athlon MP had a dedicated bus connection to the memory controller.
With the introduction of Opteron by AMD in recent years, each CPU has its own memory controller right there on the CPU die, plus HyperTransport to allow the CPUs to access each other's memory controllers and other connected system resources over lightly shared bus links (shared only between a pair of CPUs, or a CPU and devices). That kind of topology, where bus and memory access traffic isn't all confined to one set of bus paths, is why Opteron generally beats Xeon in modern performance testing.
So while dual-core Opteron processors have the cores share a memory controller and HyperTransport link, that's as far as the sharing goes for the most part. Intel's competing Xeon platforms are sat sharing resources like nobody's business.
Where it could go right
It's not all doom and gloom, though. Think of a scenario where compute threads rarely touch system memory, doing most of their work on the CPU with small working sets, and you've got yourself something that Xeon should do well at. While those compute threads would have to be HyperThreading friendly for HT to be a performance win, Intel has spent good time making sure HT gets focus from application developers.
So if you're compute bound and you don't hit system memory, and your threads are HT-friendly, Paxville Xeon DP could be quite nice. But, err, that's about it really.
Where it could go wrong
If your threads are I/O bound in any way, you can hit problems (all threads touch the MCH, then there's a 266MiB/sec bus link to the I/O processors to cross, then the data hits disks or network hardware). If you're memory subsystem bound in any way, especially on a majority of compute threads, performance is likely gone.
There's just too much resource sharing for it to all conceivably work well, especially compared to Opteron. I can foresee many a scenario where dual-core Opteron will give Paxville Xeon DP a beating. Indeed, there's a bunch of published results that confirm that's the case, but I'll leave the final conclusion until I've done the HEXUS testing on our Lindenhurst system with Paxville DP at the helm.
Paxville MP and the future
Paxville MP has the potential to be a bit more interesting, with support for Intel's virtualisation technologies in hardware, something that Lindenhurst and Paxville DP don't support. Paxville MP will take the Xeon 7000 moniker with a range of models, and the core logic required to make Paxville MP really work shows up in 2006 with E8501.
Dempsey is Xeon DP with a 266MHz bus interface (1066MHz effective), FB-DIMM support and virtualisation. Tulsa is Xeon MP on 65nm with a fat L3 cache. We'll cover both Dempsey and Tulsa in the new year when they get closer to release.
Lindenhurst Summary
Xeon continues down a path long beaten by Intel, with the microprocessor behemoth neglecting to employ some techniques that we've seen AMD use with Opteron and which common sense tells us might improve performance, especially where you have a large number of possible compute threads competing for access to resources. Make those resources plentiful if there's not much to go round, right?
We'll see how Lindenhurst and Paxville DP do in the near future with a look at the 2.8GHz CPUs, and the 8 apparent CPUs they present, when they take on a pair of dual-core Opteron 280s in a HEXUS shootout. Running 64-bit Windows only, we'll fire a variety of tests at the hardware to see how it does. I'll also go over power consumption, DBS and a whole bunch of other good stuff that further defines the platform and how it'll be used in the enterprise. Keep an eye out for that. It's been a while since I got my hands dirty in this arena and I'm looking forward to it.