AMD Phenom quad-core demystified
Two ways to skin a quad-core cat
AMD and Intel currently have divergent views on how to architect a quad-core desktop processor. Intel's quad-core parts - for desktop systems as well as servers - are based on a pair of dual-core chips put together on a single substrate - an arrangement know as a multi-chip module (MCM). The quad-core processors are equipped with either 8MiB or 12MiB L2 cache - 4MiB/6MiB per chip - and accessed via a common front-side bus (FSB) whose bandwidth is currently limited to 10.6GiB/s (1333MHz).
Cache contention and access latency can be problematic with four cores potentially thrashing away and requiring data to be streamed from main memory through a discrete memory-controller hub (MCH) on the motherboard, via the FSB. However MCM has benefits of its own. First, the time to market is reduced and that's contributed to Intel's one-year lead over AMD in desktop quad-core x86 processors. In addition, each dual-core assembly is smaller that a quad-core, so yields are improved and costs reduced.
AMD on the other hand reckons that 'native' quad-core is the way to go - using a monolithic piece of silicon to house all four cores, cache, and, of course, integrated memory-controller. Let's take a gander at how it looks in a high-level overview.
Overview
The Phenom quad-core processors are based on a 65nm SOI manufacturing process. Four cores are situated on a single piece of silicon. Each core has its own 512KiB of L2 cache and all four cores share 2MiB of L3 cache Cores can directly communicate with one another and the memory interface via an integrated crossbar switch. The processor hooks up to the system via AMD's Direct Connect Architecture through an upgraded connection - HyperTransport 3.0.
The Phenom retains AMD's integrated memory-controller, albeit with significant performance tweaks, and it now officially interfaces with DDR2-1066 system memory. AMD has also introduced a number of other architectural tweaks to boost performance on a clock-for-clock basis when compared to K8.Furthermore, it's introduced a number of power-saving optimisations. Now, let's look at it in a little more detail.
Improving performance
Unlike Intel, which introduced a wider, four-issue core with its Core microarchitecture, AMD has stuck with the tried-and-tested three-issue width of the K8. Phenom retains K8's 12-stage pipeline, too.
But there are other significant changes, as you'd expect given that quad-core Phenom has reduced clock speeds compared to K8 even though Intel's Core microarchitecture has dominated K8 on a clock-for-clock basis.
AMD Wide Floating Point Accelerator is the name given to the revised 128-bit-wide floating-point unit. This should see a significant increase in floating-point throughput as 128-bit instructions no longer need to be split into two 64-bit operations and processed separately. In particular, any math-based application is expected to enjoy big gains.
While Phenom maintains independent L2 caches - 512KiB for each of the four cores - these have been joined by a 2MiB L3 cache that is shared between all four cores. Dubbed AMD Balanced Smart Cache, this acts as a store for data evicted from the cores' L2 caches, as well as for pre-fetched data that is likely to be needed by more than one core.
Intel has gone down the route of adding swathes of extra L2 cache on its Penryn-based CPUs, so AMD's total of 4MiB chip-based cache for Phenom may seem meagre compared to the 12MiB of the Intel Core 2 Extreme QX9650 12MiB.
However, AMD's architecture is such that it is relatively performance-inelastic much above 4MiB. As a result it's not wasted transistor-budgets on bloated cache levels when the design would provide little or no performance increase.
As a further nod to efficiency, Phenom can load 32-byte instructions from its L1 caches in to the buffers, up from the K8's 16-byte pre-fetch. Further, Phenom has a more-efficient branch-prediction unit for better performance in applications such as Visual C++ and Java. Phenom also features an enhanced out-of-order load engine for more-efficient process of instructions, as well.
Together, this all delivers better performance-per-clock than Athlon 64 X2 processors. Even in a desktop environment, Phenom has support for 48-bit memory addressing, opening up the possibility of 256TiB of system memory but that benefit is far more applicable to the sister Barcelona server processor, we reckon.
Saving power
Reduced power consumption is the claimed result of several new techniques used with the desktop Phenom core.
Independent Dynamic Core Technology (IDCT) allows the clock-speed and multiplier of individual cores to be varied depending on load. On the previous-generation Athlon 64 X2 processors, the cores' clock speeds are linked together. Thus, if one core is under load, the others have to increase their clock speeds, too, and that wastes power.
With Phenom, each core is equipped with its own phase-locked loop (PLL), allowing clock frequencies to be scaled independently and reducing power consumption.
The memory-controller is also able to vary its clock speed - up to 1.8GHz - depending on load, rather than run at full core speed.
AMD Cool'n'Quiet 2.0 adds additional gating of transistors, allowing unused parts of the processor to be powered down, although not to the extent that an entire core can be switched off.
In addition, on mainboards that support two independent power planes - AM2+, Phenom offers Dual Dynamic Power Management. This allows the processor cores and memory-controller to run at different voltages.
And, no, Dual Dynamic Power Management does not counteract the power-savings of Independent Dynamic Core Technology.
How so? Well, power consumption scales (roughly) linearly with clock speed and squared with voltage, so there are still power savings from the reduced frequency of the different cores even if the voltages of all cores stay the same.
Memory-controller enhancements
The Phenom's dual-channel memory-controller is designed for DDR2-1066, offering a potential 17GiB/s bandwidth from main memory. Remember, though, that memory-access does not have to traverse a discrete MCH; the controller is integrated right on to the die. AMD's 45nm-based Phenoms will feature an integrated DDR3 controller.
AMD has made some changes to Phenom's memory-controller to improve efficiency. These it jointly refers to as AMD Memory Optimizer Technology and claims increased memory bandwidth by up to 50 per cent compared to a DDR2 K8 processor of the same speed.
Help in masking the latency caused by having to access data from memory comes from improved core and DRAM pre-fetchers that predict what data will be needed by the processor based on patterns of access - and pull it into cache.
Write-bursting is used to minimise the penalty for switching the memory-controller between read and write operations. This sees write requests stored in a buffer and then all carried out in sequence once the buffer is full. Unfortunately, it is unclear if this buffer can be written to when requests are made to read buffered data or whether, instead, data has to be written to memory and accessed from there.
The introduction of memory-controller un-ganging - letting the memory-controller act as two separate, independent 64-bit controllers, one per channel - also helps improves throughput in certain scenarios, especially multi-tasking. This allows two 64-bit read or store operations to be carried out simultaneously, whereas K8 treats all operations as 128-bit, potentially wasting resources.
Phenom, AMD says, has improved support for mismatched DIMMs, offering users the opportunity to run modules of different speeds and timings without sacrificing significant bandwidth.
Finally, Phenom promises improved paging algorithms and larger memory buffers to help optimise the memory-controller for DDR2 data rates.