Core improvements
AMD's K8 processor core has remained almost unchanged since its introduction in 2003. With Barcelona, though, AMD is introducing a range of changes, largely aimed at lowering power consumption and increasing performance.
Saving power
AMD claims that several new techniques used in Barcelona reduce power consumption.
Independent Dynamic Core Technology allows the clock speed of individual cores to be varied depending on load. On Athlon 64 X2 processors, the two cores' clock speeds are linked together. So, if one core is under load, the other has to increase its clock speed, too, and that wastes power.
With Barcelona, each core is equipped with its own phase-locked loop (PLL), allowing clock frequencies to be scaled independently and reducing power consumption.
The memory controller is also able to vary its clock speed depending on load, rather than running at full core speed. In the 2GHz CPU, the memory controller runs at frequencies up to 1.8GHz.
AMD CoolCore Technology adds further gating of transistors, allowing unused parts of the processor to be powered down, although not to the extent that an entire core can be switched off.
In addition, on mainboards that support two independent power planes, Barcelona offers Dual Dynamic Power Management. This allows the processor cores and memory controller to run at different voltages, though the voltages of all four cores are dictated by the core with the greatest load.
And, no, Dual Dynamic Power Management does not counteract the power-savings of Independent Dynamic Core Technology.
How so? Well, power consumption scales (roughly) linearly with clock speed and with the square of voltage, so there are still power savings from the reduced frequency of the different cores even if the voltages of all cores stay the same.
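As a rough illustration of that relationship, the sketch below uses the textbook approximation that dynamic power is proportional to clock frequency multiplied by voltage squared. The function name, the 1.2V figure and the per-core clock speeds are assumptions made up for the example, not AMD specifications, and static (leakage) power is ignored.

```python
def relative_dynamic_power(freq_ghz, voltage, ref_freq_ghz=2.0, ref_voltage=1.2):
    """Dynamic power relative to a reference operating point (P ~ f * V^2)."""
    return (freq_ghz / ref_freq_ghz) * (voltage / ref_voltage) ** 2

# Hypothetical scenario: one busy core at 2.0GHz and three lightly loaded
# cores at 1.0GHz, all held at the same voltage by the busiest core.
core_freqs = [2.0, 1.0, 1.0, 1.0]
shared_voltage = 1.2

total = sum(relative_dynamic_power(f, shared_voltage) for f in core_freqs)
all_full_speed = 4 * relative_dynamic_power(2.0, shared_voltage)
ratio = total / all_full_speed

print(f"Dynamic power vs. all cores at full speed: {ratio:.1%}")
```

Under these assumed figures the four cores draw 62.5% of the dynamic power they would at full speed, even though the voltage never drops.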
Improving performance
Unlike Intel, which introduced a wider, four-issue core with its Core microarchitecture, AMD has stuck with the tried and tested three-issue width of the K8.
But there are other significant changes, as you'd expect given that Barcelona has reduced clock speeds compared to K8 and Intel's Core microarchitecture has dominated K8 on a clock-for-clock basis.
AMD Wide Floating Point Accelerator is the name given to the revised 128-bit-wide floating-point unit. This should see a significant increase in floating-point throughput as 128-bit instructions no longer need to be split into two 64-bit operations and processed separately. In particular, scientific workloads and high-performance computing are expected to enjoy big gains.
While Barcelona maintains independent L2 caches - 512KiB for each core - these have been joined by a 2MiB L3 cache that is shared between all four cores. Dubbed AMD Balanced Smart Cache, this acts as a store for data evicted from the cores' L2 caches, as well as for pre-fetched data that is likely to be needed by more than one core.
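To make the eviction behaviour concrete, here is a minimal toy model of per-core L2 caches that spill evicted lines into a shared L3. The class name, the tiny cache sizes and the least-recently-used replacement policy are all assumptions for illustration; the model also ignores the pre-fetched shared data mentioned above and simply moves a line back to L2 on an L3 hit, which may not match what the real hardware does.

```python
from collections import OrderedDict

class VictimCacheModel:
    """Toy model: per-core L2 caches whose evicted lines spill into a shared L3."""

    def __init__(self, cores=4, l2_lines=2, l3_lines=4):
        self.l2 = [OrderedDict() for _ in range(cores)]
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, core, addr):
        """Return where the line was found: 'L2', 'L3' or 'memory'."""
        l2 = self.l2[core]
        if addr in l2:
            l2.move_to_end(addr)      # refresh its LRU position
            return "L2"
        if addr in self.l3:
            del self.l3[addr]         # simplification: line moves back to L2
            source = "L3"
        else:
            source = "memory"
        self._fill_l2(core, addr)
        return source

    def _fill_l2(self, core, addr):
        l2 = self.l2[core]
        l2[addr] = True
        if len(l2) > self.l2_lines:
            victim, _ = l2.popitem(last=False)   # evict least recently used
            self.l3[victim] = True               # victim lands in shared L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)

cache = VictimCacheModel()
# Core 0 touches three lines, overflowing its two-line L2, then revisits the first.
results = [cache.access(0, a) for a in (0x100, 0x140, 0x180, 0x100)]
# Core 1 then asks for a line that core 0 evicted.
shared = cache.access(1, 0x140)
print(results, shared)
```

The last two lookups are served from L3 rather than memory, including the one made by a different core, which is the benefit of making the L3 shared.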
Virtual gains
Virtualisation is one of the key buzz-words in corporate IT at the moment and with good cause since it can result in a considerable cut-back in the number of servers required. This consolidation results in reduced power usage and lower costs for hardware and maintenance, along with increased flexibility and system utilisation.
Both AMD and Intel have had some degree of hardware virtualisation-support in previous-generation products but Barcelona is the first CPU to offer Rapid Virtualization Indexing - something that AMD no doubt hopes will help it win back some market share in the server arena.
Assuming there's software support, Rapid Virtualization Indexing allows virtual machine memory-lookups to be carried out in hardware. At present, the translation between virtual and physical memory has to be carried out in software by the hypervisor, the virtual-machine monitor program.
This change should significantly reduce the overhead placed on the system by the hypervisor, giving improved performance for virtual machines through a combination of faster translations and reduced CPU load from the hypervisor.
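The difference between the two approaches can be sketched with a deliberately simplified model. Here each page table is a single-level Python dictionary, whereas real page tables are multi-level structures walked by the processor; the mappings and function names are hypothetical. The software route is modelled as a combined table that the hypervisor has to precompute (commonly known as a shadow page table), the hardware route as a two-stage lookup.

```python
PAGE_SIZE = 4096

def translate(page_table, addr):
    """Map an address through a single-level page table (page -> frame)."""
    page, offset = divmod(addr, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset

# Hypothetical mappings, chosen only for illustration.
guest_page_table = {0: 5, 1: 7}      # guest virtual page -> guest physical page
nested_page_table = {5: 42, 7: 13}   # guest physical page -> host physical page

def nested_translate(guest_virtual):
    """Two-stage lookup, standing in for what the hardware would perform."""
    guest_physical = translate(guest_page_table, guest_virtual)
    return translate(nested_page_table, guest_physical)

def build_shadow_table():
    """Software alternative: the hypervisor precomputes the combined mapping
    and has to redo the work whenever the guest edits its page table."""
    return {vpage: nested_page_table[gpage]
            for vpage, gpage in guest_page_table.items()}

addr = 1 * PAGE_SIZE + 0x10
host_via_hardware = nested_translate(addr)
host_via_shadow = translate(build_shadow_table(), addr)
print(hex(host_via_hardware), hex(host_via_shadow))
```

Both routes arrive at the same host address; the saving comes from the hypervisor no longer having to maintain the combined table itself.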
Memory-controller enhancements
AMD has made some changes to Barcelona's memory-controller to improve efficiency. It refers to these collectively as AMD Memory Optimizer Technology and claims that they increase memory bandwidth by up to 50% compared to a DDR2 K8 processor of the same speed.
Improved core and DRAM pre-fetchers, which predict what data the processor will need based on patterns of access and pull it into cache, help to mask the latency caused by having to access data from memory.
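AMD has not published the prediction logic, so the following is only a generic sketch of one common pattern-based approach, a stride pre-fetcher, and not a description of Barcelona's implementation. The class and its behaviour are assumptions: once two consecutive accesses are the same distance apart, it predicts that the next access will continue the pattern.

```python
class StridePrefetcher:
    """Generic stride detector: predicts the next address once the gap
    between consecutive accesses has repeated."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Record an access; return an address to pre-fetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride   # pattern confirmed, fetch ahead
            self.last_stride = stride
        self.last_addr = addr
        return prediction

prefetcher = StridePrefetcher()
# Four accesses 64 bytes apart, followed by an unrelated address.
predictions = [prefetcher.observe(a) for a in (0, 64, 128, 192, 1000)]
print(predictions)
```

The pre-fetcher stays silent until the 64-byte pattern has repeated, then requests the next line ahead of time, and falls silent again when the pattern breaks.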
Write-bursting is used to minimise the penalty for switching the memory-controller between read and write operations. This sees write requests stored in a buffer and then all carried out in sequence once the buffer is full. Unfortunately, it is unclear if this buffer can be read from when requests are made to read buffered data or whether, instead, data has to be written to memory and accessed from there.
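The effect on bus turnarounds can be shown with a small model. The class, the four-entry buffer and the way turnarounds are counted are assumptions for illustration only. Because it is unclear whether reads can be served from the buffer, the model reads from memory alone and the example never reads an address that is still buffered.

```python
class WriteBurstController:
    """Toy memory controller that buffers writes and issues them in bursts."""

    def __init__(self, buffer_size=4):
        self.buffer_size = buffer_size
        self.pending = []        # buffered (address, value) writes
        self.memory = {}
        self.mode = "read"
        self.turnarounds = 0     # switches between read and write mode

    def _set_mode(self, mode):
        if self.mode != mode:
            self.turnarounds += 1
            self.mode = mode

    def write(self, addr, value):
        self.pending.append((addr, value))
        if len(self.pending) >= self.buffer_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self._set_mode("write")
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()

    def read(self, addr):
        self._set_mode("read")
        return self.memory.get(addr)

def run(buffer_size):
    """Interleave four writes with four reads of unrelated addresses."""
    ctrl = WriteBurstController(buffer_size)
    for i in range(4):
        ctrl.write(i, i * 10)
        ctrl.read(100 + i)
    return ctrl.turnarounds

unbuffered = run(1)   # every write goes straight out
bursting = run(4)     # writes held back until four are queued
print(unbuffered, bursting)
```

With the same mix of requests, the bursting configuration switches direction twice where the unbuffered one switches eight times.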
The introduction of memory-controller un-ganging - letting the memory-controller act as two separate 64-bit controllers, one per channel - also helps improve throughput. This allows two 64-bit read or store operations to be carried out simultaneously, whereas K8 treats all operations as 128-bit, potentially wasting resources.
However, this does have a disadvantage. Un-ganging the memory-controller results in the loss of double-bit error-correction. Because of this, AMD has made un-ganging optional - those using Barcelona for mission-critical applications will want to leave it off.
Finally, Barcelona promises improved paging algorithms and larger memory buffers to help optimise the memory-controller for DDR2 data rates.