x86-64 III
As far as physical implementations of x86-64 go, we've known what Opteron would be for a long time now. As a much simplified breakdown, the CPU is essentially the original K7 core with two extra pipeline stages, a much improved branch predictor to offset the penalty of a pipeline stall on a branch misprediction, 1MB of L2 cache memory, an on-board memory controller supporting two channels of DDR memory, and the x86-64 implementation details we discussed earlier. Let's talk about each in turn. We'll leave out the consumer implementations, since they're far more confusing and, thankfully, outside the remit of this article.
Extra pipeline stages
With 12 pipeline stages versus 10 on the current K7 and 20 on current NetBurst CPUs like the Pentium 4 and Xeon, it's worth mentioning what that means for performance. It's well known that the reason NetBurst can scale so well in terms of final CPU frequency is that its pipeline is so long. A longer pipeline means less work done per clock cycle, giving access to more headroom in terms of clock speed, but also large speed penalties should things not execute optimally.
So with AMD's desire to push clock speeds past the current K7 limit of around 2.2GHz, lengthening the pipeline by a couple of stages was a fair tradeoff in pursuit of that goal. To limit the performance penalty of pipeline stalls caused by the extra stages, and the correspondingly lower IPC, they improved the CPU's branch predictor. The branch predictor is responsible for essentially guessing which way conditional code will go next on the processor. Modern application code is heavily condition based: the software makes choices about what to do based on the data fed to it. Down at the processor level, it pays to make an intelligent guess at which instructions will execute next, so the processor can fetch them ahead of time and keep the pipeline full of useful work.
The branch predictor improvements mean the hardware guesses right more of the time, so the pipeline pays the (now larger) misprediction penalty less often.
Longer pipeline, lower IPC (without taking the new cache into consideration), but higher clocks and, hopefully, less time lost to branch mispredictions overall than before.
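To make that concrete, here's a minimal C sketch (our own illustration, not anything from AMD) of the sort of code a branch predictor lives or dies on. The conditional branch inside the loop is essentially a coin flip when the data is random and almost perfectly predictable once the data is sorted; the instruction count is identical in both cases, but on a deeply pipelined CPU the running times can differ noticeably:

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)

    /* Counts elements above a threshold. With random data the branch below
     * is taken roughly half the time and the predictor is frequently wrong;
     * with sorted data the outcomes form two long runs, which any modern
     * predictor handles almost perfectly. */
    long count_above(const int *data, int n, int threshold)
    {
        long count = 0;
        for (int i = 0; i < n; i++) {
            if (data[i] > threshold)    /* the conditional branch being predicted */
                count++;
        }
        return count;
    }

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        int *data = malloc(N * sizeof(int));
        if (!data)
            return 1;
        for (int i = 0; i < N; i++)
            data[i] = rand() & 0xFF;          /* values 0..255 */

        printf("random: %ld\n", count_above(data, N, 128));  /* poorly predicted */
        qsort(data, N, sizeof(int), cmp_int);
        printf("sorted: %ld\n", count_above(data, N, 128));  /* well predicted */

        free(data);
        return 0;
    }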
Cache enhancements
The L1 setup on Opteron is identical to K7: a 64KB, 2-way set associative data cache with a 64-byte cache line size, and an instruction cache of the same size and layout, for a total of 128KB of split L1 cache.
The L2 setup on Opteron is exactly the same as on the outgoing K7 (Athlon XP 'Barton'), just doubled in size and with lower latency. It keeps the same 64-byte cache line size (the unit of data transferred between the cache and memory in one go) and 16-way set associativity, but grows from 512KB on the 'Barton' XP to 1024KB (1MB).
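As a rough illustration of what the 64-byte line size means in practice, the C sketch below (our own example, with an arbitrary buffer size) touches just one byte in every 64. Each access still pulls an entire line into the cache, so the memory traffic is the same as reading the whole buffer:

    #include <stdio.h>
    #include <stdlib.h>

    #define CACHE_LINE  64                /* Opteron L1/L2 line size in bytes */
    #define ARRAY_BYTES (1024 * 1024)     /* 1MB, the size of Opteron's L2 */

    int main(void)
    {
        unsigned char *buf = calloc(1, ARRAY_BYTES);
        if (!buf)
            return 1;

        /* One byte per 64-byte stride still causes a full line fill each
         * time: 1MB / 64B = 16384 line fills, the same traffic as a
         * sequential read of the whole buffer. */
        unsigned long sum = 0;
        for (size_t i = 0; i < ARRAY_BYTES; i += CACHE_LINE)
            sum += buf[i];

        printf("touched %d cache lines, sum=%lu\n", ARRAY_BYTES / CACHE_LINE, sum);
        free(buf);
        return 0;
    }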
Operating in 32-bit mode, Opteron's 1MB L2 is a straight doubling of the 'Barton' cache setup. Operating in 64-bit mode, however, pointers and memory addresses are 64 bits wide instead of 32, so the cache's effective capacity for that kind of data is roughly halved. Addresses and pointers are usually only a fraction of what sits in cache memory, especially in L2, but it's an effective halving all the same. So it's not a true doubling of cache size in a performance sense, even though it is double the physical size. The cache also has ECC checking.
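A small C example makes the 64-bit point clearer. The struct below is a hypothetical pointer-heavy node of the kind found in linked lists and trees, not anything from AMD's documentation; built as 32-bit code it occupies 12 bytes, built for x86-64 it grows to 24, so fewer nodes fit per 64-byte cache line and in the 1MB L2:

    #include <stdio.h>

    /* A typical pointer-heavy node. In 32-bit code the two pointers are
     * 4 bytes each and the struct is 12 bytes; in 64-bit code they are
     * 8 bytes each and the struct (with padding) is 24 bytes. */
    struct node {
        struct node *next;
        struct node *prev;
        int          value;
    };

    int main(void)
    {
        printf("sizeof(void *)      = %zu\n", sizeof(void *));
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        return 0;
    }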
Along with the size increase and lower latency (bringing with it an effective bandwidth increase), the cache layout has improvements to its translation lookaside buffers (TLBs). Part of the memory management unit, a TLB caches the mapping between a virtual address and a physical memory address for recently and frequently used memory pages. With those translations already on hand, the processor can quickly locate data that will more than likely end up in cache memory in the near future, which helps hardware prefetch do its job. x86-64 also supports software-controlled prefetch via CPU instructions, letting the software hint at data that's likely to be used soon so the CPU can prefetch it effectively. And in the event of a cache miss, a TLB hit on a previously accessed page means data can be pulled from memory more quickly than if the processor had to walk the page table in memory first.
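To illustrate the software-controlled side, the sketch below uses GCC's __builtin_prefetch, which compiles down to the x86 prefetch instructions; the function, the loop and the distance of eight iterations ahead are assumptions made for the example, not anything mandated by the architecture:

    #include <stddef.h>

    /* Scales one array into another, hinting at data eight iterations
     * ahead of the current load. __builtin_prefetch is a GCC/Clang
     * builtin; the second argument marks the access as a read, the
     * third is a temporal-locality hint. */
    void scale_array(float *dst, const float *src, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&src[i + 8], 0, 1);
            dst[i] = src[i] * k;
        }
    }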
On-board memory controller
The decision to integrate the core of the traditional northbridge functions onto the die of the new processor looked, at first glance, like a bad one. Despite the obvious advantage of much lower latency when accessing memory, tying the controller to the CPU can cause problems when matching it with an available memory technology. As newer memory technologies appear, a transition to them won't require a new motherboard, as happens now; rather, you'll require a new processor, since that's where the memory controller now resides. AMD thought ahead, with the result that Opteron's memory controller may be disabled by any host chipset that provides an off-CPU memory controller, but that negates any advantage of having it on the CPU in the first place.
However, now that processors are in the hands of users and the memory-bound performance of Opteron has been evaluated, we can clearly see that the decision was a good one. Much of the performance of current Intel desktop solutions comes from the memory controller on chipsets like Canterwood. It's an extremely low latency controller on i875P, so with a similarly performing controller on the CPU die itself, expect memory-bound performance on similarly clocked Athlon XP and Opteron systems to favour the Opteron.
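Memory latency of this sort is usually measured with a pointer-chasing loop, where every load depends on the one before it so nothing can be overlapped; it's exactly this round trip that an on-die controller shortens. The C sketch below is a simplified version of the idea (buffer size, stride and iteration count are arbitrary, and a real benchmark would randomise the chain to defeat the hardware prefetchers):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ENTRIES (4 * 1024 * 1024)   /* tens of MB of indices, well past the 1MB L2 */
    #define HOPS    (10 * 1000 * 1000)

    int main(void)
    {
        size_t *chain = malloc(ENTRIES * sizeof(size_t));
        if (!chain)
            return 1;

        /* Build a strided cycle through the buffer. */
        for (size_t i = 0; i < ENTRIES; i++)
            chain[i] = (i + 1009) % ENTRIES;

        size_t idx = 0;
        clock_t start = clock();
        for (long i = 0; i < HOPS; i++)
            idx = chain[idx];           /* serially dependent loads */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        printf("%.1f ns per load (idx=%zu)\n", secs * 1e9 / HOPS, idx);
        free(chain);
        return 0;
    }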
The Opteron's controller is dual channel with no option of running in single channel mode, so memory must be installed in pairs. With a 144-bit memory bus, that means 72 bits per channel: 64 bits of data plus 8 bits of ECC. The Opteron also requires registered memory, so paired, registered ECC memory is your only choice in an Opteron system.
With the Opteron linked to other system components by the high bandwidth, high speed HyperTransport bus, NUMA becomes an integral part of an Opteron system, especially in a multiprocessor setup. NUMA, non-uniform memory access, lets any process running on a NUMA-aware Opteron system access, at relatively low latency over HyperTransport, the system memory attached to any other Opteron processor, with the memory attached to the local processor being the quickest to reach.
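For a flavour of what 'NUMA aware' means in software, here's a minimal sketch using Linux's libnuma (an assumption about the host operating system, not something specific to Opteron), which lets a program place an allocation on a chosen node, that is, in the memory attached to a particular processor:

    #include <numa.h>      /* Linux libnuma; link with -lnuma */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        size_t size = 1 << 20;                      /* 1MB */
        void *local = numa_alloc_onnode(size, 0);   /* memory on node 0 */
        if (!local)
            return 1;

        memset(local, 0, size);                     /* touch it so the pages are placed */
        printf("allocated %zu bytes on node 0 of %d nodes\n",
               size, numa_max_node() + 1);

        numa_free(local, size);
        return 0;
    }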
A quick page on HyperTransport and how it relates to an Opteron system, and we're done with the x86-64 specifics and can look at a real life implementation.