What's new?
64-bit work is fine in its place, but for a single-CPU system aimed at both the workstation and high-end consumer markets, it's prudent to discuss just how the new Athlon 64 FX improves on the Barton range of CPUs. We'll cover some of the more notable points below, and explain how and why it should do well in current 32-bit applications.
Integrated DDR DRAM memory controller
Note the small amount of space taken up on the bottom-left of the above picture? That's the integrated memory controller, right on the CPU. The traditional model is to have the CPU connected to a chipset's North Bridge, which in turn communicates with system memory. The current XP3200+ Barton's interconnect to various chipset North Bridges is limited to 400MHz. One can, of course, increase the FSB and the speed of the interconnect through manual changes in the BIOS, but one cannot get away from the fact that memory access has to be offloaded to a discrete hub. What AMD has done is implement a 128-bit (2 x 64-bit channels) memory controller on the CPU itself. As it's located directly on the core, it runs at full core speed, so the faster the processor, the faster the 'FSB'. The memory controller interfaces with system RAM through a DRAM controller, which can be modified as new memory standards emerge. An integrated memory controller, it is reckoned, can reduce latency penalties by up to 25%, and takes away the need for chipset makers to get an efficient memory controller right the first time around; AMD has done the hard part for them. As the memory controller runs at full core speed, the DRAM controller runs at the CPU's speed divided by an integer, such that traditional PC1600, 2100, 2700 and 3200 speeds are catered for.
It's important to note that, depending upon the CPU, exact RAM frequencies may not always be obtainable. It just depends on how close the divisor falls to the required speed. For example, take a 1.8GHz Athlon 64 FX. If we require the use of PC3200 memory (200MHz), we have a divisor of 9 (1800/9), which fits in perfectly. PC2700 memory (166MHz), however, isn't in line with an integer divisor. The closest one can get is 163.64MHz (1800/11). Similarly, the closest setting for 133MHz memory is 128.57MHz (1800/14). If one increases the CPU's speed artificially and keeps the same divisor, the RAM speed rises accordingly. This all-new feature should be one of the Athlon64 FX's biggest performance weapons. Memory-intensive applications will just love the lower-latency access. The review Athlon 64 FX, running at 2.2GHz and using registered DDR400 memory (11 divisor, naturally), should be something of a monster performer.
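The divisor arithmetic above can be sketched in a few lines of Python. This is only an illustration under one stated assumption: that the controller picks the smallest integer divisor which keeps the memory at or below its rated clock. The names closest_divisor and memory_clock are our own, not anything from AMD.

```python
from fractions import Fraction
from math import ceil

# Nominal DDR memory clocks in MHz (the data rate is double these figures).
MEMORY_CLOCKS_MHZ = {
    "PC1600": Fraction(100),
    "PC2100": Fraction(400, 3),  # ~133.33MHz
    "PC2700": Fraction(500, 3),  # ~166.67MHz
    "PC3200": Fraction(200),
}


def closest_divisor(cpu_mhz, target_mhz):
    """Smallest integer divisor that keeps memory at or below its rated clock.

    Assumption (ours): the controller never runs memory faster than rated.
    """
    return ceil(Fraction(cpu_mhz) / target_mhz)


def memory_clock(cpu_mhz, target_mhz):
    """Return (divisor, resulting memory clock in MHz) for a given CPU clock."""
    divisor = closest_divisor(cpu_mhz, target_mhz)
    return divisor, float(Fraction(cpu_mhz) / divisor)


if __name__ == "__main__":
    for name, target in MEMORY_CLOCKS_MHZ.items():
        divisor, actual = memory_clock(1800, target)
        print(f"{name}: 1800/{divisor} = {actual:.2f}MHz")
```

For a 1.8GHz part this gives a divisor of 9 for PC3200, 11 for PC2700 and 14 for PC2100; calling memory_clock(2200, MEMORY_CLOCKS_MHZ["PC3200"]) gives the divisor of 11 used by the 2.2GHz review chip.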
Larger L2 cache
The size of the Athlon64 FX's die, measuring 193mm², is almost twice as large as the Barton's 101mm² and comfortably more than twice as large as the Thoroughbred's 84mm². It doesn't take a rocket scientist to work out where the bulk of the 105.9 million transistors has gone. On-die cache, running at the same speed as the CPU, and varying in latency depending upon how far it is away from the 'core', is an effective method of increasing performance. If the CPU is looking for some data and cannot find it on-die, it'll have to resort to going back to system memory. The relative slowness of system RAM is a performance inhibitor. Carrying 1024kb of L2 cache balloons the transistor count enormously, as you can see from the above picture, but it allows more of the frequently used data to be stored on-chip. CPUs use intelligent pre-fetching of data to load up on cache. Having more CPU cache is better, but there comes a point where manufacturing expense makes additional cache economically unfeasible.
ECC memory
The Athlon64 FX still carries the same 128kb L1 cache (64kb for instructions and 64kb for data) as the current K7-series, but in a nod towards mission-critical operations, it contains ECC (Error Correcting Code) protection for L1 cache data and DRAM. From enthusiasts' / gamers' points of view, this'll be the major stumbling block for 'cheap' performance. The Athlon64 FX requires the use of registered / ECC memory. It's a workstation Opteron 148 chip with a different, catchy name. High-speed ECC modules aren't easy to come by and, more importantly, won't be as cheap as premium non-ECC PC3200+ RAM. That's where the Clawhammer will fit in well; it doesn't need registered DIMMs.
Larger L2 TLBs
The L2 translation look-aside buffers (TLBs) have been increased from 256 entries to 512 entries. A TLB is a small on-chip list of address translations for the pages that have most recently been accessed in main memory. When the CPU has to go out to system memory, it first looks in the TLB for a reference to the required page. A quick look in the TLB, assuming the list contains the page address the CPU is after, shortens the time that it takes to get data back into the CPU. The Athlon64 FX's larger L2 TLB, therefore, holds references to more of the frequently accessed pages in main memory than the K7-series, which should help keep cache miss penalties down to a minimum.
Longer pipelines
One of the K7-series' characteristics was its 10-stage integer and 15-stage floating-point pipeline. This has been extended to 12 and 17 stages, respectively, for the Athlon64 FX. Longer pipelines allow a CPU to scale in MHz terms, but extra stages also present the unwanted side-effect of extra penalties for mispredicted branches, as the entire pipe needs to be flushed before it can be used again. Still, the 12-stage integer pipeline is far short of the current Pentium 4's 20-stage monster, so we doubt that 4GHz+ Athlon64 FXs are just around the corner. To remedy the performance disadvantages imposed by a longer pipeline, it's prudent to minimise mispredictions in the first instance. Here the Athlon64 FX contains an improved branch prediction unit. The better prediction unit should balance out the longer pipeline.
Silicon-On-Insulator Technology
The Opteron / Athlon64 FX processors are still manufactured on a 130-nanometer (0.13-micron) process, so no change there. However, current XP CPUs' transistors sit directly on silicon substrate, such that some electrons manage to escape out to the substrate. Silicon-On-Insulator (SOI) is exactly what it sounds like. A layer of silicon dioxide sits in between the transistor(s) and substrate. The advantage is in ensuring that no electrons can escape to the wafer underneath. It also reduces the thermal requirements for operation at a certain MHz; there's no need to account for electron wastage. AMD hopes that this will allow it to scale the Athlon64 comfortably past 2.5GHz in the next few months. It will need to if Prescott is all that Intel are currently implying it is. The SOI technology should help make the Athlon64 a little bit more efficient than the current Barton, and also help increase its instructions per clock cycle ratio, as will most of the factors listed on this page.
HyperTransport links
The processor directly communicates with system memory via a DRAM controller. AMD has decided to incorporate a number of system / CPU buses on the Opteron-based CPUs. By default, all Opterons have 3 HyperTransport buses, but the number of coherent links is decided by virtue of which Opteron we're referring to. SMP-capable CPUs have a coherent HT link to one another, as well as a standard 800MHz HyperTransport link to the chipset. The 4 or 8-way Opterons, though, will have 3 coherent HyperTransport links. For our FX-51, there's no need to have any coherent links, as it won't be used in anything other than a uniprocessor system, and it'll use a 16-bit 800MHz HT link (3.2GB/s in each direction) which will funnel all the high-speed storage and connectivity that we're used to seeing on modern motherboards. Some motherboard manufacturers may choose to knock down the link speed due to the probability of signal interference at high transport speeds.
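As a rough check on the 3.2GB/s figure, the peak one-way bandwidth of a HyperTransport link follows from its width and clock, given that HyperTransport transfers data on both edges of the clock. The function name ht_bandwidth_gb_s below is our own, and the result is a theoretical peak that ignores protocol overhead.

```python
def ht_bandwidth_gb_s(width_bits, clock_mhz):
    """Peak one-way bandwidth of a HyperTransport link in GB/s.

    HyperTransport is double-pumped: two transfers per clock cycle.
    This is a theoretical peak; protocol overhead is ignored.
    """
    transfers_per_second = clock_mhz * 1_000_000 * 2
    bytes_per_transfer = width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


if __name__ == "__main__":
    # The FX-51's link to the chipset: 16 bits wide at 800MHz.
    print(f"{ht_bandwidth_gb_s(16, 800):.1f}GB/s in each direction")
```

The same function shows what a motherboard maker gives up by knocking the link down: ht_bandwidth_gb_s(16, 600), for instance, comes out at 2.4GB/s. The 600MHz figure is just an example of a reduced setting, not one taken from any particular board.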
SSE2 support
A number of workstation-class programs have had code optimised for Intel's SSE2 (Streaming SIMD Extensions 2). 8 extra 128-bit registers (memory locations on the CPU) have been added to afford SIMD-based calculations (SSE2, etc) more room and therefore speed. One of the P4's chief speed gains arose from having software fully exploit the SSE2 SIMD instructions. AMD has levelled the playing field in this respect, and will be able to run it all in hardware.
Those are some of the more notable improvements in the Athlon64 FX core. Besides the performance-inhibiting slightly longer pipeline and ECC support, each of the above factors will help in ensuring that the Athlon64 FX does more work in each clock cycle than its predecessor, the Barton K7. How much more will most likely depend on just how memory-intensive the activity is. However, it's important to understand that not all applications will see a benefit on a clock-for-clock basis. MP3 crunching, for example, which doesn't need to resort to memory accesses, won't show a major improvement at all.
All this talk has left yours truly ready for some benchmarking action. We now know that it should be faster than the XP3200+, but by how much? And, more importantly, is it faster than the 3.2GHz Pentium 4 / Canterwood combination?