Penryn: Core 2 evolved
First up on the agenda was a discussion on the improvements that Intel's upcoming Penryn core will provide over the present Core 2 microarchitecture, currently represented by Merom, Conroe, and Clovertown/Kentsfield cores.
Penryn - a bit of background on why 45nm is lovely
HEXUS reported on Intel's progression to 45nm technology at the back end of February 2007.
45nm, why the need?
Serving as a quick recap, Intel announced that it was basing its evolution to the Core 2 microarchitecture, codenamed Penryn, on a 45nm manufacturing process that took advantage of a breakthrough in transistor design.
In a nutshell, the smaller process uses high-k metal gate transistors. The traditional silicon dioxide insulating layer, which sits between the gate and the silicon channel, is replaced with a hafnium-based high-k gate dielectric that allows for a thicker (and better-insulating) layer to be used. This, in turn, leads to lower electrical leakage; a crucial requirement with ultra-small manufacturing processes. The new metal gate replaces the traditional polysilicon version and provides for a stronger gate field, helping switching times.
The upshot? Considerably less current leakage and a faster switching time, which translate to a more energy-efficient design that will have an innate propensity to clock higher.
Penryn additions to Intel Core microarchitecture
A smaller manufacturing process isn't all that's new to Penryn, though. Let's trot out the old PDF foil and go through what's being bolted on to an already decent architecture.
First off, it's important to note that the Penryn is a complete family of processors, encompassing workstation/server, desktop, and mobile parts, so what's applicable to one sub-family is generally applicable across the board.
Penryn refers to the next evolution of dual- and quad-core CPUs that are run off the present LGA775 form factor.
Images courtesy of Intel.
Now, the Core microarchitecture's key performance-defining benefits are shown on the left-hand side. We've covered them in some detail previously.
Penryn's additions are shown on the right, so let's go through them and attempt to delineate their usefulness in improving performance.
Fast Radix-16 Divider
Penryn incorporates a new Radix-16 divider that works through a divide operation 4 bits at a time, compared to 2 bits at a time for the incumbent Conroe/Kentsfield. Division is pervasive across applications, used in both floating-point and integer calculations, so a divider that gets through its work in roughly half the steps adds some more juice to the CPU's computational speed.
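To put the 4-bits-versus-2-bits point in concrete terms, here's a minimal Python sketch of a digit-recurrence divider that retires a configurable number of quotient bits per iteration. It's a toy model, not Intel's design: the function name, the 32-bit width and the repeated-subtraction digit selection are all our own assumptions for illustration.

```python
def digit_recurrence_divide(dividend, divisor, width=32, bits_per_step=4):
    """Unsigned division retiring `bits_per_step` quotient bits per iteration.

    Returns (quotient, remainder, iterations). Illustrative model only,
    not a description of Intel's actual divider hardware.
    """
    if divisor <= 0:
        raise ValueError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend must fit in `width` bits")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")

    digit_mask = (1 << bits_per_step) - 1
    quotient = 0
    remainder = 0
    iterations = 0

    # Walk the dividend from its most significant digit to its least.
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        chunk = (dividend >> shift) & digit_mask
        remainder = (remainder << bits_per_step) | chunk

        # Select the quotient digit. Real hardware would do this in one go
        # with comparators or a lookup table; repeated subtraction simply
        # keeps the sketch short.
        digit = 0
        while remainder >= divisor:
            remainder -= divisor
            digit += 1

        quotient = (quotient << bits_per_step) | digit
        iterations += 1

    return quotient, remainder, iterations


if __name__ == "__main__":
    print(digit_recurrence_divide(1000, 7, bits_per_step=2))  # radix-4
    print(digit_recurrence_divide(1000, 7, bits_per_step=4))  # radix-16
```

Both calls return the same quotient and remainder, but the radix-16 version needs 8 iterations for a 32-bit operand where the radix-4 version needs 16, which is where the "twice as fast" claim comes from.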
Enhanced Intel Virtualisation Technology
More prevalent in the workstation/server community, Intel VT technology - where multiple, hardware-isolated partitions can run on the same machine - is boosted with a reduction in the time taken to transition between virtual machines on a purely hardware level. Intel quotes a boost of up to 75 per cent.
Larger cache sizes
Large amounts of on-chip cache are a good thing. The ability to load and locally store an application's working set is an effective, if transistor-costly, method of increasing performance, as on-chip cache speeds are an order of magnitude faster than accessing external memory on a regular basis.
In particular, dual-core Penryns will pack up to 6MiB of L2 cache and quad-core models up to 12MiB. In transistor terms that's around 840m for a QC part; it's just as well Intel is packing them into a space-saving 45nm process, then.
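For a rough feel of how much of that transistor budget goes on cache, here's a back-of-the-envelope Python sketch. It assumes a conventional six-transistor SRAM cell per bit and counts only the data array, ignoring tags, ECC and control logic; Intel hasn't given us that breakdown, so treat the numbers as an estimate and the function name as our own.

```python
MIB = 1024 * 1024  # bytes in one MiB


def sram_data_transistors(cache_mib, transistors_per_bit=6):
    """Transistors in the cache data array alone (tags and ECC ignored).

    transistors_per_bit=6 assumes a standard 6T SRAM cell.
    """
    bits = cache_mib * MIB * 8
    return bits * transistors_per_bit


if __name__ == "__main__":
    for label, size_mib in (("dual-core", 6), ("quad-core", 12)):
        count = sram_data_transistors(size_mib)
        print(f"{label}: {size_mib}MiB L2 -> about {count / 1e6:.0f}m transistors")
```

On those assumptions, 12MiB of L2 works out at roughly 600m transistors, so the lion's share of a quad-core part's budget is cache rather than execution logic.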
Split load cache enhancement
Cache is cache, right? However, the effectiveness of cache is directly related to just how well data can be crammed into it. Should the tags, which contain an index of what's in the cache, not align correctly with the cache line (too big, perhaps), transfers to the execution core can be an inefficient process. Penryn has a split-load cache, which, as the name suggests, is able to split the data and associated tags up to better fit into the cache's lines.
Higher bus and core speeds, heat
Just as adding more cache is an established method of increasing overall performance, fattening the FSB pipe, which delivers data from the system's memory to the processor, increases memory bandwidth and, ceteris paribus, performance.
Intel's raising the FSB speed to an effective 1600MHz, although that's only applicable to selected Xeon-based SKUs which already run at 1333MHz FSB. Selected desktop processors should see a hike to 1333MHz FSB, too, and will be officially supported on Intel's Bearlake motherboards. We'd expect Penryn-based processors to work in most present performance-oriented motherboards, including NVIDIA's nForce 680i SLI with, presumably, a simple BIOS update.
Thinking about it in a performance sense, Intel has to increase the FSB, solely because of the memory contention that a 'four-core' CPU with a shared FSB places on the system. Intel has been able to mask contention shortcomings with an intelligent architecture, but it's, frankly, an inelegant interface for a multi-core processor.
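To put numbers on that, here's a quick Python sketch of theoretical peak FSB bandwidth. It assumes the 64-bit-wide data bus used by current Core 2 platforms and, purely for illustration, an even four-way split between cores; real-world contention is messier than that, and the function name is ours.

```python
def fsb_peak_bandwidth_gbs(effective_mhz, bus_width_bits=64):
    """Theoretical peak FSB bandwidth in GB/s (1 GB = 10**9 bytes).

    effective_mhz is the quoted 'effective' FSB speed, e.g. 1333 or 1600.
    bus_width_bits=64 is an assumption about the data bus width.
    """
    bytes_per_transfer = bus_width_bits // 8
    return effective_mhz * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    for mhz in (1333, 1600):
        total = fsb_peak_bandwidth_gbs(mhz)
        print(f"{mhz}MHz FSB: {total:.1f}GB/s peak, "
              f"{total / 4:.1f}GB/s per core if four cores share it evenly")
```

Even at 1600MHz, four cores contending for a single shared bus are left with only a few GB/s apiece, which is why the FSB hike matters more for quad-core parts than for dual-core ones.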
Intel is being coy about the initial range of clockspeeds for Penryn, but we believe that they will debut - in the server/workstation and desktop markets - with a 3.46GHz maximum clock. Later revisions, of course, will see that pushed up towards 4GHz.
With respect to desktop SKUs, dual-core models are slated to consume 65W TDP, matching the incumbent line-up. Quad-core, sporting up to 12MiB of cache, will consume either 95W or 130W. Server parts will continue to ship at 50W/80W/120W, with the increased transistor count counteracted by the energy-efficient manufacturing process. Mobile parts, too, have been designed to fit into current thermal envelopes, and all Penryn SKUs will be a drop-in upgrade from present models.
SSE4 and Super Shuffle Engine
SSE4 was designed to be debuted with the Nehalem core (juicy information morsels for this on the following page). SSE4 adds a bunch of multimedia-related optimisations that will be manifested in a desktop environment by better media-encoding performance.
Super Shuffle Engine sounds like a Japanese-esque nomenclature for an advancement that adds a 128-bit-wide shuffle unit. In plain English it's useful for a number of imaging and video programs that use what are termed shuffle-like operations such as pack, shift and unpack. It'll be interesting to put this to the test.
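For anyone unsure what shuffle-like operations actually do, here's a small Python sketch that models a 128-bit register as sixteen 8-bit lanes. The function names are hypothetical and the semantics are simplified stand-ins for this class of operation, not a description of any particular instruction.

```python
LANES = 16  # a 128-bit register viewed as sixteen 8-bit lanes


def shuffle_bytes(source, control):
    """Build a new register where lane i takes source[control[i]]."""
    if len(source) != LANES or len(control) != LANES:
        raise ValueError("both operands must have 16 lanes")
    if any(not 0 <= index < LANES for index in control):
        raise ValueError("control indices must be in the range 0..15")
    return [source[index] for index in control]


def unpack_low(a, b):
    """Interleave the low eight lanes of a and b: a0, b0, a1, b1, ..."""
    result = []
    for x, y in zip(a[:LANES // 2], b[:LANES // 2]):
        result.extend((x, y))
    return result


def pack_saturate(words):
    """Narrow wider values to 8 bits each, clamping to the range 0..255."""
    return [min(max(word, 0), 255) for word in words]


if __name__ == "__main__":
    pixels = list(range(16))
    print(shuffle_bytes(pixels, list(reversed(range(16)))))  # reverse the lanes
    print(unpack_low(pixels, [0] * 16))                      # widen by interleaving zeros
    print(pack_saturate([-5, 0, 128, 300]))                  # clamp back down to bytes
```

Image and video code leans on this sort of lane-juggling constantly, for instance when widening 8-bit pixels for arithmetic and narrowing the results again, so a unit that handles a full 128 bits in one pass should pay off there.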
Deep Power Down Technology and Enhanced Intel Dynamic Acceleration
The Intel Core 2 microarchitecture introduced enhanced power-saving states that gated the CPU down during idle periods. The DPDT is an extension that further pushes down energy requirements during, you guessed it, idle periods.
EIDA is an interesting inclusion. Should the current application be single-threaded, whereby there's no intrinsic advantage of having multiple cores working concurrently, EIDA pushes up the single-core frequency to above specifications. That could mean a 2.93GHz part auto-overclocking to, say, 3.2GHz. Sounds like a good bet for isolated gaming, where the majority of titles are still single-threaded.
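The idea can be sketched as a very simple policy in Python. This is our own hypothetical model using the example clocks above; Intel hasn't detailed the real trigger conditions, which presumably also take thermal and power headroom into account.

```python
def target_clock_ghz(busy_cores, rated_ghz=2.93, boosted_ghz=3.2):
    """Pick a clock for the busy core(s) under a hypothetical EIDA-style policy.

    With exactly one busy core, the idle cores' headroom is spent on a
    higher clock; otherwise the part runs at its rated speed.
    """
    if busy_cores < 0:
        raise ValueError("busy_cores cannot be negative")
    if busy_cores == 1:
        return boosted_ghz
    return rated_ghz


if __name__ == "__main__":
    for busy in (1, 2, 4):
        print(f"{busy} busy core(s): {target_clock_ghz(busy)}GHz")
```

The design choice worth noting is that the boost is conditional on the other cores sitting idle, so the chip as a whole should stay within the same thermal envelope.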
Summary
We've trotted out a number of enhancements that Penryn possesses over and above current dual- and quad-core Core 2-based CPUs, but, really, they're architectural bolt-ons that, on a clock-for-clock basis, will provide somewhere in the region of 20 to 30 per cent extra performance. There's nothing radically new here, just as we suspected, and Penryn constitutes a natural progression for Core 2. Widespread availability is scheduled for 2H 2007, so expect Penryn-powered boxes for Thanksgiving and Christmas. Does it have enough oomph to battle AMD's Barcelona? We'll find out.
Nehalem is up next and it packs in some shocks. Read on.