
NVIDIA details next-generation 'Fermi' GPU architecture

by Tarinder Sandhu on 18 January 2010, 09:31



What kind of card is it?

This article was first published on October 1st, 2009. Stay tuned: a full technical analysis will follow soon.

A graphics card?

Codenamed Fermi, presumably after the renowned Italian physicist, the new GPU underscores NVIDIA's shift towards a more compute-centric design that, on paper, is as much at home in the high-performance computing space as it is rendering pretty-looking pixels for games.

Here's a very high-level overview of the 'GT300' chip that will go on sale in a few months' time. On a fundamental level, Fermi packs in 512 CUDA processing cores arranged in 16 streaming multiprocessors (SMs) - the green rectangles - that each hold 32 execution cores. By comparison, GeForce GTX 285's core count is 240.
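
As a quick sanity check on those figures, the arithmetic can be sketched in a few lines of Python. The constant names are purely illustrative; the numbers are the ones quoted above.

```python
# Figures quoted in the article; the variable names are illustrative only.
SM_COUNT = 16          # streaming multiprocessors on Fermi
CORES_PER_SM = 32      # execution cores in each multiprocessor
GTX_285_CORES = 240    # GeForce GTX 285 core count

fermi_cores = SM_COUNT * CORES_PER_SM
ratio = fermi_cores / GTX_285_CORES

print(f"Fermi CUDA cores: {fermi_cores}")
print(f"Increase over GTX 285: {ratio:.2f}x")
```

Core counts alone say nothing about clock speeds, so the ratio is only a rough guide to relative throughput.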

Pure processing grunt is allied to a 384-bit memory interface - six 64-bit-wide channels - that connects to GDDR5 memory. The chip is fabricated on TSMC's 40nm process and packs in around 3bn transistors.

NVIDIA hasn't divulged clock speeds at GTC, but it's reasonably safe to assume that Fermi's memory bandwidth will be higher than that of AMD's Radeon HD 5870, thanks to the wider bus, and that compute power will be somewhere in the vicinity of AMD's best single-GPU card. NVIDIA continues to use split clock speeds for the front-end and shaders.
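
To put a rough number on that bandwidth assumption, here is a minimal sketch. The Radeon HD 5870 figures (256-bit bus, 4.8Gbps effective GDDR5 data rate per pin) are AMD's published specifications rather than anything stated in this article, and the helper name is hypothetical. As Fermi's memory clock is undisclosed, the sketch only works out the data rate at which a 384-bit bus would draw level.

```python
def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbps

FERMI_BUS_BITS = 6 * 64    # six 64-bit channels, as stated in the article
HD5870_BUS_BITS = 256      # assumption: AMD's published specification
HD5870_RATE_GBPS = 4.8     # assumption: effective GDDR5 data rate per pin

hd5870_bw = bandwidth_gb_s(HD5870_BUS_BITS, HD5870_RATE_GBPS)

# Data rate Fermi's memory would need in order to match the HD 5870.
break_even_rate = hd5870_bw / (FERMI_BUS_BITS / 8)

print(f"Fermi bus width: {FERMI_BUS_BITS} bits")
print(f"HD 5870 peak bandwidth: {hd5870_bw:.1f} GB/s")
print(f"Fermi break-even data rate: {break_even_rate:.1f} Gbps")
```

Under these assumptions, any GDDR5 running faster than the break-even rate would put Fermi ahead on paper.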

A compute card?

NVIDIA's recent company-wide strategy has been to actively promote its products in the high-performance computing (HPC) space, where margins are far healthier than for desktop and mobile parts. This is where you'd find cards such as the display-less Tesla C1060 - ostensibly a tweaked GeForce GT200 with far better software support - selling for thousands of dollars.

Fermi has been designed with the needs of HPC customers firmly in mind, then. The GPU supports the IEEE 754-2008 floating-point standard, including native support for the fused multiply-add (FMA) operation, and is able to process double-precision calculations at half single-precision speed; the incumbent GT200 manages double precision at only one-eighth of its 32-bit arithmetic throughput. AMD's Radeon HD 5870 supports the same standard as Fermi, too.
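
Why FMA matters is easiest to see with numbers. A fused multiply-add computes a*b+c with a single rounding at the end, whereas a separate multiply and add rounds twice. The sketch below emulates the fused result on a CPU using exact rational arithmetic; it illustrates the IEEE 754-2008 behaviour and is not code that runs on the GPU. The input values are chosen by hand to expose the difference.

```python
from fractions import Fraction

# Hand-picked doubles whose exact product, 1 - 2**-54, is not representable.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

# Separate multiply then add: the product is rounded to 1.0 before the add.
unfused = a * b + c

# Fused behaviour, emulated: compute exactly, then round once at the end.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print("multiply then add:", unfused)
print("fused multiply-add:", fused)
```

The unfused version loses the small term entirely, while the fused version keeps it, which is the kind of accuracy gain HPC codes care about.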

Looking towards HPC again, Fermi adds provision for ECC memory, 64KB of configurable L1 cache and shared memory per multiprocessor and a unified 768KB L2 cache, both under the banner of Parallel DataCache, and an updated GigaThread scheduler. On the present GT200 architecture, programs are executed sequentially, each taking up the entire GPU whilst it is computed, yet HPC-oriented kernels may not be large enough to 'fill' such a broad architecture.

Fermi's scheduling enables up to 16 kernels (programs) to run concurrently, keeping the GPU at near-maximum efficiency, says NVIDIA, and is complemented by faster context switching between different kinds of kernels - Fermi can only run multiple iterations of one type at a time. Faster switching will also help when running GPGPU and gaming code side by side.
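
The benefit of concurrent kernels can be illustrated with a toy model. This is not NVIDIA's GigaThread algorithm, whose details aren't public here; it is a simple greedy scheduler, and the workload of six small kernels is entirely hypothetical. Only the 16-multiprocessor count and the 16-kernel limit come from the article.

```python
import heapq

TOTAL_SMS = 16        # Fermi's multiprocessor count, per the article
MAX_CONCURRENT = 16   # concurrent-kernel limit, per the article

# Hypothetical workload: (multiprocessors the kernel can fill, run time in arbitrary units).
kernels = [(4, 10), (4, 10), (8, 10), (2, 5), (6, 5), (8, 5)]

def sequential_time(kernels):
    """One kernel owns the whole GPU at a time, so run times simply add up."""
    return sum(duration for _, duration in kernels)

def concurrent_time(kernels, total_sms=TOTAL_SMS, limit=MAX_CONCURRENT):
    """Greedy in-order launch: start each kernel as soon as enough SMs are free.

    Assumes no single kernel asks for more than total_sms.
    """
    now, free = 0, total_sms
    running = []  # min-heap of (finish_time, sms_held)
    for sms, duration in kernels:
        # Wait for earlier kernels to finish until this one fits.
        while sms > free or len(running) >= limit:
            now, released = heapq.heappop(running)
            free += released
        heapq.heappush(running, (now + duration, sms))
        free -= sms
    return max(finish for finish, _ in running)

def utilisation(kernels, elapsed, total_sms=TOTAL_SMS):
    """Fraction of available SM-time that did useful work."""
    work = sum(sms * duration for sms, duration in kernels)
    return work / (total_sms * elapsed)

seq = sequential_time(kernels)
con = concurrent_time(kernels)
print(f"sequential: {seq} units, utilisation {utilisation(kernels, seq):.0%}")
print(f"concurrent: {con} units, utilisation {utilisation(kernels, con):.0%}")
```

In this contrived case the small kernels pack neatly onto the chip, so the concurrent schedule finishes in a third of the time; real workloads will rarely line up so conveniently.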