
Review: BFG (NVIDIA) GeForce GTX 280: does it rock our world?

by Tarinder Sandhu on 16 June 2008, 14:01

Tags: GeForce GTX 280, NVIDIA (NASDAQ:NVDA), BFG Technologies

Quick Link: HEXUS.net/qanni


All the technical bits





Lots of concurrent threads are good, of course, but they only tell part of the story. Each of the GTX 280's 10 clusters has what NVIDIA terms a dual-quad texturing unit, and you see that represented by the eight brown TFs in the cluster picture.

Filter me, baby

As with the GeForce 9-series, each GTX 280 cluster can address and filter eight bilinear pixels per clock at eight-bit-integer (INT8) precision, or four 16-bit floating-point (FP16) pixels - useful for HDR - meaning FP16 runs at half speed.

ATI's Radeon HD 3870 can address and filter FP16 pixels at full speed, but its overall filtering throughput is hamstrung by a comparatively low texture-unit count.

Going back a generation, GeForce 8800 GTX, on each of its eight clusters, was also able to texture-filter eight pixels per clock - just like GeForce 9 and GTX 280 - but could only texture-address and bilinear-filter (INT8) four per clock. Each unit could also handle four FP16s. Kind of complicated, huh?

Bottom line: the matched addressing and texturing capabilities are carried over from the 9-series GPUs. But we'd have been happier seeing FP16 texturing being filtered at full speed.

Overall, though, NVIDIA wins the FP16 pixel-filtering battle. GTX 280 can do 40ppc, 9800 GTX 32ppc and ATI's Radeon HD 3870 a total of 16 - the same as its texture-unit count.

Adding it all up, GeForce GTX 280 beats out GeForce 9's filtering by having more clusters - 10 versus eight - meaning bilinear filtering of 80 pixels per clock compared to 64 per clock.
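The per-clock filtering figures above fall straight out of the cluster counts. A quick sketch (pixels per clock only, clock speeds ignored, and the HD 3870 modelled as a single block of 16 full-rate units):

```python
# Bilinear-filtering rates in pixels per clock, per the cluster maths
# above: INT8 at full rate, FP16 at half rate on NVIDIA parts.
def filter_rates(clusters, int8_per_cluster, fp16_half_rate=True):
    int8 = clusters * int8_per_cluster
    fp16 = int8 // 2 if fp16_half_rate else int8
    return int8, fp16

gtx280 = filter_rates(10, 8)                         # 10 clusters x 8
gf9800 = filter_rates(8, 8)                          # 8 clusters x 8
hd3870 = filter_rates(1, 16, fp16_half_rate=False)   # 16 units, full-rate FP16
```

Running it reproduces the 80/40, 64/32 and 16/16 INT8/FP16 splits quoted above.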

NVIDIA needs to strike a decent balance between the texturing and shading ability of the GPU and, comparing against previous unified architectures, skewing the ratio further towards shaders - 24 per cluster versus 16 - is probably the way to go.

The missing MUL

There was a little controversy over how the GeForce 8-series and 9-series GPUs aggregated floating-point power.

Using the basic maths building-blocks of MADD (multiply-add) and MULs (multiply), the GPUs were reckoned to be able to dual-issue both a MADD and MUL at the same time, meaning three FLOPS (MADD counts as two) per stream processor, per clock cycle.

It transpired that the MUL part of the deal was handled by another part of the architecture, so, strictly speaking, the GPU could only process three FLOPS per cycle when running in full-blown dual-issue mode.

Why are we boring you with this, you might ask? Calculating and being able to access close-to-peak FLOPS throughput is important for gaming and, rather more so, for GPGPU tasks.

FLOPS throughput will probably be hindered by other factors such as memory bandwidth - which we'll come to - and memory access, but bigger is better.

GeForce GTX 280's 240 stream processors can still only process two FLOPS - multiply and add (MUL and ADD = MADD) - per clock cycle but a 'special-function' unit, like that on the older architectures, located within the streaming multiprocessors, can perform the additional MUL. That's a roundabout way of saying that the architecture can issue three FLOPS per clock cycle, albeit not all purely from the SPs.
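Taking the dual-issue claim at face value, peak single-precision throughput is a one-line sum. A rough sketch, assuming the GTX 280's quoted 1,296MHz shader clock:

```python
# Peak single-precision FLOPS for GTX 280: each of 240 SPs retires a
# MADD (2 FLOPS) per clock, while the special-function unit chips in
# the extra MUL for 3 FLOPS in dual-issue mode.
SPS = 240
SHADER_CLOCK_HZ = 1296e6  # quoted GTX 280 shader clock

madd_only_gflops = SPS * 2 * SHADER_CLOCK_HZ / 1e9  # ~622 GFLOPS
with_mul_gflops  = SPS * 3 * SHADER_CLOCK_HZ / 1e9  # ~933 GFLOPS
```

The three-FLOPS case lands on NVIDIA's headline ~933 GFLOPS figure; how often real shaders get the third FLOP is exactly the dual-issue question above.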

NVIDIA reckons it has improved the dual-issue nature of the GTX 200-series with some behind-the-scenes tweaking. Running 3DMark Vantage may help shed some light on this.

What else is there to say about SPs, SMs and whatnot?

Each SM has its own 16KiB of local memory and can share the contents of its memory with the other two SMs in the cluster without having to run to main system memory and back again. NVIDIA reckons that this ensures greater efficiency by reducing latency - something we'll be putting to the test.

Just like G80, each cluster has its own L1 cache which connects to the ROPs - where all the back-end work takes place, including antialiasing - and then out to main card memory.

ROPs

GTX 280 carries eight ROP (raster operator) partitions - seen in red and blue - that can each process four pixels per clock, for 32 ROPs in total. GeForce 8-series and 9-series GPUs have six such partitions, again at four ppc apiece, totalling 24.

The new SKU, however, can also blend at 32 pixels per clock, whereas G80 needed to split a blender between two ROPs, leading to a 24-ROP (6x4) and 12-blend count. In short, a third more ROPs and around 2.7x the blending, at equal clock speeds.
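The ROP and blend arithmetic can be sanity-checked in a few lines - a sketch based purely on the counts above:

```python
# ROPs and blends per clock. G80 shared one blender between two ROPs,
# so it blended at half its ROP rate; GTX 280 has one blender per ROP.
g80_rops      = 6 * 4   # six partitions x four pixels per clock = 24
g80_blends    = 12      # blender split between two ROPs
gtx280_rops   = 8 * 4   # eight partitions x four pixels per clock = 32
gtx280_blends = 32      # full-rate blending

rop_gain   = gtx280_rops / g80_rops       # ~1.33x
blend_gain = gtx280_blends / g80_blends   # ~2.67x
```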

Memory

Coming down further, each ROP has access to a little L2 cache and a 64-bit connection to main memory. Combining the 64-bit pathways with the eight channels, GTX 280 supports a 512-bit interface - twice that of GeForce 9800 GTX and higher than the 384-bit pushed with the original G80s. Having bags of fast memory is always good for the ROPs, especially when loaded with resolution and copious amounts of filtering.
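Peak bandwidth is simply bus width multiplied by effective data rate. A sketch assuming the quoted launch memory clocks - 1,107MHz GDDR3 (2,214MT/s effective) for GTX 280 and 1,100MHz (2,200MT/s) for 9800 GTX:

```python
# Peak memory bandwidth in GB/s from bus width (bits) and effective
# transfer rate (MT/s). Clock figures are the quoted launch specs.
def bandwidth_gbps(bus_bits, effective_mtps):
    return bus_bits / 8 * effective_mtps * 1e6 / 1e9

gtx280_bw = bandwidth_gbps(512, 2214)  # ~141.7 GB/s
gf9800_bw = bandwidth_gbps(256, 2200)  # ~70.4 GB/s
```

Doubling the bus width is what lets relatively cheap GDDR3 keep up, rather than a move to faster GDDR4/5 DRAMs.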

GeForce GTX 280 is equipped with a 1GiB frame-buffer, though NVIDIA opted to go with GDDR3 DRAMs to keep costs relatively in check.

Other new bits and pieces

ATI introduced double-precision (64-bit) floating-point compliance with its Radeon HD 3000-series and NVIDIA's GTX 200-series supports it, too, via what's termed a fused-MADD. The ability is useful for the GPGPU environment where calculation accuracy is absolutely paramount.

Double-precision calculation, of course, takes away from single-precision throughput - you can have one or the other, not both at once. In this mode the GTX 280 effectively operates as 30 concurrent double-precision processors - one per SM, or one-eighth speed.
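The one-eighth figure follows from having a single double-precision unit per SM. A sketch, again assuming the quoted 1,296MHz shader clock:

```python
# Double-precision throughput: one DP unit per SM (30 in all), each
# issuing a fused MADD (2 FLOPS) per clock - versus 240 SPs for
# single precision, hence the one-eighth rate.
DP_UNITS = 30
SPS = 240
SHADER_CLOCK_HZ = 1296e6  # quoted GTX 280 shader clock

dp_gflops = DP_UNITS * 2 * SHADER_CLOCK_HZ / 1e9  # ~78 GFLOPS
dp_ratio  = DP_UNITS / SPS                        # 0.125, i.e. 1/8 speed
```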

It's likely, though, that 64-bit floating-point computing will remain the domain of high-end CPUs until NVIDIA or ATI launches a specific GPU for this purpose.

Geometry shading, which is part of the DX10 specification, and Stream Out - a sub-set of the geometry shader - have both been greatly improved, according to NVIDIA.

Stream Out is handy when you want to continually re-use previously-generated data without having to re-render it, saving bandwidth and time.

NVIDIA has also increased what are termed the register files. Put simply, they're storage spaces on the streaming multiprocessors that hold in-flight shader data, ready for execution. Super-long shaders could fill the G80's register space, forcing main memory to lend a hand. Bottom line - larger register files mean more on-chip action and, potentially, higher performance.

There's been a general cleaning-up of the architecture, ranging from greater efficiency at memory accesses to better compression and Z-cull performance - where you occlude pixels that won't be shown - at higher resolutions.