Memory Bandwidth Limitations on Current Architectures
The price and performance of current graphics cards are, almost without
exception, largely dictated by the speed and amount (and therefore cost) of
onboard memory. At the value end of nVidia's GeForce 2 range is the GF2 MX,
with 32MB of either 143MHz DDR memory on a 64-bit bus (the slightly cheaper and
slightly slower option) or 166MHz SDR memory on a 128-bit bus. Their flagship
product, the GF2 Ultra, features 64MB of DDR memory clocked at 230MHz on a
128-bit bus, offering over three times the bandwidth of the GF2 MX DDR and
costing around four times as much.
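For the curious, these raw figures fall straight out of the memory clock, the
data rate and the bus width. Here is a quick back-of-the-envelope sketch in
Python, using the clocks and bus widths quoted above:

```python
# Theoretical bandwidth = clock (MHz) x transfers per clock x bus width in bytes.
def bandwidth_mb_s(clock_mhz, transfers_per_clock, bus_width_bits):
    return clock_mhz * transfers_per_clock * bus_width_bits / 8

cards = [
    ("GF2 MX (143MHz DDR, 64-bit)", 143, 2, 64),
    ("GF2 MX (166MHz SDR, 128-bit)", 166, 1, 128),
    ("GF2 Ultra (230MHz DDR, 128-bit)", 230, 2, 128),
]

for name, clock, rate, bus in cards:
    mb_s = bandwidth_mb_s(clock, rate, bus)
    print(f"{name}: {mb_s:.0f} MB/s (~{mb_s / 1024:.1f} GB/s)")
```

The Ultra's 7360MB/s works out at roughly 7.2GB/s, a little over three times the
2288MB/s of the MX's DDR option.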
It is, of course, widely known that even the 7.2GB/s of bandwidth offered by the
Ultra's 230MHz DDR memory, the fastest and most expensive SDRAM currently
available, is not sufficient to exploit the full potential of the Ultra's core
in 16-bit colour, let alone in 32-bit. In a reversal of the traditional state of
affairs in graphics card engineering, cores are now fast enough to consume every
last drop of the memory bandwidth available to them - memory bandwidth, rather
than the core's clock speed or its number of texture units and pixel pipelines,
has become the major bottleneck.
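To put some rough numbers on that claim: assume, generously, that each 32-bit
pixel costs only 12 bytes of framebuffer traffic (a colour write plus a Z read
and a Z write) and ignore texture fetches altogether. At the Ultra's theoretical
peak of around 1000 megapixels per second (a 250MHz core with four pixel
pipelines), the sums do not come close to fitting within 7.2GB/s:

```python
# Framebuffer traffic per 32-bit pixel, ignoring textures entirely:
# a 4-byte colour write, a 4-byte Z read and a 4-byte Z write.
bytes_per_pixel = 4 + 4 + 4

peak_fill_mpixels = 1000      # ~250MHz core x 4 pixel pipelines
available_mb_s = 7360         # 230MHz DDR on a 128-bit bus

needed_mb_s = peak_fill_mpixels * bytes_per_pixel
print(f"Needed:    ~{needed_mb_s / 1024:.1f} GB/s (framebuffer traffic alone)")
print(f"Available: ~{available_mb_s / 1024:.1f} GB/s")
```

Add texture reads on top of that and the shortfall only widens, which is why the
core spends a good proportion of its time waiting for memory.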
Graphics chip manufacturers therefore need to start exploring alternative
methods of increasing effective (this being an important phrase) memory
bandwidth, so that the fill-rate efficiency of graphics cards can be restored to
acceptable levels. The standard 32MB GF2, for example, attains a real-world
fill-rate (the number of pixels the card can render in a fixed period, usually
measured in megapixels, or millions of pixels, per second) that is under 40% of
its theoretical peak value in 32-bit colour mode. This is measured using
the multi-texture fill-rate test in 3D Mark 2000, which represents the best case
scenario; in games, and particularly in texture-intensive situations, efficiency
is even lower.
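The same arithmetic explains that sub-40% figure. The standard GF2's 166MHz DDR
memory supplies a little over 5GB/s; divide that by the per-pixel framebuffer
cost above plus some allowance for texture traffic, and the achievable fill-rate
falls well short of the 800 megapixels per second its 200MHz, four-pipeline core
can theoretically produce. A rough sketch - the texture figure is an assumption
purely for illustration:

```python
# Bandwidth-limited fill-rate for a standard GF2 (200MHz core, 4 pipelines,
# 166MHz DDR on a 128-bit bus). The per-pixel texture traffic is a guess
# for illustration only; the framebuffer cost is as before.
theoretical_mpixels = 200 * 4             # core clock x pixel pipelines
available_mb_s = 166 * 2 * 128 / 8        # ~5312 MB/s

framebuffer_bytes = 12                    # colour write + Z read + Z write
texture_bytes = 8                         # assumed per-pixel texture traffic

limited_mpixels = available_mb_s / (framebuffer_bytes + texture_bytes)
print(f"Theoretical: {theoretical_mpixels} megapixels/s")
print(f"Bandwidth-limited: ~{limited_mpixels:.0f} megapixels/s "
      f"({limited_mpixels / theoretical_mpixels:.0%} of peak)")
```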
Unfortunately, SDRAM, or Synchronous Dynamic RAM, technologies (SDR, DDR and
potentially, in the future, QDR) are progressing at a much slower rate than
graphics chips, so relying solely on advances in this field will only widen the
gulf between real-world and theoretical fill-rate figures. Also, while using
faster SDRAM can have a profound effect on performance (witness the GF2 Ultra),
it also has a similarly profound effect on cost (again, refer to the Ultra), and
is only effective within the limits of current SDRAM yields. QDR (quad data
rate) memory will help, but by the time it is available at reasonable cost,
other bandwidth-increasing methods will need to be employed to ensure a
reasonable level of efficiency.
There's also the option of scaleable multi-chip solutions, as seen with the
Voodoo 5 5x00 and the unreleased 6000. These benefit from the considerable
bandwidth increase that comes from each chip having its own memory interface,
but suffer from the need for textures to be duplicated in the memory space of
each chip, and from a tendency to be expensive. This is the route that 3dfx
planned to go down with its chip codenamed 'Rampage'; sadly, the architecture is
unlikely ever to see the light of day, but the dual-chip edition would have been
quite capable of vastly outperforming nVidia's forthcoming NV20 GPU, had nVidia
chosen not to incorporate any bandwidth-saving technologies (which, luckily,
they have).
A mere mortal such as myself could barely comprehend the potential performance
level of the hypothetical four-chip version - even less so the number of digits
in the price tag that such a card would have commanded (I'd suggest close to
four).
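The trade-off is easy to see on paper: bandwidth scales with the number of
chips, but because every chip needs its own copy of every texture, usable
texture memory does not. A quick sketch, with per-chip figures loosely modelled
on a Voodoo 5-style card:

```python
# Multi-chip scaling: total bandwidth grows with chip count, but duplicated
# textures mean usable texture memory stays at the per-chip figure.
# Per-chip numbers below are loosely modelled on a Voodoo 5-style card.
mem_per_chip_mb = 32
bw_per_chip_mb_s = 166 * 128 / 8          # 166MHz SDR on a 128-bit bus per chip

for chips in (1, 2, 4):
    total_mem = chips * mem_per_chip_mb   # what you pay for
    usable_textures = mem_per_chip_mb     # every chip holds the same textures
    total_bw = chips * bw_per_chip_mb_s
    print(f"{chips} chip(s): {total_mem}MB on board, {usable_textures}MB usable "
          f"for textures, ~{total_bw / 1024:.1f} GB/s")
```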
Another idea, favoured by Finnish company Bitboys as part of their extensive
range of vapourware (they've yet to demonstrate any working silicon), is eDRAM,
or Embedded Dynamic RAM. This ultra-high bandwidth memory, actually embedded
onto the core of the graphics chip (comparable to a bigger, slower L2 CPU
cache), operates on a 512-bit wide bus; by contrast, SDRAM becomes very
difficult to effectively implement on a bus wider than 128-bit, so eDRAM offers
four times the bandwidth of SDRAM at a given clock speed.
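In bandwidth terms the appeal is obvious: at the same clock speed, a 512-bit
on-chip bus moves four times as much data per cycle as a 128-bit external one.
The clock below is purely illustrative:

```python
# Same clock, different bus widths: a 512-bit eDRAM bus delivers four times
# the bandwidth of a 128-bit external SDRAM bus.
clock_mhz = 166                # illustrative clock speed only

for name, bus_bits in (("128-bit SDRAM", 128), ("512-bit eDRAM", 512)):
    mb_s = clock_mhz * bus_bits / 8
    print(f"{name}: ~{mb_s / 1024:.1f} GB/s at {clock_mhz}MHz")
```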
The PlayStation 2 features an unbelievably generous 4MB of this type of memory,
and Nintendo's Game Cube is believed to include an even more stunningly bloated
3MB of eDRAM on its graphics processor. The latter GPU is to be designed by ATI,
so don't be surprised if eDRAM appears in some of their future PC graphics
cards.
The disadvantage of eDRAM, as evidenced by these tiny figures, is its massive
transistor count, which makes it impractical to use in large amounts. Bitboys
plan to incorporate 12MB of eDRAM into each chip based on their 'XBA' (Xtended
Bandwidth Architecture) design - another Voodoo 5-style multi-chip scaleable
solution - plus up to 64MB of standard SDRAM to store the textures and the
overflow from the eDRAM. However, as a result of this and their continued
failure to provide any evidence of functional hardware, some have questioned
whether Bitboys have ever engineered, or will ever manufacture, a working
graphics chip. Such cynicism aside, theirs is a name worth remembering, as one
day they may actually release a competitive product. If they ever do, it is sure
to feature eDRAM.
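The 12MB figure does make some sense on paper, though: it is roughly enough to
hold a complete framebuffer (front, back and Z buffers) on-chip at typical
resolutions, leaving the external SDRAM for textures and whatever else does not
fit. A rough check, with resolutions and 32-bit buffers assumed purely for
illustration:

```python
# Does a full framebuffer (front, back and Z, all 32-bit) fit in 12MB of eDRAM?
def framebuffer_mb(width, height, bytes_per_pixel=4, buffers=3):
    return width * height * bytes_per_pixel * buffers / (1024 * 1024)

for w, h in ((800, 600), (1024, 768), (1280, 1024)):
    size = framebuffer_mb(w, h)
    verdict = "fits" if size <= 12 else "does not fit"
    print(f"{w}x{h}: ~{size:.1f}MB ({verdict} in 12MB)")
```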
Probably the most interesting of these approaches, and in theory the one with
the most dramatic impact on memory bandwidth consumption, is known as 'deferred
rendering'. Any of you who, by some chance, may actually have read my previous
column will recall that I briefly described the 'hidden surface removal' (HSR)
that is strongly rumoured to feature in the NV20 core. This is actually a form
of deferred rendering, albeit one that is probably not as efficient as the
technique I am about to mention. The only cards presently available that are
capable of deferred rendering are those in the PowerVR series, most recently
Videologic's Vivid! (which I also referred to in passing last time). These cards
use 'tile-based rendering', a phrase you may have encountered before.
Deferred rendering is distinct from all the above means of remedying the
bandwidth problem in that it increases the efficiency of bandwidth consumption,
rather than just making more bandwidth available. In other words, it
considerably increases the 'effective' memory bandwidth without increasing the
actual bandwidth. I will not explain the difference between deferred renderers,
in particular tile-based renderers ('tilers'), and so-called 'traditional'
renderers now, but will leave it for my next column. Sadly the Vivid! itself is
an underpowered chip, so does not compete well with the latest products from
nVidia, 3dfx and ATI, but its architecture shows a lot of promise for the
future, and is worthy of deeper investigation.
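Without pre-empting next month's explanation, here is a very rough illustration
of what 'effective' bandwidth means in practice. In a typical scene, several
surfaces overlap each pixel (an average overdraw, or depth complexity, of around
three is common). A traditional renderer touches external memory for every one
of those fragments; a deferred renderer only pays for the pixels that end up
visible. The figures below are assumptions chosen purely for illustration:

```python
# Per-frame framebuffer traffic, traditional vs deferred, assuming an average
# overdraw of 3 and 12 bytes of traffic per fragment (colour write plus Z
# read/write in 32-bit). A tile-based deferred renderer keeps its Z traffic
# on-chip as well, so this actually understates the saving.
pixels = 1024 * 768
overdraw = 3
bytes_per_fragment = 12

traditional = pixels * overdraw * bytes_per_fragment
deferred = pixels * bytes_per_fragment    # only visible pixels reach memory

print(f"Traditional: ~{traditional / 2**20:.0f}MB of framebuffer traffic per frame")
print(f"Deferred:    ~{deferred / 2**20:.0f}MB of framebuffer traffic per frame")
```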
A final word - it is not impossible for all the methods mentioned above to be
incorporated into a single, scaleable chip design. Anyone interested in a
four-chip, deferred tile-based renderer featuring both eDRAM and QDR SDRAM?
To find out more about Bitboys' alleged XBA technology, look at http://www.bitboys.com/xba.html
To read about ST Microelectronics' PowerVR architecture, go to http://kyro.st.com