Memory Bandwidth Limitations on Current Architectures
The price and performance of current graphics cards are, almost without
exception, largely dictated by the speed and amount (and therefore cost) of
onboard memory. At the value end of nVidia's GeForce 2 range is the GF2 MX,
with 32MB of either 143MHz DDR memory on a 64-bit bus (the slightly cheaper and
slightly slower option) or 166MHz SDR memory on a 128-bit bus. Their flagship
product, the GF2 Ultra, features 64MB of DDR memory clocked at 230MHz on a
128-bit bus, offering over three times the bandwidth of the GF2 MX DDR and
costing around four times as much.
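For the curious, these raw figures fall straight out of the memory clock, the
data rate and the bus width. Here is a quick back-of-the-envelope sketch in
Python, using the clocks and bus widths quoted above:

```python
# Theoretical bandwidth = clock (MHz) x transfers per clock x bus width in bytes.
def bandwidth_mb_s(clock_mhz, transfers_per_clock, bus_width_bits):
    return clock_mhz * transfers_per_clock * bus_width_bits / 8

cards = [
    ("GF2 MX (143MHz DDR, 64-bit)", 143, 2, 64),
    ("GF2 MX (166MHz SDR, 128-bit)", 166, 1, 128),
    ("GF2 Ultra (230MHz DDR, 128-bit)", 230, 2, 128),
]

for name, clock, rate, bus in cards:
    mb_s = bandwidth_mb_s(clock, rate, bus)
    print(f"{name}: {mb_s:.0f} MB/s (~{mb_s / 1024:.1f} GB/s)")
```

The Ultra's 7360MB/s works out at roughly 7.2GB/s, a little over three times the
2288MB/s of the MX's DDR option.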
It is, of course, widely known that even the 7.2GB/s of bandwidth offered by the
Ultra's 230MHz DDR memory, the fastest and most expensive SDRAM currently
available, is not sufficient to exploit the full potential of the Ultra's core
in 16-bit colour, let alone in 32-bit. In a reversal of the traditional state of
affairs in graphics card engineering, cores are now fast enough to consume every
last drop of the memory bandwidth available to them - memory bandwidth, rather
than the core's clock speed or its number of texture units and pixel pipelines,
has become the major bottleneck.
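To put some rough numbers on that claim: assume, generously, that each 32-bit
pixel costs only 12 bytes of framebuffer traffic (a colour write plus a Z read
and a Z write) and ignore texture fetches altogether. At the Ultra's theoretical
peak of around 1000 megapixels per second (a 250MHz core with four pixel
pipelines), the sums do not come close to fitting within 7.2GB/s:

```python
# Framebuffer traffic per 32-bit pixel, ignoring textures entirely:
# a 4-byte colour write, a 4-byte Z read and a 4-byte Z write.
bytes_per_pixel = 4 + 4 + 4

peak_fill_mpixels = 1000      # ~250MHz core x 4 pixel pipelines
available_mb_s = 7360         # 230MHz DDR on a 128-bit bus

needed_mb_s = peak_fill_mpixels * bytes_per_pixel
print(f"Needed:    ~{needed_mb_s / 1024:.1f} GB/s (framebuffer traffic alone)")
print(f"Available: ~{available_mb_s / 1024:.1f} GB/s")
```

Add texture reads on top of that and the shortfall only widens, which is why the
core spends a good proportion of its time waiting for memory.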
Graphics chip manufacturers therefore need to start exploring alternative
methods of increasing effective (this being an important phrase) memory
bandwidth, so that the fill-rate efficiency of graphics cards can be restored to
acceptable levels. The standard 32MB GF2, for example, attains a real-world
fill-rate (the number of pixels the card can render in a fixed period, usually
measured in megapixels, or millions of pixels, per second) that is under 40% of
its theoretical peak value in 32-bit colour mode. This is measured using
the multi-texture fill-rate test in 3D Mark 2000, which represents the best case
scenario; in games, and particularly in texture-intensive situations, efficiency
is even lower.
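The same arithmetic explains that sub-40% figure. The standard GF2's 166MHz DDR
memory supplies a little over 5GB/s; divide that by the per-pixel framebuffer
cost above plus some allowance for texture traffic, and the achievable fill-rate
falls well short of the 800 megapixels per second its 200MHz, four-pipeline core
can theoretically produce. A rough sketch - the texture figure is an assumption
purely for illustration:

```python
# Bandwidth-limited fill-rate for a standard GF2 (200MHz core, 4 pipelines,
# 166MHz DDR on a 128-bit bus). The per-pixel texture traffic is a guess
# for illustration only; the framebuffer cost is as before.
theoretical_mpixels = 200 * 4             # core clock x pixel pipelines
available_mb_s = 166 * 2 * 128 / 8        # ~5312 MB/s

framebuffer_bytes = 12                    # colour write + Z read + Z write
texture_bytes = 8                         # assumed per-pixel texture traffic

limited_mpixels = available_mb_s / (framebuffer_bytes + texture_bytes)
print(f"Theoretical: {theoretical_mpixels} megapixels/s")
print(f"Bandwidth-limited: ~{limited_mpixels:.0f} megapixels/s "
      f"({limited_mpixels / theoretical_mpixels:.0%} of peak)")
```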
Unfortunately, SDRAM, or Synchronous Dynamic RAM, technologies (SDR, DDR and
potentially, in the future, QDR) are progressing at a much slower rate than
graphics chips, so relying solely on advances in this field will only widen the
gulf between real-world and theoretical fill-rate figures. Also, while using
faster SDRAM can have a profound effect on performance (witness the GF2 Ultra),
it also has a similarly profound effect on cost (again, refer to the Ultra), and
is only effective within the limits of current SDRAM yields. QDR (quad data
rate) memory will help, but by the time it is available at reasonable cost,
other bandwidth-increasing methods will need to be employed to ensure a
reasonable level of efficiency.
There's also the option of scaleable multi-chip solutions, as seen with the
Voodoo 5 5x00 and the unreleased 6000. These benefit from the considerable
bandwidth increase that comes from each chip having its own memory interface,
but suffer from the need for textures to be duplicated in the memory space of
each chip, and from a tendency to be expensive. This is the route that 3dfx
planned to go down with its chip codenamed 'Rampage'; sadly, the architecture is
unlikely ever to see the light of day, but the dual-chip edition would have been
quite capable of vastly outperforming nVidia's forthcoming NV20 GPU, had nVidia
chosen not to incorporate any bandwidth-saving technologies (which, luckily,
they have).
A mere mortal such as myself could barely comprehend the potential performance
level of the hypothetical four-chip version - even less so the number of digits
in the price tag that such a card would have commanded (I'd suggest close to
four).
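The trade-off is easy to see on paper: bandwidth scales with the number of
chips, but because every chip needs its own copy of every texture, usable
texture memory does not. A quick sketch, with per-chip figures loosely modelled
on a Voodoo 5-style card:

```python
# Multi-chip scaling: total bandwidth grows with chip count, but duplicated
# textures mean usable texture memory stays at the per-chip figure.
# Per-chip numbers below are loosely modelled on a Voodoo 5-style card.
mem_per_chip_mb = 32
bw_per_chip_mb_s = 166 * 128 / 8          # 166MHz SDR on a 128-bit bus per chip

for chips in (1, 2, 4):
    total_mem = chips * mem_per_chip_mb   # what you pay for
    usable_textures = mem_per_chip_mb     # every chip holds the same textures
    total_bw = chips * bw_per_chip_mb_s
    print(f"{chips} chip(s): {total_mem}MB on board, {usable_textures}MB usable "
          f"for textures, ~{total_bw / 1024:.1f} GB/s")
```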
Another idea, favoured by Finnish company Bitboys as part of their extensive
range of vapourware (they've yet to demonstrate any working silicon), is eDRAM,
or Embedded Dynamic RAM. This ultra-high bandwidth memory, actually embedded
onto the core of the graphics chip (comparable to a bigger, slower L2 CPU
cache), operates on a 512-bit wide bus; by contrast, SDRAM becomes very
difficult to effectively implement on a bus wider than 128-bit, so eDRAM offers
four times the bandwidth of SDRAM at a given clock speed.
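In bandwidth terms the appeal is obvious: at the same clock speed, a 512-bit
on-chip bus moves four times as much data per cycle as a 128-bit external one.
The clock below is purely illustrative:

```python
# Same clock, different bus widths: a 512-bit eDRAM bus delivers four times
# the bandwidth of a 128-bit external SDRAM bus.
clock_mhz = 166                # illustrative clock speed only

for name, bus_bits in (("128-bit SDRAM", 128), ("512-bit eDRAM", 512)):
    mb_s = clock_mhz * bus_bits / 8
    print(f"{name}: ~{mb_s / 1024:.1f} GB/s at {clock_mhz}MHz")
```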
The PlayStation 2 features an unbelievably generous 4MB of this type of memory,
and Nintendo's Game Cube is believed to include an even more stunningly bloated
3MB of eDRAM on its graphics processor. The latter GPU is to be designed by ATI,
so don't be surprised if eDRAM appears in some of their future PC graphics
cards.
The disadvantage of eDRAM, as evidenced by these tiny figures, is its massive
transistor count, which makes it impractical to use in large amounts. Bitboys
plan to incorporate 12MB of eDRAM into each chip based on their 'XBA' (Xtended
Bandwidth Architecture) design - another Voodoo 5-style multi-chip scaleable
solution - plus up to 64MB of standard SDRAM to store the textures and the
overflow from the eDRAM. However, as a result of this and their continued
failure to provide any evidence of functional hardware, some have questioned
whether Bitboys have ever engineered, or will ever manufacture, a working
graphics chip. Such cynicism aside, theirs is a name worth remembering, as one
day they may actually release a competitive product. If they ever do, it is sure
to feature eDRAM.
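The 12MB figure does make some sense on paper, though: it is roughly enough to
hold a complete framebuffer (front, back and Z buffers) on-chip at typical
resolutions, leaving the external SDRAM for textures and whatever else does not
fit. A rough check, with resolutions and 32-bit buffers assumed purely for
illustration:

```python
# Does a full framebuffer (front, back and Z, all 32-bit) fit in 12MB of eDRAM?
def framebuffer_mb(width, height, bytes_per_pixel=4, buffers=3):
    return width * height * bytes_per_pixel * buffers / (1024 * 1024)

for w, h in ((800, 600), (1024, 768), (1280, 1024)):
    size = framebuffer_mb(w, h)
    verdict = "fits" if size <= 12 else "does not fit"
    print(f"{w}x{h}: ~{size:.1f}MB ({verdict} in 12MB)")
```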
Probably the most interesting of these approaches, and in theory the one with
the most dramatic impact on memory bandwidth consumption, is known as 'deferred
rendering'. Any of you who, by some chance, may actually have read my previous
column will recall that I briefly described the 'hidden surface removal' (HSR)
that is strongly rumoured to feature in the NV20 core. This is actually a form
of deferred rendering, albeit one that is probably not as efficient as the
technique I am about to mention. The only cards presently available that are
capable of deferred rendering are those in the PowerVR series, most recently
Videologic's Vivid! (which I also referred to in passing last time). These cards
use 'tile-based rendering', a phrase you may have encountered before.
Deferred rendering is distinct from all the above means of remedying the
bandwidth problem in that it increases the efficiency of bandwidth consumption,
rather than just making more bandwidth available. In other words, it
considerably increases the 'effective' memory bandwidth without increasing the
actual bandwidth. I will not explain the difference between deferred renderers,
in particular tile-based renderers ('tilers'), and so-called 'traditional'
renderers now, but will leave it for my next column. Sadly the Vivid! itself is
an underpowered chip, so does not compete well with the latest products from
nVidia, 3dfx and ATI, but its architecture shows a lot of promise for the
future, and is worthy of deeper investigation.
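Without pre-empting next month's explanation, here is a very rough illustration
of what 'effective' bandwidth means in practice. In a typical scene, several
surfaces overlap each pixel (an average overdraw, or depth complexity, of around
three is common). A traditional renderer touches external memory for every one
of those fragments; a deferred renderer only pays for the pixels that end up
visible. The figures below are assumptions chosen purely for illustration:

```python
# Per-frame framebuffer traffic, traditional vs deferred, assuming an average
# overdraw of 3 and 12 bytes of traffic per fragment (colour write plus Z
# read/write in 32-bit). A tile-based deferred renderer keeps its Z traffic
# on-chip as well, so this actually understates the saving.
pixels = 1024 * 768
overdraw = 3
bytes_per_fragment = 12

traditional = pixels * overdraw * bytes_per_fragment
deferred = pixels * bytes_per_fragment    # only visible pixels reach memory

print(f"Traditional: ~{traditional / 2**20:.0f}MB of framebuffer traffic per frame")
print(f"Deferred:    ~{deferred / 2**20:.0f}MB of framebuffer traffic per frame")
```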
A final word - it is not impossible for all the methods mentioned above to be
incorporated into a single, scaleable chip design. Anyone interested in a
four-chip, deferred tile-based renderer featuring both eDRAM and QDR SDRAM?
To find out more about Bitboys' alleged XBA technology, look at http://www.bitboys.com/xba.html
To read about ST Microelectronics' PowerVR architecture, go to http://kyro.st.com