NVIDIA's GeForce 6200 with TurboCache Preview

by Ryszard Sommefeldt on 15 December 2004, 00:00

Tags: NVIDIA (NASDAQ:NVDA)

NV44's Memory Subsystem - TurboCache

NV44's memory subsystem, especially the memory controller and the new peripheral memory management unit (MMU), is a new departure for NVIDIA. Under the TurboCache marketing umbrella, the memory subsystem uses PCI Express to let the GPU read and write system memory directly. It's not quite AGP: AGP could never texture back into memory, only read from AGP-locked memory into the GPU or dump the entire framebuffer back to the host. Using PCI Express, NV44 can treat system memory as a fully read/write data store for a number of the data objects it routinely works with.

It's made possible by integrating a memory management unit (MMU) into NV44's die, which can actively map, lock and free system memory and let the GPU see it as if it were local card memory. That's paired with one or two DRAMs on the board itself (16MB/128Mbit each), which sit on dedicated 32-bit memory paths to NV44's memory controller.

So that gives you two TurboCache NV44 variants, depending on whether there's a single 16MB DRAM on the board or a pair of them. The 16MB minimum requirement stems from NV44's need to do scan-out (copying the front buffer to your display device) from local card memory. The GPU can't take the chance that the front buffer might be out in system memory, so it needs at least enough memory to hold the front buffer locally. With a 1600x1200 display surface in 32-bit colour coming to nearly 8MB, you can see why a single 16MB DRAM on a 32-bit bus is the minimum.
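As a quick sanity check on that figure, here's the back-of-the-envelope arithmetic as a purely illustrative Python sketch; the resolution and colour depth are the ones quoted above:

```python
# Rough front-buffer size behind the 16MB minimum requirement.
# Illustrative arithmetic only, using the figures quoted in the article.

width, height = 1600, 1200      # display resolution
bytes_per_pixel = 4             # 32-bit colour

front_buffer_mb = width * height * bytes_per_pixel / (1024 * 1024)
print(f"Front buffer at {width}x{height}, 32bpp: {front_buffer_mb:.1f}MB")
# -> roughly 7.3MB, so a single 16MB DRAM holds it with room to spare
```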

So with a single DRAM on board, the MMU maps that 16MB plus a further 112MB of dynamically allocated system memory, for 128MB in total. NVIDIA quotes memory bandwidth here in a funny way, counting the PCI Express bus the NV44 sits on as a memory bus, alongside the bandwidth of the local memory devices. It's not quite that simple, since the bus bandwidth can never be used entirely for off-board memory accesses, and the system memory bus on the other end has less bandwidth than the PCI Express bus anyway. Naughty NVIDIA. Regardless, given a 32-bit local bus width, NV44 with a single 700MHz memory device has 2800MB/sec of local bandwidth.

Two devices double that to 5600MB/sec, since each occupies its own memory segment/bank on the memory controller. The MMU can also map a combined total of 256MB on a 64MB TurboCache board; essentially, each occupied memory segment allows the MMU to map a larger slice of system memory.
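Those local bandwidth figures fall out of simple arithmetic. Another illustrative Python sketch, using the 700MHz effective data rate and 32-bit path per device quoted above:

```python
# Peak local-memory bandwidth for the one- and two-DRAM TurboCache boards.
# Illustrative arithmetic only, using the article's figures.

effective_data_rate_mhz = 700   # effective (DDR) data rate per device
bus_width_bytes = 32 // 8       # dedicated 32-bit path per DRAM

per_device_mb_s = effective_data_rate_mhz * bus_width_bytes
print(f"Single DRAM: {per_device_mb_s}MB/sec")       # 2800MB/sec
print(f"Two DRAMs:   {2 * per_device_mb_s}MB/sec")   # 5600MB/sec
```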

So you can texture to and from system memory over PCI Express. Say you want to create a high-resolution depth map for projected shadowing in your game. If your map is a 2048x2048 surface with a 16-bit data type for depth, that's 8MB right there, never mind basic texture surfaces and the other data the game engine needs to draw the next frame. Games that create surfaces like that still have to run on the 16MB TurboCache 6200, so the GPU has to be able to create that writeable surface off-board, in system memory. There's an absolute requirement for writes into sections of off-board memory, which is where PCI Express comes in. A 16MB AGP board would likely fail here.
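The arithmetic for that hypothetical depth map, as another quick illustrative check:

```python
# Size of the hypothetical 2048x2048, 16-bit depth map from the example above.
# Illustrative arithmetic only.

width = height = 2048
bytes_per_texel = 2             # 16-bit depth format

depth_map_mb = width * height * bytes_per_texel / (1024 * 1024)
print(f"{width}x{height} 16-bit depth map: {depth_map_mb:.0f}MB")   # 8MB
```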

The MMU simply translates the address of the texture surface in system memory and lets NV44 access it as if it were local memory, pushing the request out over the PCI Express bus as needed. It's a memory management unit, rather than a mere memory map unit, because it can also free and delete memory sections it has previously allocated and locked in system memory, rather than only mapping regions that stay permanently locked, which is what other integrated graphics solutions do (and AGP, in some respects).
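To make the map/lock/free idea concrete, here's a purely conceptual sketch. None of this reflects how NV44's hardware is actually built; the class, page size and method names are invented for illustration only:

```python
# Purely conceptual sketch of the map/lock/free idea described above.
# Not a model of NV44's real hardware; everything here is invented
# for illustration.

class SimpleGpuMmu:
    """Maps GPU-visible pages onto 'system memory' pages."""

    PAGE_SIZE = 4096

    def __init__(self):
        self.page_table = {}        # gpu_page -> system_page

    def map_pages(self, gpu_base, system_pages):
        """'Lock' a set of system pages and expose them at a GPU page index."""
        for i, sys_page in enumerate(system_pages):
            self.page_table[gpu_base + i] = sys_page

    def translate(self, gpu_address):
        """Turn a GPU-visible address into a system memory address."""
        page, offset = divmod(gpu_address, self.PAGE_SIZE)
        return self.page_table[page] * self.PAGE_SIZE + offset

    def free_pages(self, gpu_base, count):
        """Release previously mapped pages so the OS can reuse them."""
        for i in range(count):
            self.page_table.pop(gpu_base + i, None)


mmu = SimpleGpuMmu()
mmu.map_pages(gpu_base=0x100, system_pages=[0x8a0, 0x8a1])  # map two pages
print(hex(mmu.translate(0x100 * 4096 + 64)))  # system address of byte 64
mmu.free_pages(0x100, 2)                      # release them again
```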

So if the GPU needs to access system memory, it needs to do so as fast as possible. Being a PCI Express-only product (there can't be an AGP 6200 TurboCache, for the reasons outlined above), you're currently at least guaranteed a dual-channel memory controller, either on the processor or on the northbridge. Athlon 64 is the platform of choice for the TurboCache 6200: with the memory controller on the die, access latency is the lowest in the business. It's also the most efficient, giving more bandwidth at the same memory clocks than the other memory controllers available for PCI Express x86 PC systems.

So dual 64-bit channels give you 6400MB/sec of maximum theoretical bandwidth from the memory controller, with real-world throughput sitting at roughly 5500MB/sec even on the most efficient Athlon 64 memory controller. So NVIDIA's claim that the PCI Express bus is a true 8000MB/sec memory bus for the GPU is a little wide of the mark; the memory controller on the other side of that bus is a wee bit slower.
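Putting numbers to that, here's one more illustrative Python sketch; the dual-channel DDR400 configuration is an assumption inferred from the 6400MB/sec figure above:

```python
# Why the 8000MB/sec claim is optimistic: the dual-channel DDR memory
# controller on the far side of the PCI Express link tops out lower.
# Illustrative arithmetic; DDR400 is assumed from the 6400MB/sec figure.

channels = 2
bus_width_bytes = 64 // 8       # 64-bit bus per channel
data_rate_mt_s = 400            # DDR400 effective transfer rate (assumed)

theoretical_mb_s = channels * bus_width_bytes * data_rate_mt_s
print(f"Theoretical dual-channel bandwidth: {theoretical_mb_s}MB/sec")  # 6400
print("Roughly achievable on a good Athlon 64 controller: ~5500MB/sec")
```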

TurboCache Benefits and Downsides

So being able to read and write arbitrarily from system memory has its downsides. There's extra die space needed for the MMU and its associated logic, although NVIDIA save die space elsewhere by cutting half the fragment and ROP units out compared to NV43. Since the transistors dedicated to memory and interface logic usually sit around the periphery of the GPU, and NVIDIA have taken the time to reshuffle the die layout, putting what I'd assume to be a fairly sizeable MMU on the die has still left NVIDIA with a substantial die-space saving compared to NV43.

Since you only need one or two pieces of very cheap DDR memory on the board, the real killer for AIBs in terms of low-end board cost - memory devices - is massively reduced. At the low end, memory devices are often the most expensive part of the product, so cutting that cost the way NV44 does makes a fair bit of sense.

So the GPU costs NVIDIA less from TSMC than NV43 does, which lets them make a bit more per GPU - handy, since the TurboCache 6200 is likely to be a massive seller - and the AIB makes a bit more per board thanks to the memory device savings. You, the consumer, save too, since even though everyone further up the chain is making a few dollars more per board or GPU, the overall cost saving is being passed your way.

So there's a monetary benefit for everyone (you save more, NVIDIA and the AIBs make a bit more), but there are also downsides. The main one is obviously performance. Accessing system memory is slower than accessing local card memory from a latency point of view, and without a 128-bit bus to local card memory there's likely to be less local bandwidth available than the system memory path provides. That's obviously going to be the case with TurboCache and its narrow local memory bus. So compared to an NV43-based 6200 at the same clocks, with its 128-bit local memory bus, a TurboCache board with even less local memory is likely going to be slower; you just can't avoid that.

But performance is about all that suffers relative to the more expensive option of the NV43-based 6200 with its on-board memory. TurboCache 6200 boards may be potentially slower, but they're also small (no need for multiple DRAM placements and traces), don't need external power connectors or memory cooling of any kind and, as you'll see, don't even need active cooling on the GPU. Does cheap, small, cool and quiet Shader Model 3.0 appeal to you at first glance? Turn the page for a look at the reference boards.