Review: NVIDIA's SLI - An Introduction

by Ryszard Sommefeldt on 22 November 2004, 00:00

Tags: NVIDIA (NASDAQ:NVDA)

Quick Link: HEXUS.net/qa4u

Split-frame Rendering

Split-frame rendering is the more attractive of the two modes from a gamer's point of view. Here's why.

Since both cards are working together on the same frame, there's no extra latency involved in an SLI system. By latency, I mean the delay between user input and that input being visibly conveyed on the screen. For fast-paced games (and I don't mean frame rate, although it's desirable for the two to go hand in hand), that's a paramount consideration. You want your input to show up on screen as fast as is technically possible, and with SFR that's the case. SFR introduces no latency to the process over a single-GPU system.

The driver manages the rendering resources needed to produce a frame of render output, sending the correct resources (which can include any of the data needed for the render output at any point in the render process) and graphics commands to each GPU, so that each renders its part of the image.
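
To picture what that fan-out might look like, here's a conceptual C++ sketch. It's purely my illustration of the idea, not NVIDIA's driver code; the queue and command types are made up for the example.

```cpp
// Conceptual sketch (my illustration, not the actual driver) of a
// driver fanning out graphics commands to per-GPU queues, with each
// command carrying the resources its GPU will need to render.
#include <cstdio>
#include <string>
#include <vector>

struct Command { std::string name; std::vector<std::string> resources; };

struct GpuQueue {
    int id;
    std::vector<Command> pending;
    void submit(const Command& c) { pending.push_back(c); }
};

int main()
{
    GpuQueue gpu0{0}, gpu1{1};
    Command drawWorld{"DrawWorld", {"world.vb", "lightmap.tex"}};

    // The driver decides which GPU(s) need a command; a full-screen
    // draw like this one is relevant to both halves of the frame.
    gpu0.submit(drawWorld);
    gpu1.submit(drawWorld);
    std::printf("gpu0 queue: %zu, gpu1 queue: %zu\n",
                gpu0.pending.size(), gpu1.pending.size());
}
```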

Clipping is used to split the screen up. That means geometry destined to become pixel fragments, the data required to render those fragments, and any other resources aren't sent to a GPU that won't be rendering them.
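
As a concrete illustration, here's a minimal sketch of how a driver might carve a frame into per-GPU clip rectangles at a given split line. The names and structure are mine, invented for the example.

```cpp
// Hypothetical sketch of deriving per-GPU clip rectangles for a
// horizontal SFR split. ClipRect and splitForGpu are illustrative
// names, not anything from NVIDIA's actual driver.
#include <cstdio>

struct ClipRect { int left, top, right, bottom; };

// Returns the portion of the screen a given GPU is responsible for.
// 'splitLine' is the scanline where the load-balanced split sits.
ClipRect splitForGpu(int gpuIndex, int width, int height, int splitLine)
{
    if (gpuIndex == 0)
        return { 0, 0, width, splitLine };        // top portion
    return { 0, splitLine, width, height };       // bottom portion
}

int main()
{
    // A 1600x1200 frame with the split sitting 30% down the screen.
    ClipRect top    = splitForGpu(0, 1600, 1200, 360);
    ClipRect bottom = splitForGpu(1, 1600, 1200, 360);
    std::printf("GPU0 renders rows %d-%d, GPU1 rows %d-%d\n",
                top.top, top.bottom - 1, bottom.top, bottom.bottom - 1);
}
```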

However, in modern rendering environments, such as the one provided by the DirectX 9.0 render path in the Source engine, there are a lot of off-screen resources for the driver to consider. You've got multiple rendering passes used to draw the frame, from bump maps to gloss maps, and they're almost all rendered to off-screen surfaces called render targets.

Those render targets, simple texture surfaces the GPU can draw into, aren't displayed on screen as soon as they're created. Rather, they're held off-screen, combined with any other render targets needed to achieve correct rendering, and then swapped to the front buffer at the end of the render sequence. I'd also speculate that, more often than not, those render targets are the same size as screen space. So if you're running your Source game at 1600x1200, most of the composite render targets used to draw the frame are also 1600x1200 in size, each with an associated data format.
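
For reference, this is roughly what creating such a screen-sized render target looks like through the Direct3D 9 API. It's a fragment rather than a full program: it assumes an already-initialised device, with all the setup omitted.

```cpp
// A minimal Direct3D 9 fragment showing how an engine might create a
// screen-sized off-screen render target. Assumes 'device' is an
// already-initialised IDirect3DDevice9* (device creation omitted).
#include <d3d9.h>

IDirect3DTexture9* createScreenSizedTarget(IDirect3DDevice9* device,
                                           UINT width, UINT height)
{
    IDirect3DTexture9* target = NULL;
    // D3DUSAGE_RENDERTARGET marks the texture as drawable by the GPU;
    // render targets must live in D3DPOOL_DEFAULT.
    device->CreateTexture(width, height, 1, D3DUSAGE_RENDERTARGET,
                          D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT,
                          &target, NULL);
    return target;
}
```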

That means the render targets are applicable to both the top and bottom of the screen, regardless of where the SFR split falls. So the off-screen resources need to be managed so that each GPU renders into the right section of the resource as the pair combine their power to render the frame. And the split in SFR mode is load balanced, so it might be the case that GPU1 is rendering into the top 30% of a render target and GPU2 into the bottom 70%. That split is more than likely going to change for the next frame, as the load is redistributed. If there are complex render operations happening, like perspective texturing, where the output is rendered into the target following a matrix transformation, those matrix transforms need to be done on both GPUs, for each frame, for any possible load split.
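
To illustrate the idea, here's a hypothetical sketch of a balancer nudging the split line each frame based on how long each GPU took on the previous one. The heuristic is entirely my own guess; NVIDIA's actual balancing algorithm isn't public.

```cpp
// Hypothetical SFR load balancer: nudge the split line toward the
// GPU that finished its portion faster last frame. The step size and
// heuristic are illustrative guesses, not NVIDIA's real scheme.
#include <algorithm>
#include <cstdio>

int rebalance(int splitLine, int height, float gpu0Ms, float gpu1Ms)
{
    // If GPU0 (top) took longer, shrink its share; if GPU1 (bottom)
    // took longer, grow GPU0's share. Small steps keep it stable.
    const int step = height / 50;                 // 2% of the screen
    if (gpu0Ms > gpu1Ms) splitLine -= step;
    else if (gpu1Ms > gpu0Ms) splitLine += step;
    return std::clamp(splitLine, step, height - step);
}

int main()
{
    int split = 600;                              // start at 50% of 1200
    // The top half is heavier this frame, so GPU0 runs slower and the
    // split line moves up, handing GPU1 a bigger share next frame.
    split = rebalance(split, 1200, 9.5f, 7.0f);
    std::printf("new split line: row %d\n", split);
}
```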

Geometry creation, at the beginning of the render pipeline, is also a complex thing for the driver to manage. The driver has to decide which GPU a piece of geometry is rendered on; the GPU then rasterises that geometry data into pixel fragments, deciding just how much geometry it's discarding or rendering, and how many pixel fragments any Z-buffer optimisation scheme is going to discard. You can see why it's a complex operation even on a single GPU, never mind two.

With the driver managing all of that, plus synchronisation commands and data sent over the inter-GPU link (more on which later), you can see where NVIDIA's claimed ~1.8x average speedup in SFR mode comes from. There's significant overhead in the SFR scheme, something that prevents it from ever offering a straight 2x performance increase. And depending on the render setup, that performance increase could actually become a performance decrease, as the driver and management overhead for a pair of GPUs outweighs any performance advantage there might have been.
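
As a back-of-the-envelope illustration, assume (my assumption, not NVIDIA's published model) a fixed per-frame management overhead expressed as a fraction of the single-GPU frame time. A surprisingly small amount of overhead already accounts for a ~1.8x figure, and enough of it tips the pair into an outright slowdown.

```cpp
// Back-of-the-envelope model of SFR scaling, assuming (my assumption,
// not NVIDIA's published model) a fixed per-frame management overhead
// expressed as a fraction of the single-GPU frame time.
#include <cstdio>

// Ideal split halves the work; the overhead is paid on top each frame,
// so speedup = 1 / (0.5 + overhead).
double sfrSpeedup(double overheadFraction)
{
    return 1.0 / (0.5 + overheadFraction);
}

int main()
{
    // ~5.5% overhead per frame already drops the speedup to ~1.8x.
    std::printf("speedup at 5.5%% overhead: %.2fx\n", sfrSpeedup(0.055));
    // Past 50% overhead, the pair is slower than a single GPU.
    std::printf("speedup at 55%% overhead:  %.2fx\n", sfrSpeedup(0.55));
}
```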

And that's not even discussing pixel fragments that cross the load-balancing line. You can have long, thin triangles that cross the border. Since triangles are the rendering primitive employed by today's accelerators, the fragments that make up those triangles, and all the peripheral data associated with them, have to be shared between both GPUs. Extra overhead for the driver to manage.
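Here's an illustrative sketch of that classification: a triangle wholly above or below the split can go to one GPU, while one that straddles it has to be sent to both. Again, the code is mine, for illustration only.

```cpp
// Illustrative sketch of classifying triangles against the SFR split
// line. The types and routing enum are hypothetical, invented for
// this example.
#include <cstdio>

struct Vec2 { float x, y; };
struct Triangle { Vec2 v[3]; };

enum class Route { Gpu0Only, Gpu1Only, BothGpus };

Route routeTriangle(const Triangle& t, float splitY)
{
    bool above = false, below = false;
    for (const Vec2& v : t.v) {
        if (v.y < splitY) above = true;   // screen-space y grows downward
        else              below = true;
    }
    if (above && below) return Route::BothGpus;   // straddles the border
    return above ? Route::Gpu0Only : Route::Gpu1Only;
}

int main()
{
    // A long, thin triangle crossing the split line at y = 360.
    Triangle t{{{100.f, 200.f}, {110.f, 500.f}, {120.f, 210.f}}};
    std::printf("routed to: %d\n",
                static_cast<int>(routeTriangle(t, 360.f))); // BothGpus
}
```
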

So while SFR is the more desirable mode from a latency standpoint, there's a simpler means to extra performance in a multi-GPU system, but one which introduces some potential extra latency.