ATI Crossfire - Technology Preview - Graphics - HEXUS.net

Crossfire Basics

Crossfire is fairly easy to explain, mostly because a lot of the complexities involved in explaining an M3DR scheme like SLI disappear with Crossfire. You'll see what I mean in due course. I'm going to make it as brief as possible because of that, while not leaving out the key concepts you'll need should you wish to grasp Crossfire in a technical sense. Even if you don't really care about how it works, it's still worth at least glancing over the key details. I'll try to highlight those as I go along, so even the least interested or technically minded of you can pick it all up.

Let's start with how joining more than one recent ATI graphics processor works, to produce a frame of output that both have worked on.

Supertiling

Everyone who writes Crossfire up should explain Supertiling to you, since it's the key element to the entire M3DR scheme working. When any recent ATI graphics processor - from the R300 (which powered the Radeon 9700 and Radeon 9500-series of products), right up to the current ATI flagship GPU, R480 (Radeon X850-series) - renders a 3D scene, it does so by splitting the scene up into tiles. Those tiles, usually 16x16 or 32x32 pixels in size, cover the entire screen, making up the frame from a pixel mosaic. Those mosaic tiles are processed by the pixel engine and pixel output pipelines in the Radeon GPU (fragment processor and pixel ROP). To understand that, here's a quick refresher course on the basic building blocks of a modern immediate-mode 3D processor. Skip this bit if you know how an IM-focussed 3D processor works, in basic terms.

The CPU and graphics driver work together to feed the GPU with geometry data, in the form of triangle primitives. Triangles are the basic building blocks of all geometry you'll see on your screen, in any modern 3D accelerator. Want to display a sphere, cylinder, box, or any other geometric shape, on a computer screen? It's built from tris. The GPU processes those tris in its vertex processors. The vertex processors are complex mini-processors made up of a combination of vector (basically a point and direction in 3D space) and scalar (how big the vector is) arithmetic units (ALUs). They work together in a SIMD or MIMD fashion to output triangles to the triangle setup engine.

The triangle setup engine converts the triangle batches to pixel fragments via specialised silicon called the rasteriser. Each fragment is assigned a set of parameters and attributes that tell the fragment processor (the correct term for the pixel processors, since they operate not on whole pixels, but pixel fragments) things like what colour the fragment is, and what fragment programs to run for that particular fragment. This is where things become relevant for Supertiling, since the fragment units operate on pixel blocks, called quads. Quads are a block of 2x2 pixels. Modern immediate-mode render architectures, like NV40 and R480, operate on quads for reasons of efficiency and ease (relatively speaking) of design.

Processed fragments, output by the fragment units, are processed by the ROPs, which perform functions like colour combining and sampling, anti-aliasing (Z-sampling) and buffer blends, before writing the pixel out to the output buffer, for display on your screen. The entire process is then repeated as fast as it can. There are huge amounts missed out in all three stages (vertices -> pixels -> ROPs), but that's the basics.

That grouping of pixel fragments into quads, and then screen tiles, by the rasteriser, is how Supertiling works. With one GPU, that GPU processes all the screen tiles, effectively Supertiling on its own. With more than one GPU involved, though, each one gets a split of the tiles to work on, with the final output combined at the end so you can see it. Since the tiling takes place after rasterisation, it has an impact on overall performance. I'll explain that shortly.
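To make the tile split concrete, here's a minimal Python sketch. The 32x32 tile size comes from the figures above, but the checkerboard assignment and the names tile_owner and tiles_for_gpu are my own assumptions for illustration, not ATI's documented scheme.

```python
TILE = 32  # tile edge in pixels (the text mentions 16x16 or 32x32)

def tile_owner(tx, ty, num_gpus=2):
    """Assumed checkerboard-style assignment: neighbouring tiles alternate GPUs."""
    return (tx + ty) % num_gpus

def tiles_for_gpu(width, height, gpu, num_gpus=2, tile=TILE):
    """List the (tx, ty) tile coordinates that a given GPU would render."""
    cols = (width + tile - 1) // tile   # round up so partial edge tiles count
    rows = (height + tile - 1) // tile
    return [(tx, ty)
            for ty in range(rows)
            for tx in range(cols)
            if tile_owner(tx, ty, num_gpus) == gpu]

gpu0 = tiles_for_gpu(1024, 768, gpu=0)
gpu1 = tiles_for_gpu(1024, 768, gpu=1)
print(len(gpu0), len(gpu1))  # 384 384
```

At 1024x768 that's 768 tiles in total, so each of two GPUs ends up with half the post-rasterisation work under this assumed pattern.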

The important thing to understand is that everything after rasterisation can be accelerated in the M3DR scheme that Supertiling allows, that Crossfire implements.

How the Supertiling mode of Crossfire affects performance

Good question. Obviously, if the tile rendering acceleration only happens after rasterising the fragments, everything before that is unaccelerated. With Crossfire, or any other Supertiling-esque M3DR implementation, all geometry is passed to each GPU that's participating. That obviously means that geometry performance can't scale absolutely. If each GPU has to process all of the geometry that all the others are working on, how can they accelerate the creation of rasterised fragments?

ATI optimise what each rasterisation unit works on by discarding fragments that'll never be processed, inside of the tiles that each GPU is being asked to render. Basically the GPU interrogates the fragment to find out where it lies in screen space. If the fragment overlaps or lies completely inside the tile boundary for any of the tiles the GPU is processing, it keeps it to process. If not, it's discarded and no further processing is done on it, saving valuable bandwidth and processing power.
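Here's a rough Python sketch of that keep-or-discard test. It works on a screen-space bounding box rather than real fragments, and the checkerboard ownership rule and the names owns_tile and keep_for_gpu are assumptions of mine, since ATI haven't described the hardware logic in that detail.

```python
TILE = 32  # assumed tile edge in pixels

def owns_tile(tx, ty, gpu, num_gpus=2):
    # Assumed checkerboard assignment, purely for illustration.
    return (tx + ty) % num_gpus == gpu

def keep_for_gpu(bbox, gpu, num_gpus=2, tile=TILE):
    """Return True if the screen-space box touches any tile this GPU owns.

    bbox is (x_min, y_min, x_max, y_max) in pixels, inclusive.
    """
    x0, y0, x1, y1 = bbox
    for ty in range(y0 // tile, y1 // tile + 1):
        for tx in range(x0 // tile, x1 // tile + 1):
            if owns_tile(tx, ty, gpu, num_gpus):
                return True   # overlaps an owned tile: keep and process it
    return False              # lies only in other GPUs' tiles: discard it

small = (4, 4, 20, 20)    # sits entirely inside tile (0, 0)
wide = (20, 4, 40, 20)    # straddles tiles (0, 0) and (1, 0)

print(keep_for_gpu(small, gpu=0), keep_for_gpu(small, gpu=1))  # True False
print(keep_for_gpu(wide, gpu=0), keep_for_gpu(wide, gpu=1))    # True True
```

Note the second case: anything that straddles a tile boundary is kept by both GPUs, which is part of why geometry work doesn't halve when you add a second board.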

So geometry performance can't traditionally scale with Supertiling, since all tris must be at least analysed by all GPUs, but the end result can be calculated faster. If you've been paying attention, you'll also have spotted the absolutely key point for ATI's positioning with Crossfire. Absolutely all 3D operations performed at the pixel fragment level and above on a Radeon GPU are done on screen tiles. Which means all your current games and applications are rendered in this tiled fashion on a Radeon GPU as we speak, and are accelerated just fine. Further, that means, with (hopefully) a very small number of exceptions, all game titles will be automatically accelerated by a Crossfire setup. No profile list to turn it on for games, just CATALYST A.I. to turn it off or adjust the rendering mode, if needed.

The main reasons why Crossfire won't be enabled for a game or application come down to what happens to image quality in a Crossfire setup. Let me explain that in more detail.

How Supertiling with Crossfire affects image quality

Since the ROP units operate on resolved pixels output after fragment processing of screen tiles, and the ROP units are where anti-aliasing is performed, sampling pixel depth, image quality from multisample anti-aliasing can be increased. Any Radeon GPU from the R300 upwards has a sample grid (where the hardware knows to sample inside of a pixel) that's 12x12 subpixels in size. From that 144-position grid, samples are chosen by the hardware for depth sampling the pixel to be processed. Check out the sample grids for R300 and higher hardware, here, and the explanation about how "temporal" anti-aliasing works, here.

With Supertiling, something similar to "temporal" can happen. All tiles are rendered by all GPUs, but the depth sample grids are different for each GPU. After processing, the resulting sample data is combined, increasing the number of samples per pixel. So while the maximum number of multisamples per GPU doesn't increase, the effective number of multisamples does, by a multiple of the number of GPUs participating. For a dual-board Crossfire solution based on X800 or X850, that's 12 multisamples per pixel from that 144-position grid (6 samples each). In other words, 12X AA. ATI call that Super AA. Join me in a groan.
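The sample arithmetic can be sketched in a few lines of Python. The 12x12 grid and six samples per GPU come from the text, but the sample positions below are made up for illustration and are not ATI's real patterns, and combine_patterns is a hypothetical helper name.

```python
GRID = 12  # 12x12 subpixel grid, so 144 candidate sample positions

# Made-up (column, row) positions for illustration, NOT ATI's real patterns.
GPU0_SAMPLES = [(1, 3), (3, 9), (5, 0), (7, 6), (9, 11), (11, 4)]
GPU1_SAMPLES = [(0, 7), (2, 1), (4, 10), (6, 5), (8, 2), (10, 8)]

def combine_patterns(*patterns):
    """Merge each GPU's sample positions into one effective pattern."""
    merged = set()
    for pattern in patterns:
        for x, y in pattern:
            if not (0 <= x < GRID and 0 <= y < GRID):
                raise ValueError(f"sample {(x, y)} is off the {GRID}x{GRID} grid")
            merged.add((x, y))
    return sorted(merged)

effective = combine_patterns(GPU0_SAMPLES, GPU1_SAMPLES)
print(len(effective))  # 12
```

The count only doubles if the two GPUs really do use different positions. Feed the same pattern in twice and you're back to six distinct samples.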

If you're in Supertiling mode, using Superduper AA, you can also mix in supersampling with the multisample antialiasing, to antialias texture (colour) data, too. A quick refresher for those a little confused just now, especially since recent ATI hardware has never offered supersampling as an option in PC drivers: multisampling is geometry antialiasing using depth sampling, whereas supersampling antialiases using texture colour sampling. You can sample the texture twice per pixel (2X RGSS) along with the sparse-grid multisampling you're doing in Super AA mode. Twelve geometry samples and two texture samples is apparently 14X AA, according to ATI.

To be fair, they've seemingly named it that way to make it easy for the consumer to understand. However, if you want to get it right, call it dX SGMS plus 2X RGSS, where d is the number of multisamples used across all boards in Supertiling mode with Super AA. SGMS is sparse-grid multisampling and RGSS is rotated-grid supersampling.
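If the naming still feels slippery, this small Python sketch spells out both labels side by side. The function name aa_mode and its parameters are mine, and the marketing figure simply adds the two sample counts together, as ATI appear to do.

```python
def aa_mode(samples_per_gpu, num_gpus=2, supersamples=0):
    """Return (technical label, marketing label) for a Super AA setup."""
    d = samples_per_gpu * num_gpus           # multisamples across all boards
    technical = f"{d}X SGMS"
    if supersamples:
        technical += f" plus {supersamples}X RGSS"
    marketing = f"{d + supersamples}X AA"    # ATI-style headline number
    return technical, marketing

print(aa_mode(6))                  # ('12X SGMS', '12X AA')
print(aa_mode(6, supersamples=2))  # ('12X SGMS plus 2X RGSS', '14X AA')
```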

That also affects performance. In most cases, you can likely double your AA level while holding at least the same framerate, for increased image quality at no speed penalty (given identical boards).

Can't you accelerate geometry performance somehow?

Yeah, you can, but not with Supertiling. ATI apparently own the patent to the alternate frame rendering method you can apply to M3DR schemes, which NVIDIA uses with SLI. ATI offer AFR as a mode to pair with Crossfire, too. AFR avoids a number of performance pitfalls present in other M3DR modes like Supertiling and SFR, since there's no load-balancing to perform, just buffering of frame data to keep all GPUs busy as much as possible. However, you lose the ability to increase image quality in AFR mode, like you can with Supertiling. So for titles where you're geometry limited in some way and you want to use AFR, you can't get more than 6X anti-aliasing and it seems there's no texture AA available either, although we'll see.
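AFR is the easiest mode of all to sketch. The Python below assumes plain round-robin frame assignment, which is what the name implies. afr_schedule is a hypothetical helper, and real drivers also have to handle the frame buffering mentioned above, which this ignores.

```python
def afr_schedule(num_frames, num_gpus=2):
    """Assign whole frames to GPUs in turn: frame 0 to GPU 0, frame 1 to GPU 1, ..."""
    return [frame % num_gpus for frame in range(num_frames)]

print(afr_schedule(6))  # [0, 1, 0, 1, 0, 1]
```

Each GPU handles the full geometry and pixel load for its own frames only, which is why geometry performance can scale here when it can't with Supertiling.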

Any other modes?

Along with Supertiling and AFR, there's a mode called Scissor. Scissor chops the screen horizontally or vertically, with each GPU getting a section to render. I'd imagine that the break is aligned on a screen tile boundary but it shouldn't have to be fixed at 50/50. While it's not adaptive and doesn't change the split per frame, it does allow extra flexibility in the rendering modes available and the use of Scissor where other modes don't work. Those cases should be rare however and Scissor looks likely to be the least used Crossfire rendering mode.
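A quick Python sketch of a horizontal Scissor cut, under my guess that the break snaps to a tile boundary. The 32-pixel tile, the share parameter and the name scissor_split are all assumptions for illustration.

```python
TILE = 32  # assumed tile edge in pixels

def scissor_split(height, share=0.5, tile=TILE):
    """Return ((top_start, top_end), (bottom_start, bottom_end)) pixel row ranges.

    share is the fraction of the screen given to the first GPU. The cut is
    snapped to the nearest tile boundary, and each GPU gets at least one
    row of tiles.
    """
    cut = round(height * share / tile) * tile
    cut = max(tile, min(cut, height - tile))
    return (0, cut), (cut, height)

print(scissor_split(768))             # ((0, 384), (384, 768))
print(scissor_split(768, share=0.6))  # ((0, 448), (448, 768))
```

The split is fixed once chosen, matching the non-adaptive behaviour described above. Nothing here changes per frame.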

Combining rendered output

Like NVIDIA's SLI, there's a master/slave(s) board setup with Crossfire. With SLI, the inter-GPU connector takes care of framebuffer compositing; frame data from the slave is passed over the connector to the master, for joining. With Crossfire, an external connector feeds the output data from the slave into the master board's compositing hardware. ATI are unspecific about where the compositing engine is on Crossfire boards (more on which later), be it on-GPU die silicon or something else (external chip?), but there's hardware that performs the compositing at full speed for both master and slave boards, before output. It's all done digitally, too. You feed DVI output from the slave into the master, where it's composited, before output via a DVI connector (maybe HDMI in future Crossfire hardware?) so you can run a display with it. More on that on the next page.
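For Supertiling mode, the compositing step can be pictured like this in Python. It's a toy model only: a 4x4 frame with 2x2 tiles, an assumed checkerboard ownership rule, and a hypothetical composite function standing in for whatever the real hardware does.

```python
def composite(master, slave, tile=2):
    """Build the final frame, taking each pixel from the board that owns its tile."""
    out = []
    for y, (m_row, s_row) in enumerate(zip(master, slave)):
        row = []
        for x, (m, s) in enumerate(zip(m_row, s_row)):
            owner = (x // tile + y // tile) % 2   # assumed checkerboard, 0 = master
            row.append(m if owner == 0 else s)
        out.append(row)
    return out

master = [["M"] * 4 for _ in range(4)]   # pixels rendered by the master board
slave = [["S"] * 4 for _ in range(4)]    # pixels arriving from the slave board
frame = composite(master, slave)
for row in frame:
    print("".join(row))
# MMSS
# MMSS
# SSMM
# SSMM
```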

Sum it up for me, Rys?

Sure. Radeon's screen tiling approach to rendering, accelerating almost everything after rasterisation if you add more GPUs, seemingly gives you compatibility with a large number of games and can increase image quality for little to no performance cost. Screen tiles are processed by a particular GPU (or all, when anti-aliasing is being done after fragment processing) and the outputs from all are combined for display by dedicated hardware, accelerating performance and, optionally, increasing image quality.

Yay for better IQ and more performance (a given), and yay for ATI giving the consumer more choice when considering their own M3DR setup, but how is Crossfire implemented? I've given you most of the clues on this page (external cabling, master/slave boards), so I'll bring all that info together on the next page. Clickety click.
