Depth Stencil Textures and Percentage Closer Filtering
Depth stencil textures are what the hardware uses to hold the depth map needed by the shadow algorithm, created by rendering the scene from the perspective of the light that's casting the shadow. Imagine for a second that the light source has eyes and can see the scene from its viewpoint. It'd see objects near to it, far away from it and everything in between, with a clear view of their distance, and therefore their depth, from its point of view. All that depth information, the distance from the light's viewpoint to the objects it's looking at, gets saved in a block of memory (the depth texture) in the graphics card's memory. It doesn't save any information about the colour of objects, just how far away they are.
Then when you draw the scene from the perspective of you the viewer, without lighting applied, you've got all the pixels you'll see from your perspective, from your eyes right out to the furthest object visible on the screen. It looks a bit flat without lighting since everything is the same brightness, but at least you can see depth. So now we want to render the lighting. Shadows are lights if you think about it, just places where the light is blocked and so its intensity is lessened. So by the same token, rendering the lighting on top of our scene means rendering the shadows too.
So what the hardware does is project the depth texture saved earlier, from the point of view of the light, onto the scene that you can see. That projection is done by transforming the depth texture's view so it lines up with the view matrix defining the view you can see on the screen. You basically overlay your view of the world with the depth information from the view of the light, then compare the view you can see with your eyes to the one projected onto the scene from the light's 'eye'.
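If you want to picture that projection in code, here's a rough Direct3D 9 style sketch of building the texture matrix that takes world-space positions into the light's texture space, so the depth map can be laid over the camera's view. The light position, field of view and half-texel offset are made-up values for illustration, not anything 3DMark05 itself uses; the point is the view * projection * bias chain.

```cpp
#include <d3dx9.h>

// Build a matrix that takes a world-space position into the light's
// texture space, so the depth map can be projected over the camera's view.
D3DXMATRIX BuildShadowTexMatrix(const D3DXVECTOR3& lightPos,
                                const D3DXVECTOR3& lightTarget,
                                float shadowMapSize)
{
    // View and projection from the light's point of view -- the same
    // matrices used when the depth map was rendered.
    D3DXMATRIX lightView, lightProj;
    D3DXVECTOR3 up(0.0f, 1.0f, 0.0f);
    D3DXMatrixLookAtLH(&lightView, &lightPos, &lightTarget, &up);
    D3DXMatrixPerspectiveFovLH(&lightProj, D3DX_PI / 2.0f, 1.0f, 1.0f, 1000.0f);

    // Bias matrix: clip space [-1,1] -> texture space [0,1], with a
    // half-texel offset to hit texel centres on D3D9 hardware.
    float offset = 0.5f + (0.5f / shadowMapSize);
    D3DXMATRIX bias(0.5f,   0.0f,   0.0f, 0.0f,
                    0.0f,  -0.5f,   0.0f, 0.0f,
                    0.0f,   0.0f,   1.0f, 0.0f,
                    offset, offset, 0.0f, 1.0f);

    // World -> light clip space -> texture space.
    return lightView * lightProj * bias;
}
```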
The comparison is done by sampling from the depth texture at the point (the x, y and z coordinates of the pixel fragment, projected into the light's space) you want to test for shadow. If the depth of the pixel fragment you're testing for shadow (let's call that 'A') lies beyond the depth of the sample stored in the depth texture at the same point after projection (let's call that 'B'), you're in shadow. Simply, if the fragment in the scene you can see is further from the light than the information stored in the depth map, A > B, you're in shadow.
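Expressed as code, the test itself is just a comparison per fragment. Here's a minimal plain C++ sketch (the function name and the 0/1 convention are illustrative, not any particular API); the hardware and shader versions do the same thing.

```cpp
// Returns 1.0f if the fragment is lit, 0.0f if it's in shadow.
// 'fragmentDepthInLightSpace' is A, the fragment's depth as seen from the
// light; 'storedDepth' is B, the value read from the depth texture at the
// projected (x, y) coordinate.
float ShadowTest(float fragmentDepthInLightSpace, float storedDepth)
{
    // A > B: something nearer to the light wrote the depth map here,
    // so this fragment sits behind it and is in shadow.
    return (fragmentDepthInLightSpace > storedDepth) ? 0.0f : 1.0f;
}
```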
So you can see how it works: draw the scene twice. Once in full, from the point of view of you the viewer, and once with only depth information generated, from the point of view of the light that's going to cast the shadows.
Optimising it
It's the acceleration of that depth-texture render that DST-accelerating hardware provides. Since you're only rendering depth information, you turn off all colour writing inside the hardware, saving the bandwidth and instructions needed to generate those colours. You can do that programmatically in DirectX and OpenGL when setting the render state (the information the OpenGL or DirectX rasteriser uses to draw your scene).
You can also optimise the transformation of the texture projection in the hardware (so the depth map is transformed from object space to eye space as fast as possible). On hardware that can accelerate this method of texture rendering, the lookup into the depth map, the comparison (which generates 1.0 or 0.0 depending on whether that particular fragment is in shadow or not) and the modulate of the final value returned are all done in a fast path. The hardware also optimises certain texture operations (copies, mainly, it seems) for textures created using DST-aware surface formats.
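As a rough Direct3D 9 sketch of that depth-only pass setup (the device and size variables are assumed to already exist, and this isn't 3DMark05's actual code), it might look something like this:

```cpp
#include <d3d9.h>

// Depth-only pass setup, Direct3D 9 style. 'device' is an existing
// IDirect3DDevice9*, 'size' is the shadow map resolution.
IDirect3DTexture9* shadowTex  = NULL;
IDirect3DSurface9* shadowSurf = NULL;

// Create the depth texture using a DST-aware surface format (D24X8).
device->CreateTexture(size, size, 1, D3DUSAGE_DEPTHSTENCIL,
                      D3DFMT_D24X8, D3DPOOL_DEFAULT, &shadowTex, NULL);
shadowTex->GetSurfaceLevel(0, &shadowSurf);

// Bind it as the depth/stencil target and switch off all colour writes,
// since only depth information matters for this pass.
device->SetDepthStencilSurface(shadowSurf);
device->SetRenderState(D3DRS_COLORWRITEENABLE, 0);

// ... render the scene from the light's point of view here ...

// Restore colour writes for the main pass afterwards.
device->SetRenderState(D3DRS_COLORWRITEENABLE,
                       D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
                       D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
```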
So there's a bunch of work being done on DST-accelerating hardware, optimised to take as little time as possible, that isn't done on hardware that doesn't support accelerated DSTs.
Hardware that can't do accelerated DSTs can't take those shortcuts, so to speak, when doing the lookup into the depth map, the comparison, the texture projection using the depth map, and the rest of the optimisations.
Currently NVIDIA are the only IHV shipping DST-accelerating hardware that works using the semi outside-of-DX9 method that 3DMark05 uses.
Percentage Closer Filtering
NVIDIA's hardware uses percentage closer filtering to sample the depth map when determining whether the pixel fragment is in shadow or not. Percentage closer filtering takes four samples from the depth map, from the texels around the projected sample point, compares each one against the depth value of the pixel fragment, then takes a weighted average of the comparison results. So you sample four times, compare, then bilinearly weight the results. That's a basic weighted bilinear filter, in hardware.
That method gives you a value that tells you how much a given pixel fragment is in shadow, bilinearly filtered across the samples it took from the depth map.
The alternate method is to point sample inside the depth map four times and interpolate the results linearly. In 3DMark05 that's done in the fragment shader on hardware that doesn't support PCF, and it's more costly in terms of performance, due to the extra work done to perform the sampling.
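Here's a minimal sketch of that four-sample filter, written as plain C++ rather than shader code so it's self-contained. The depth map layout and helper names are assumptions for illustration; the point is the compare-then-bilinearly-weight structure, which is the same whether it happens in fixed-function hardware or in a fragment shader.

```cpp
#include <cmath>

// Hypothetical depth map: 'depth' points at size*size floats in [0,1].
struct DepthMap { const float* depth; int size; };

// Read one texel, clamped to the edge of the map.
static float FetchDepth(const DepthMap& map, int x, int y)
{
    if (x < 0) x = 0; if (x >= map.size) x = map.size - 1;
    if (y < 0) y = 0; if (y >= map.size) y = map.size - 1;
    return map.depth[y * map.size + x];
}

// 2x2 percentage closer filter: compare the fragment's light-space depth
// against four neighbouring texels, then bilinearly weight the 0/1 results.
// Returns a shadow factor between 0.0 (fully shadowed) and 1.0 (fully lit).
float PCF2x2(const DepthMap& map, float u, float v, float fragmentDepth)
{
    float x  = u * map.size - 0.5f;
    float y  = v * map.size - 0.5f;
    int   x0 = (int)std::floor(x);
    int   y0 = (int)std::floor(y);
    float fx = x - x0;   // bilinear weights
    float fy = y - y0;

    // Depth comparison first, filtering second -- that's what makes it PCF.
    float s00 = (fragmentDepth > FetchDepth(map, x0,     y0    )) ? 0.0f : 1.0f;
    float s10 = (fragmentDepth > FetchDepth(map, x0 + 1, y0    )) ? 0.0f : 1.0f;
    float s01 = (fragmentDepth > FetchDepth(map, x0,     y0 + 1)) ? 0.0f : 1.0f;
    float s11 = (fragmentDepth > FetchDepth(map, x0 + 1, y0 + 1)) ? 0.0f : 1.0f;

    float top    = s00 + fx * (s10 - s00);
    float bottom = s01 + fx * (s11 - s01);
    return top + fy * (bottom - top);
}
```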
While other upcoming hardware might support accelerated DSTs, it won't do hardware PCF, leaving that as an NVIDIA-only feature.
There's also the question of quality, which I touched on in the first article. NVIDIA's hardware implementation of PCF is fixed. It samples from the same space each time and returns fixed values (a set of basic stepped values between 0 and 1). Do the sampling in the pixel shader and you can adjust the sample space for better filtering quality on the shadow edges.
That's the crux of why NVIDIA's PCF will only look good in certain cases, and bad in others. Do it in a shader and you can adjust the sample space and tweak the quality (per pixel if you wish).
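Building on the PCF sketch above (and reusing its hypothetical DepthMap and FetchDepth helpers), a shader-side version can scale the sample offsets per pixel, which is exactly what the fixed hardware path can't do. The radius value and sample pattern here are made up for illustration.

```cpp
// Shader-emulated PCF with an adjustable kernel, reusing FetchDepth and
// DepthMap from the earlier sketch. 'radiusInTexels' can vary per pixel,
// which fixed-function PCF can't do.
float PCFAdjustable(const DepthMap& map, float u, float v,
                    float fragmentDepth, float radiusInTexels)
{
    const int   taps = 4;
    const float offsets[taps][2] = {
        { -1.0f, -1.0f }, { 1.0f, -1.0f }, { -1.0f, 1.0f }, { 1.0f, 1.0f }
    };

    float lit = 0.0f;
    for (int i = 0; i < taps; ++i)
    {
        int x = (int)(u * map.size + offsets[i][0] * radiusInTexels);
        int y = (int)(v * map.size + offsets[i][1] * radiusInTexels);
        lit += (fragmentDepth > FetchDepth(map, x, y)) ? 0.0f : 1.0f;
    }
    return lit / taps;   // average of the widened comparisons
}
```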
Will it be a part of DirectX in the future?
It appears that the answer to that question is one of intellectual property. NVIDIA, via patents acquired from SGI and work done with them (it's not just XBox and GeForce3+ that support accelerated DSTs, SGI's InfiniteReality and RealityEngine hardware does too :P), seem to own the IP rights to doing accelerated DST rendering in hardware. While it's true that another vendor looks to be supporting it in the future (S3 in a DeltaChrome part), NVIDIA still seem to own the rights to any implementation. If they license that IP to Microsoft, for inclusion in DirectX, other vendors would pick it up and support it (after ratification and support from Microsoft and all the IHVs that govern DirectX hardware features). It's unclear whether that's going to happen though.
Supported in DirectX or not, the method of supporting accelerated DSTs in a DirectX 3D engine seems trivial enough, but prone to failure in some cases. We've already seen it with Far Cry, another title that supports NVIDIA's accelerated DST rendering for shadowing. Before Far Cry was aware of ATI's X800 boards via detection of the vendor and device IDs on the hardware itself, it treated them as NVIDIA boards using a fallback path inside the renderer, attempting to utilise DSTs and breaking that render path.
If you forced the X800 to appear as a Radeon 9500+ using a faked device ID, the issue went away; a patch was needed for proper support. Failures like that inside a game engine, especially one that speculates on support by detecting hardware using vendor and device IDs, seem all too easy to achieve.
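For reference, that kind of detection typically looks something like the following Direct3D 9 sketch. The branching logic is illustrative rather than Far Cry's actual code, but it shows why an unrecognised device ID can push a board down the wrong render path.

```cpp
#include <d3d9.h>

// Fragile hardware detection via vendor/device IDs, Direct3D 9 style.
// 'd3d' is an existing IDirect3D9*.
bool LooksLikeDSTCapableBoard(IDirect3D9* d3d)
{
    D3DADAPTER_IDENTIFIER9 id;
    if (FAILED(d3d->GetAdapterIdentifier(D3DADAPTER_DEFAULT, 0, &id)))
        return false;

    // 0x10DE is NVIDIA's PCI vendor ID. Anything the engine doesn't
    // recognise risks being sent down a path it can't actually run.
    return id.VendorId == 0x10DE;
}
```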
The argument for full inclusion in DirectX is therefore logical, since that gives you easy caps checking via the standard DirectX caps checking interfaces. It's not enough just to ask DirectX if D24X8 is a valid depth texture format, since that test might pass on unsupporting hardware that can use that surface format for something else.
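That format query is roughly the following. As noted above, a passing result here only says the format and usage combination is legal, not that the driver actually takes the accelerated DST path, which is the gap a proper DirectX caps bit would close.

```cpp
// Ask Direct3D 9 whether D24X8 can be created as a depth-stencil *texture*
// (rather than just a plain depth surface). 'd3d' is an existing
// IDirect3D9*; the adapter format is assumed to be X8R8G8B8 here.
bool D24X8TextureSupported(IDirect3D9* d3d)
{
    HRESULT hr = d3d->CheckDeviceFormat(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
                                        D3DFMT_X8R8G8B8,
                                        D3DUSAGE_DEPTHSTENCIL,
                                        D3DRTYPE_TEXTURE,
                                        D3DFMT_D24X8);
    // SUCCEEDED() only tells you the combination is valid, not that the
    // hardware accelerates DST shadow rendering with it.
    return SUCCEEDED(hr);
}
```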
Easy as it may be for the developer to include, it's also prone to failure depending on how you detect supporting hardware. If FutureMark's claim that their shadow rendering technique is going to be prevalent in upcoming games titles is correct, we hope that developers implement a robust method of detection for as long as the feature falls outside of DirectX proper.
Overall
Hopefully you've now got a more in-depth understanding of the optimisations NVIDIA do in their hardware with regards to DSTs and PCF: turn off colour writes, optimise the perspective texture mapping calculations, be quick when copying depth textures around in card memory, do single-cycle PCF filtering and quickly modulate the result.
Hopefully you can also see why PCF is fixed quality. While the resolution of the depth map has a lot to do with the output quality of the shadow edges (maybe more so than the filtering method!), the filtering method plays its part. Ultimately, doing it in the fragment shader is higher quality and the speed hit is tolerable (and as you'll see shortly, there might be no speed hit at all on comparable hardware from another IHV!).
Finally, I can't really answer the question of whether accelerated DSTs will make it into DirectX or not. But I do know that since it's outside of the official spec, implementation is prone to failure due to card detection, meaning you can get incorrect or totally broken rendering if the developer isn't careful enough.
With that out of the way, let's get down to the important matter of seeing just what boards are the fastest at 3DMark05.