Hardware Analysis
We've discussed what NV40 does, along with a quick piece on the Shader Model 3.0 spec in DirectX 9.0c which it supports, since that covers the hardware's features in practical use. But how does it do it?

I'm loath to use NVIDIA's press diagrams for the hardware, so instead I'll describe it without them; hopefully you can follow along without much trouble.
Pixel Shader
NV40's pixel shader functional groups, of which there's one per pipeline, consist of two FP32 shader units, a texture unit, a branch processor for flow control and looping, and a small fog arithmetic unit. Only the first FP32 shader unit has access to the texture unit, but both have identical basic specs. That means a maximum of four pixel shader instructions per clock, per pipeline.

The shader units can be dual-issued or co-issued. Dual-issue means processing two pixels at once, doing two instructions on each, per clock. Co-issue, at half that rate, does a pair of instructions on a single pixel per clock. In either mode, NV40 can split a pixel's four components 3+1 or 2+2. In a 3+1 split, one pixel shader instruction operates on three components while a second instruction operates on the remaining one; a 2+2 split works the same way, just divided down the middle.
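To make the component split concrete, here's a rough sketch in plain Python of a 3+1 co-issue: two independent instructions covering one pixel's four components in what would be a single clock. It's illustrative only, not shader code, and not how the hardware actually schedules anything.

    # Rough sketch of a 3+1 co-issue: two independent instructions cover one
    # pixel's four components in a single clock. Illustrative only; the real
    # hardware scheduling is more involved than this.

    def co_issue_3_plus_1(pixel, op_rgb, op_a):
        r, g, b, a = pixel
        # Instruction 1 works on the three colour components...
        r, g, b = op_rgb(r), op_rgb(g), op_rgb(b)
        # ...while instruction 2 works on the remaining alpha component.
        a = op_a(a)
        return (r, g, b, a)

    # Example: scale the colour while biasing the alpha; in this simplified
    # model that's one clock's worth of work for a single shader unit.
    print(co_issue_3_plus_1((0.2, 0.4, 0.6, 0.5),
                            op_rgb=lambda c: c * 2.0,
                            op_a=lambda c: c + 0.25))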
The first shader unit can do texture addressing with the texture unit, using textures as source data for its fragment programs. It can also normalize an FP16 source input for free, via a pixel shader instruction that doesn't cost a regular shader slot.
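For context on why a free normalize matters, here's the maths it saves you, again just illustrative Python rather than anything hardware-specific; unit-length vectors crop up constantly in per-pixel lighting.

    import math

    # What the 'free' normalize buys you: a unit-length vector, commonly
    # needed for per-pixel lighting. This is just the maths, not the
    # hardware path.
    def normalize(v):
        length = math.sqrt(sum(c * c for c in v))
        return tuple(c / length for c in v)

    print(normalize((0.5, 1.0, 2.0)))  # e.g. a normal fetched from a texture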
NVIDIA documentation states there's a mini ALU on each shader unit, but what you can do with it per clock is unclear, at least to me.
Vertex Shader
The vertex shader, of which there are six for geometry processing, consists of a vector unit for FP32 vector ops, an FP32 scalar unit for scalar ops on vertices, and a texture lookup unit for vertex texturing. A branch processor sits after geometry production, branching on the presented vertex data, before the unit feeds its output geometry to the triangle setup engine.

The texture fetch engine can work on four textures per vertex program, which should be enough for first generation Shader Model 3.0 hardware. NVIDIA's press documentation mentions
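As a rough illustration of what vertex texturing can be used for, here's a toy Python sketch of displacement mapping, a commonly cited use: each vertex samples a height texture and gets pushed along its normal before triangle setup ever sees it. The heightmap, function names and data are all made up for illustration.

    # Hypothetical sketch of one use of vertex texturing: displacement
    # mapping. A vertex samples a height 'texture' and is moved along its
    # normal. Names and data here are invented for illustration.

    def sample_height(heightmap, u, v):
        # Nearest-neighbour lookup into a tiny 2D list standing in for a texture.
        x = min(int(u * (len(heightmap[0]) - 1)), len(heightmap[0]) - 1)
        y = min(int(v * (len(heightmap) - 1)), len(heightmap) - 1)
        return heightmap[y][x]

    def displace(position, normal, uv, heightmap, scale=1.0):
        h = sample_height(heightmap, *uv)
        return tuple(p + n * h * scale for p, n in zip(position, normal))

    heightmap = [[0.0, 0.5], [0.25, 1.0]]
    print(displace((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0), heightmap, scale=0.1))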
How it all fits together
Those basic building blocks of pixel and vertex shader units combine to form the entire NV40 3D pipeline.

The vertex shaders generate geometry data which is fed to the triangle setup engine. Here some Z-buffer optimisations are performed, possibly discarding pixels before they're sent to the pixel shaders and saving some bandwidth. That's part of the Intellisample 3.0 'engine'.
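Here's a simplified sketch of the early discard idea in toy Python: test a pixel against the depth buffer before any shading is spent on it. Treat it as a stand-in only; the real optimisations work on compressed blocks of Z values rather than single pixels, and the exact scheme isn't public.

    # Simplified sketch of early Z rejection: throw a pixel away against the
    # depth buffer before any shading work is spent on it. A stand-in for
    # part of what Intellisample-style optimisations do, not the real scheme.

    def early_z_pass(fragments, depth_buffer, shade):
        shaded = []
        for (x, y, z, data) in fragments:
            if z >= depth_buffer.get((x, y), float('inf')):
                continue                          # occluded: never reaches the pixel shaders
            depth_buffer[(x, y)] = z
            shaded.append(((x, y), shade(data)))  # only survivors get shaded
        return shaded

    depth = {}
    frags = [(0, 0, 0.5, 'near'), (0, 0, 0.9, 'far'), (1, 0, 0.3, 'lone')]
    print(early_z_pass(frags, depth, shade=str.upper))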
Geometry not discarded early is sent to the pixel shaders for processing there, where things are lit and textured using any number of methods, including the multiple render target method discussed earlier.
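To recap the multiple render target idea in miniature, the sketch below shows one shading pass writing several outputs into separate buffers for the same pixel, rather than redrawing the scene once per buffer. The colour, normal and depth targets are just example choices for illustration.

    # Loose sketch of multiple render targets: one fragment's shading work is
    # fanned out to several buffers at the same pixel position.

    def shade_fragment(material):
        return {
            'colour': material['albedo'],
            'normal': material['normal'],
            'depth':  material['depth'],
        }

    targets = {'colour': {}, 'normal': {}, 'depth': {}}
    position = (4, 7)
    outputs = shade_fragment({'albedo': (0.8, 0.2, 0.2),
                              'normal': (0.0, 1.0, 0.0),
                              'depth': 0.42})
    for name, value in outputs.items():
        targets[name][position] = value   # each render target keeps its own value per pixel
    print(targets)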
Output from the pixel shaders is then sent through the fragment crossbar, which spreads that data across sixteen pixel output units. Those units form the final part of the pipeline, where antialiasing is (maybe) done, Z and colour compression are performed, floating point blending is done and render targets are combined.
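As a loose model of the crossbar, the sketch below routes each shaded fragment to one of sixteen back-end bins; the screen-position hash is purely an assumption for illustration, since the real routing isn't documented.

    # Toy model of a fragment crossbar: any pipeline's output can be routed
    # to any of the sixteen back-end (pixel output) units. The spreading
    # function is made up for illustration.

    NUM_OUTPUT_UNITS = 16

    def route(fragment):
        x, y, colour = fragment
        return (x + y * 7) % NUM_OUTPUT_UNITS    # invented screen-position hash

    def crossbar(fragments):
        bins = {unit: [] for unit in range(NUM_OUTPUT_UNITS)}
        for frag in fragments:
            bins[route(frag)].append(frag)        # each unit then blends/compresses its share
        return bins

    work = crossbar([(x, y, 'c') for x in range(8) for y in range(8)])
    print({unit: len(frags) for unit, frags in work.items() if frags})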
Front buffer display is then done so you can see your processed frame. Repeat ad infinitum for each frame to be displayed.