Hardware Analysis
As in the NV40 article, I'll attempt to describe and then arrange R420's functional groups into a complete rendering pipeline, so you can see how it all fits together.
Pixel Shader
I mentioned a little about the pixel shader unit on R420 on page 2, but it's pertinent to cover it again here. The unit is fed by the geometry setup engine, via the Hyper Z unit, which may or may not discard pixels before they are fed in.
Shader state is loaded and evaluated, whether the pixel is entering the processor for the first time or has been looped around again for further processing, and then the shader is run.
Up to five operations can be performed per clock by each pixel shader unit. Those ops are spread across two vector ALUs (position and directional components of a pixel), two scalar ALUs (how much to scale a given vector) and a texture ALU, which handles the fifth.
Texturing (using a texture as a source of data, which may or may not be an actual visible picture texture) from the pixel shader is essential for a wide range of effects and graphics techniques.
Each vector ALU can do a three-component vector op per clock. Three components means full 3D space processing for each vector (vectors can have an arbitrary number of components/dimensions).
The vector and scalar ALU pairs aren't equal in capability, but ATI are reluctant to reveal what the differences between them are. The entire ALU setup is dual-issue, allowing for the five ops per clock scenario. NV40 can't do a texture address op separately from its other ALU ops, since its main FP32 ALU is the only unit with access to the texture unit, which gives R420 a slight performance edge in terms of raw op throughput at the same clocks.
After processing, the output fragments are sent to a combiner unit, which sorts and arranges them for further processing.
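To make the five ops per clock figure concrete, here's a minimal sketch of the arithmetic. It assumes every ALU issues one op each clock, which is a best case rather than something real shader code will sustain. The `unit_count` and `clock_hz` parameters are hypothetical placeholders for illustration, not figures quoted here.

```python
# ALU mix for a single pixel shader unit, as described above.
PIXEL_UNIT_ALUS = {"vector": 2, "scalar": 2, "texture": 1}


def peak_ops_per_clock(alus_per_unit, unit_count=1):
    """Best-case ops per clock, assuming every ALU issues one op each clock."""
    return sum(alus_per_unit.values()) * unit_count


def peak_ops_per_second(alus_per_unit, unit_count, clock_hz):
    """Scale the per-clock peak by a clock rate (hypothetical inputs)."""
    return peak_ops_per_clock(alus_per_unit, unit_count) * clock_hz


if __name__ == "__main__":
    # One unit: 2 vector + 2 scalar + 1 texture
    print(peak_ops_per_clock(PIXEL_UNIT_ALUS))
```

Plug in whatever unit count and clock speed you like to compare boards; the result is an upper bound only, since it ignores stalls and shaders that can't fill every ALU.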
Vertex Shader
Like NV40, R420 has six vertex processors. Each has a full 128-bit vector ALU and a 32-bit component scalar ALU, for maximum precision when doing scalar calculations. A branch/loop unit can then send vertices back into the pipeline for reprocessing, should the need arise.
Dual ALUs per vertex processor means two ops per clock, per processor, for twelve in total.
Final vertex output is sent to the geometry setup engine. The entire pipeline should now be clear.
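The vertex figures work out the same way. This rough sketch reproduces the twelve ops per clock total and, on the assumption that the 128-bit vector ALU is built from 32-bit components like the scalar ALU, shows how many components it handles per op. The function names are mine, purely for illustration.

```python
VERTEX_PROCESSORS = 6     # from the text
ALUS_PER_PROCESSOR = 2    # one vector ALU plus one scalar ALU
VECTOR_ALU_BITS = 128     # from the text
COMPONENT_BITS = 32       # scalar ALU width; assumed to be the vector component width too


def vertex_ops_per_clock(processors=VERTEX_PROCESSORS, alus=ALUS_PER_PROCESSOR):
    """Peak vertex ops per clock if both ALUs in every processor are busy."""
    return processors * alus


def vector_components(vector_bits=VECTOR_ALU_BITS, component_bits=COMPONENT_BITS):
    """Components per vector op, assuming equal-width components."""
    return vector_bits // component_bits


if __name__ == "__main__":
    print(vertex_ops_per_clock())
    print(vector_components())
```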
How it all fits together
The vertex processors generate vertex data which is fed to the setup engine. The setup engine arranges that generated geometry into polygons, which are assigned their needed colour, depth and texture coordinate data. They are then arranged into tiles of pixels that define each polygon. Those pixel tiles are fed into the Hyper Z unit. Hyper Z, as discussed, will discard pixels (at a maximum rate of 256 per clock) that will never be seen, before passing them to the pixel/fragment shader.
The pixel shader does its work before feeding output fragments to the SmoothVision unit for anti-aliasing and possible colour compression. Any render targets are then combined there before final output into any output buffers.
Output buffers are then displayed on screen, and the process repeats for the next visible frame.
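To illustrate the Hyper Z step in that flow, here's a deliberately simplified sketch of rejecting hidden pixels before they reach the pixel shader. It is a plain per-pixel depth comparison, not ATI's actual Hyper Z implementation, and it assumes a smaller depth value means nearer the viewer. The `Pixel` type and function names are hypothetical; only the 256 per clock peak rate comes from the article.

```python
import math
from dataclasses import dataclass

HYPER_Z_MAX_DISCARDS_PER_CLOCK = 256  # peak rate quoted above


@dataclass(frozen=True)
class Pixel:
    x: int
    y: int
    depth: float  # assumption: smaller means nearer the viewer


def early_depth_reject(pixels, depth_buffer):
    """Split incoming pixels into those worth shading and a discard count.

    depth_buffer maps (x, y) to the nearest depth seen so far.
    """
    visible = []
    discarded = 0
    for p in pixels:
        nearest = depth_buffer.get((p.x, p.y), float("inf"))
        if p.depth < nearest:
            depth_buffer[(p.x, p.y)] = p.depth  # new nearest surface
            visible.append(p)
        else:
            discarded += 1  # hidden behind something already drawn
    return visible, discarded


def min_clocks_to_discard(pixel_count):
    """Fewest clocks needed to discard pixel_count pixels at the peak rate."""
    return math.ceil(pixel_count / HYPER_Z_MAX_DISCARDS_PER_CLOCK)


if __name__ == "__main__":
    incoming = [Pixel(0, 0, 0.5), Pixel(0, 0, 0.8), Pixel(1, 0, 0.3)]
    to_shade, dropped = early_depth_reject(incoming, {})
    print(len(to_shade), dropped)
```

Note that a pixel is only rejected against what has already been drawn, so submission order matters: the pixel at depth 0.8 is dropped here because the nearer one at 0.5 arrived first.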