The test to determine which of the pixels in a 2x2 quad is now done in
the fragment shader rather than in the calling C code. This is a little
faster but there's a few more things to do.
Note that the step[] array elements are in a different order now. Rather
than being in row-major order for the 4x4 grid, they're in "quad-major"
order. The setup of the step arrays is a little more complicated now.
So is the course/intermediate tile test code, but some lookup tables
help with that.
Next steps:
- early-cull 2x2 quads which are totally outside the triangle.
- skip the in/out test for fully contained quads
- make the in/out comparison code tighter/faster.
This matches the convention used by the recursive rasterizer.
Also fixed assorted typos, comments, etc.
Now tri-z.c, gears.c, etc look basically right but there's still some
cracks in triangle rasterization.
This was redundant as drivers can just keep track of whether they are
inside a begin/end query pair. We want to add more query types later
and also support nested queries, none of which map well onto a flag like
this. No driver appeared to be using the flag.
New control flow helper functions which keep track of all variables
and generate the correct Phi functions.
This re-enables skipping the fs execution of quads masked out by
the rasterizer, early z testing, and kill opcode.
This yields a performance improvement of around 20%.
Finally a substantial performance improvement: framerates of apps using
texturing tripled, and furthermore, enabling/disabling texturing only
affects around 15% of the framerate, which means the bottleneck is now
somewhere else.
Generated texture sampling code is not complete though -- we always
sample from the base level -- so final figures will be different.
Special attention is given to the interpolation of side by side quads.
Multiplications are made only for the first quad. Interpolation of
inputs for posterior quads are done exclusively with additions, and
perspective divide if necessary.