Easier than making more extensive use of rpt, and the more compact
shaders seem to bring some bit of performance boost. (Perhaps repeat
flag benefits are more than just instruction cache, possibly it saves
on instruction decode as well?)
Signed-off-by: Rob Clark <robclark@freedesktop.org>
It seems the write-after-read hazard that applies to texture fetch
instructions, also applies to sfu instructions.
Also, cat5/cat6 instructions do not have a (ss) bit, so in these
cases we need to insert a dummy nop instruction with (ss) bit set.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
Seems texture sample instructions don't immediately consume there
src(s). In fact, some shaders from blob compiler seem to indiciate that
it does not even count the texture sample instructions when calculating
number of delay slots to fill for non-sample instructions. (Although so
far it seems inconclusive as to whether this is required.)
In particular, when a src register of a previous texture sample
instruction is clobbered, the (ss) bit is needed to synchronize with the
tex pipeline to ensure it has picked up the previous values before they
are overwritten.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
Was supposed to be a '+', otherwise we end up with a negative offset and
choosing registers below the assigned range.
This seems to fix the scheduling mystery "solved" by adding in extra
delay slots.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
Since 'kill' does not produce a result, the new compiler was happily
optimizing them out. We need to instead track 'kill's similar to
outputs. But since there is no non-predicated kill instruction,
(and for flattend if/else we do want them to be predicated), we need
to track the topmost branch condition on the stack and use that as src
arg to the kill. For a kill at the topmost level, we have to generate
an immediate 1.0 to feed into the cmps.f for setting the predicate
register.
Signed-off-by: Rob Clark <robclark@freedesktop.org>
The new compiler generates a dependency graph of instructions, including
a few meta-instructions to handle PHI and preserve some extra
information needed for register assignment, etc.
The depth pass assigned a weight/depth to each node (based on sum of
instruction cycles of a given node and all it's dependent nodes), which
is used to schedule instructions. The scheduling takes into account the
minimum number of cycles/slots between dependent instructions, etc.
Which was something that could not be handled properly with the original
compiler (which was more of a naive TGSI translator than an actual
compiler).
The register assignment is currently split out as a standalone pass. I
expect that it will be replaced at some point, once I figure out what to
do about relative addressing (which is currently the only thing that
should cause fallback to old compiler).
There are a couple new debug options for FD_MESA_DEBUG env var:
optmsgs - enable debug prints in optimizer
optdump - dump instruction graph in .dot format, for example:
http://people.freedesktop.org/~robclark/a3xx/frag-0000.dot.pnghttp://people.freedesktop.org/~robclark/a3xx/frag-0000.dot
At this point, thanks to proper handling of instruction scheduling, the
new compiler fixes a lot of things that were broken before, and does not
appear to break anything that was working before[1]. So even though it
is not finished, it seems useful to merge it in it's current state.
[1] Not merged in this commit, because I'm not sure if it really belongs
in mesa tree, but the following commit implements a simple shader
emulator, which I've used to compare the output of the new compiler to
the original compiler (ie. run it on all the TGSI shaders dumped out via
ST_DEBUG=tgsi with various games/apps):
163b6306b1
Signed-off-by: Rob Clark <robclark@freedesktop.org>