Since this graph is actually not oriented, its adjacency matrix can be
represented using less than half bits required by full adjacency matrix.
It reduces memory consumption and number of cache misses. It also simplifies
logic of growing this matrix - no need to touch adjacency bits for previously
allocated number of nodes.
Move adjacency bits from nodes to graph to reduce the number of allocations.
No changes to shader-db.
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Signed-off-by: Kostiantyn Lazukin <kostiantyn.lazukin@globallogic.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14189>
Now that we have an idea of how many regs the conflicting allocation uses,
we can just skip to the next one and save repeated tests to find the same
conflicting neighbor again.
shadowrun-returns shader-db time on skl -1.62821% +/- 1.58079% (n=679),
now there's no statistically significant change from the start of the series
(n=420)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9437>
In the fully general case of register classes, to expose an allocation
class of unaligned 2-contiguous-regs allocations, for example, you'd have
your base individual regs (128 on intel), and another set of 127 regs that
each conflicted with the corresponding pair of the base regs. Single-reg
nodes would allocate in the 128, and double-reg nodes would allocate in
the 127 and the user would remap from the 127 down to the base regs with
some irritating table.
If you need many different contiguous allocation sizes (16 is a pretty
common number across drivers), your number of regs explodes, wasting
memory and making the q computation expensive at startup.
If all the user has is contiguous-reg classes, we can easily compute the q
value up front (as found in the intel driver and nouveau, for example),
and we only have to change a couple of places in the conflict-checking
logic so the contiguous-reg classes can use the base registers.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9437>
BITSET_FOR_EACH_SET can walk a sparse set (such as a register class's set
of registers) much faster than just iterating over individual bits.
Improves freedreno startup time (as measured by shader-db ./run
shaders/closed/gputest/triangle on my x86 system) by -4.12679% +/-
1.99006% (n=151)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4537>
This make the code significantly more readable, I think (along with
shorter). Also, using util_dynarray_delete_unordered() saves us a move of
the rest of the list when removing adjacency on a node.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4537>
This moves the fi_types to a new mesa_private.h and removes the
imports.c file. The vast majority of this patch is just removing
pound includes of imports.h and fixing up the recursive includes.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3024>
In the select_reg callback, I want to be able to determine if a given
node is already assigned, and if so what physical register has been
assigned.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4071>
Add a parameter so the callback can know which node it is selecting a
register for. And remove the graph parameter, as it is unused by
existing users, and somewhat unnecessary (ie. the callback data could
be used instead).
And add a comment so $future_me remembers how this works.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4071>
Setting up transitive conflicts between a full register and its two
half registers (eg r0.x and hr0.x and hr0.y) will make the half
registers conflict. They don't actually conflict and this prevents us
from using both at the same time.
Add and use a new ra helper that sets up transitive conflicts between
a register and its subregisters, except it carefully avoids the
subregister conflict.
Signed-off-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Rob Clark <robdclark@chromium.org>
Complements the existing getters and the setter for node class. To be
used in the Panfrost RA refactor.
Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
We want to be able to call ra_allocate() and, when it fails, mutate the
graph and try again rather than re-building the graph from scratch.
This commit moves all the scratch bits except the final register
allocation (which is really an out value not scratch) into sub-structs
named "tmp" to make it clear which things are scratch. It also adds
bits to the ra_select() initialization loop to initialize things (since
we can't trust rzalloc anymore) and copy q_test and forced_reg over.
Reviewed-by: Eric Anholt <eric@anholt.net>
The most expensive part of register allocation is the ra_simplify step
which is a fixed-point algorithm with a worst-case complexity of O(n^2)
which adds the registers to a stack which we then use later to do the
actual allocation. This commit uses bit sets and changes the core loop
of ra_simplify to first walk 32-node chunks and then walk each chunk.
This lets us skip whole 32-node chunks in one go based on bit operations
and compute the minimum q value potentially 32x as fast. Of course, the
algorithm still has the same fundamental O(n^2) run-time but the
constant is now much lower.
In the nasty Aztec Ruins compute shader, this shaves a full four seconds
off the 30s compile time for a release build of mesa. In a debug build
(needed for accurate stack traces), perf says that ra_select takes 20%
of runtime before this patch and only 5-6% of runtime after this patch.
It also makes shader-db runs faster.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311100 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355468050 -> 355468050 (0.00%)
cycles in affected programs: 0 -> 0
helped: 0
HURT: 0
Total CPU time (seconds): 2602.37 -> 2524.31 (-3.00%)
Reviewed-by: Eric Anholt <eric@anholt.net>
We only use q_total if the reg is not assigned so there's no point in
updating it if the reg is not assigned. This has no known perf benefit
but it will reduce churn in a future commit.
Reviewed-by: Eric Anholt <eric@anholt.net>
This shaves about half a second off the 30 second compile time of one of
the compute shaders in Aztec ruins.
Reviewed-by: Eric Anholt <eric@anholt.net>
- remove mtypes.h from most header files
- add main/menums.h for often used definitions
- remove main/core.h
v2: fix radv build
Reviewed-by: Brian Paul <brianp@vmware.com>
CID: 1415909
Fixes: 7a34a0e890 "ra: Add a callback for selecting a register
from what's available."
Signed-off-by: Eric Engestrom <eric@engestrom.ch>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
VC4 has had a tension, similar to pre-Sandybridge Intel, where we want to
use low-numbered registers (more parallelism on Intel, fewer delay slots
on vc4), but in order to give instruction scheduling the most freedom to
avoid delays we want to round-robin between registers of the same cost.
Our two heuristics so far have chosen one end or the other of that
tradeoff.
The callback, instead, hands the driver the set of registers that are
available, and the driver gets to make its own choice. This will be used
in vc4 to round-robin between registers of the same cost, and might be
used in the future for improving bank selection.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
All the paths looping over adjacency had guards against considering
themselves (the non-obvious one was ra_any_neighbors_conflict(), which has
in_stack set).
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
I was going to indent this code another level, and decided it would be
easier to read as a helper.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Since i965 is now using make_reg_conflicts_transitive and doesn't need
q-value computations, they are disabled on i965. They are enabled
everywhere else so that they get the old behavior. This reduces the time
spent in eglInitialize() on BDW by around 10-15%.
Reviewed-by: Eric Anholt <eric@anholt.net>
The round-robin allocation strategy is expected to decrease the amount
of false dependencies created by the register allocator and give the
post-RA scheduling pass more freedom to move instructions around. On
the other hand it has the disadvantage of increasing fragmentation and
decreasing the number of equally-colored nearby nodes, what increases
the likelihood of failure in presence of optimistically colorable
nodes.
This patch disables the round-robin strategy for optimistically
colorable nodes. These typically arise in situations of high register
pressure or for registers with large live intervals, in both cases the
task of the instruction scheduler shouldn't be constrained excessively
by the dense packing of those nodes, and a spill (or on Intel hardware
a fall-back to SIMD8 mode) is invariably worse than a slightly less
optimal scheduling.
Shader-db results on the i965 driver:
total instructions in shared programs: 5488539 -> 5488489 (-0.00%)
instructions in affected programs: 1121 -> 1071 (-4.46%)
helped: 1
HURT: 0
GAINED: 49
LOST: 5
v2: Re-enable round-robin already for the lowest one of the nodes
pushed optimistically onto the sack (Connor).
v3: Use UINT_MAX instead of ~0, open-code MIN2 (Jason, Connor).
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
The i965 backends pass something out of 'screen', which is allocated
per-process, making using this as a ralloc context not thread-safe.
All callers ra_alloc_interference_graph() already ralloc_free() its
return value.
Reviewed-by: Jason Ekstrand <jason.ekstrand@intel.com>
This shouldn't be a functional change since reg_belongs_to_class is just a
wrapper around BITSET_TEST. It just makes the code a little easier to
read.
Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
The r300 gallium driver is using it outside of the Mesa tree, and I wanted
to do so for vc4 as well. Rather than make the multiple-definitions
problem even more complicated, just move it to more-shared code.
v2: Don't forget to delete the symlink in r300 (review by Matt).
Delete more r300-helper references (review by Emil)
Don't prefix util/ header inclusion with "util/" (review by Emil)
Reviewed-by: Matt Turner <mattst88@gmail.com> (v1)
Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com> (v1)
2014-09-23 13:40:10 -07:00
Renamed from src/mesa/program/register_allocate.c (Browse further)