v2:
* Add build error for gen > 6 if MOCS is not set. (Lionel)
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
There's some debate about whether we should support this on older
hardware as well. Currently i965 turns it off on Gen8- though, so
we follow suit. If this changes, we can update this as well.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Corresponding to GL_ARB_fragment_shader_interlock and
GL_NV_fragment_shader_interlock. Currently, only the NIR paths
support this functionality, but someone could conceivably add it
to TGSI too.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
In the future I want to expand this to 128-bits, for vec16 support, so
lets just put the code in place to use bitset ranges now.
v2: just declare the bitset to be the max of what we should ever see
and change assert to reflect it.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Our tessellation control shaders can be dispatched in several modes.
- SINGLE_PATCH (Gen7+) processes a single patch per thread, with each
channel corresponding to a different patch vertex. PATCHLIST_N will
launch (N / 8) threads. If N is less than 8, some channels will be
disabled, leaving some untapped hardware capabilities. Conditionals
based on gl_InvocationID are non-uniform, which means that they'll
often have to execute both paths. However, if there are fewer than
8 vertices, all invocations will happen within a single thread, so
barriers can become no-ops, which is nice. We also burn a maximum
of 4 registers for ICP handles, so we can compile without regard for
the value of N. It also works in all cases.
- DUAL_PATCH mode processes up to two patches at a time, where the first
four channels come from patch 1, and the second group of four come
from patch 2. This tries to provide better EU utilization for small
patches (N <= 4). It cannot be used in all cases.
- 8_PATCH mode processes 8 patches at a time, with a thread launched per
vertex in the patch. Each channel corresponds to the same vertex, but
in each of the 8 patches. This utilizes all channels even for small
patches. It also makes conditions on gl_InvocationID uniform, leading
to proper jumps. Barriers, unfortunately, become real. Worse, for
PATCHLIST_N, the thread payload burns N registers for ICP handles.
This can burn up to 32 registers, or 1/4 of our register file, for
URB handles. For Vulkan (and DX), we know the number of vertices at
compile time, so we can limit the amount of waste. In GL, the patch
dimension is dynamic state, so we either would have to waste all 32
(not reasonable) or guess (badly) and recompile. This is unfortunate.
Because we can only spawn 16 thread instances, we can only use this
mode for PATCHLIST_16 and smaller. The rest must use SINGLE_PATCH.
This patch implements the new 8_PATCH TCS mode, but leaves us using
SINGLE_PATCH by default. A new INTEL_DEBUG=tcs8 flag will switch to
using 8_PATCH mode for testing and benchmarking purposes. We may
want to consider using 8_PATCH mode in Vulkan in some cases.
The data I've seen shows that 8_PATCH mode can be more efficient in
some cases, but SINGLE_PATCH mode (the one we use today) is faster
in other cases. Ultimately, the TES matters much more than the TCS
for performance, so the decision may not matter much.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The payload field is actually "instance" (thread number), which is used
to calculate the invocation ID.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
When we add 8_PATCH mode, this will get a bit more complex, so we may
as well start by putting it in a helper function.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This diverged back in f1374805a8 ("drm-uapi: use local files, not system
libdrm") to point at drm-uapi's copy, which we don't need now that we're
actually in drm-uapi.
Reviewed-by: Rob Clark <robdclark@gmail.com>
The goal is to avoid having an extra MOV instruction to perform the
saturate. Doing the subtraction first allows the saturate to be applied
to the ADD instruction making the MOV unnecessary. Values generated in
different block and values from non-ALU instructions (e.g., texture
instructions) almost always need the extra MOV.
Multiply instructions are restricted because doing this rearrangement
can interfere with the generation of flrp and ffma instructions.
v2: Now that the final method has been selected, squash three commits
into one.
All Intel platforms has similar results. (Ice Lake shown)
total instructions in shared programs: 17223214 -> 17219386 (-0.02%)
instructions in affected programs: 1524376 -> 1520548 (-0.25%)
helped: 2686
HURT: 26
helped stats (abs) min: 1 max: 32 x̄: 1.44 x̃: 1
helped stats (rel) min: 0.03% max: 16.67% x̄: 0.54% x̃: 0.37%
HURT stats (abs) min: 1 max: 2 x̄: 1.69 x̃: 2
HURT stats (rel) min: 0.33% max: 1.67% x̄: 0.54% x̃: 0.35%
95% mean confidence interval for instructions value: -1.46 -1.36
95% mean confidence interval for instructions %-change: -0.56% -0.50%
Instructions are helped.
total cycles in shared programs: 360811571 -> 360791896 (<.01%)
cycles in affected programs: 103650214 -> 103630539 (-0.02%)
helped: 1557
HURT: 675
helped stats (abs) min: 1 max: 1773 x̄: 41.44 x̃: 16
helped stats (rel) min: <.01% max: 26.77% x̄: 1.37% x̃: 0.64%
HURT stats (abs) min: 1 max: 1513 x̄: 66.44 x̃: 14
HURT stats (rel) min: <.01% max: 46.16% x̄: 2.00% x̃: 0.49%
95% mean confidence interval for cycles value: -14.82 -2.81
95% mean confidence interval for cycles %-change: -0.50% -0.20%
Cycles are helped.
LOST: 2
GAINED: 0
Reviewed-by: Matt Turner <mattst88@gmail.com> [v1]
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
The value-range tracking pass that is coming is not clever enough to
know that the result of the ffma must be non-negative. Making it that
smart will require quite a bit of work. It might be possible to add a
special case that detects that a whole tree of fadd(fmul(fsat(a),
fneg(fsat(a))), 1.0) cannot be negative.
For cases when the comparison is used in the domain guard for a
square-root (see nir/algebraic: Simplify fsqrt domain guard), the
compare may be converted to a fmax. This patch also handles that case.
All of the affected cases are in DiRT: Showdown.
All Gen7+ platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 17225365 -> 17225303 (<.01%)
instructions in affected programs: 40051 -> 39989 (-0.15%)
helped: 62
HURT: 0
helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
helped stats (rel) min: 0.07% max: 0.66% x̄: 0.27% x̃: 0.26%
95% mean confidence interval for instructions value: -1.00 -1.00
95% mean confidence interval for instructions %-change: -0.31% -0.22%
Instructions are helped.
total cycles in shared programs: 360842788 -> 360842595 (<.01%)
cycles in affected programs: 1818081 -> 1817888 (-0.01%)
helped: 29
HURT: 22
helped stats (abs) min: 1 max: 206 x̄: 20.66 x̃: 14
helped stats (rel) min: <.01% max: 9.55% x̄: 0.87% x̃: 0.42%
HURT stats (abs) min: 1 max: 108 x̄: 18.45 x̃: 7
HURT stats (rel) min: <.01% max: 4.48% x̄: 0.56% x̃: 0.19%
95% mean confidence interval for cycles value: -14.48 6.91
95% mean confidence interval for cycles %-change: -0.71% 0.21%
Inconclusive result (value mean confidence interval includes 0).
No changes on any other Intel platform.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
Without this, adding an algebraic rule like
(('bcsel', ('flt', a, 0.0), 0.0, ...), ...),
will cause assertion failures inside nir_src_comp_as_float in
GTF-GL46.gtf21.GL.lessThan.lessThan_vec3_frag (and related tests) from
the OpenGL CTS and shaders/closed/steam/witcher-2/511.shader_test from
shader-db.
All of these cases have some code that ends up like
('bcsel', ('flt', a, 0.0), 'b@1', ...)
When the 'b@1' is tested, nir_src_comp_as_float fails because there's
no such thing as a 1-bit float.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
This change also enables a later change (nir/algebraic: Replace
1-fsat(a) with fsat(1-a)) to affect more shaders.
Almost all of the affected shaders are in Bioshock Infinite, and all of
those shaders all require GLSL 4.10.
All Intel platforms had similar results. (Ice Lake shown)
total instructions in shared programs: 17228584 -> 17228376 (<.01%)
instructions in affected programs: 31438 -> 31230 (-0.66%)
helped: 105
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 1.98 x̃: 1
helped stats (rel) min: 0.08% max: 1.53% x̄: 0.73% x̃: 0.70%
95% mean confidence interval for instructions value: -2.20 -1.76
95% mean confidence interval for instructions %-change: -0.80% -0.67%
Instructions are helped.
total cycles in shared programs: 360936431 -> 360935690 (<.01%)
cycles in affected programs: 420100 -> 419359 (-0.18%)
helped: 71
HURT: 21
helped stats (abs) min: 1 max: 160 x̄: 19.28 x̃: 10
helped stats (rel) min: <.01% max: 9.78% x̄: 0.95% x̃: 0.48%
HURT stats (abs) min: 1 max: 198 x̄: 29.90 x̃: 10
HURT stats (rel) min: 0.05% max: 8.36% x̄: 1.24% x̃: 0.90%
95% mean confidence interval for cycles value: -16.77 0.66
95% mean confidence interval for cycles %-change: -0.85% -0.06%
Inconclusive result (value mean confidence interval includes 0).
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Thomas Helland <thomashelland90@gmail.com>
This doesn't make any real difference now, but future work (not in this
series) will add a LOT of ffma patterns. Having to duplicate all of
them for ffma(a, b, c) and ffma(b, a, c) is just terrible.
No shader-db changes on any Intel platform.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
v2: Instead of handling 3 sources as a special case, generalize with
loops to N sources. Suggested by Jason.
v3: Further generalize by only checking that number of sources is >= 2.
Suggested by Jason.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The meaning of the new name is that the first two sources are
commutative. Since this is only currently applied to two-source
operations, there is no change.
A future change will mark ffma as 2src_commutative.
It is also possible that future work will add 3src_commutative for
opcodes like fmin3.
v2: s/commutative_2src/2src_commutative/g. I had originally considered
this, but I discarded it because I did't want to deal with identifiers
that (should) start with 2. Jason suggested it in review, so we decided
that _2src_commutative would be used in nir_opcodes.py. Also add some
comments documenting what 2src_commutative means. Also suggested by
Jason.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Instead of re-building the interference graph every time we spill, we
modify it in place so we can avoid recalculating liveness and the whole
O(n^2) interference graph building process. We make a simplifying
assumption in order to do so which is that all spill/fill temporary
registers live for the entire duration of the instruction around which
we're spilling. This isn't quite true because a spill into the source
of an instruction doesn't need to interfere with its destination, for
instance. Not re-calculating liveness also means that we aren't
adjusting spill costs based on the new liveness. The combination of
these things results in a bit of churn in spilling. It takes a large
cut out of the run-time of shader-db on my laptop.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311224 -> 15311360 (<.01%)
instructions in affected programs: 77027 -> 77163 (0.18%)
helped: 11
HURT: 18
total cycles in shared programs: 355544739 -> 355830749 (0.08%)
cycles in affected programs: 203273745 -> 203559755 (0.14%)
helped: 234
HURT: 190
total spills in shared programs: 12049 -> 12042 (-0.06%)
spills in affected programs: 2465 -> 2458 (-0.28%)
helped: 9
HURT: 16
total fills in shared programs: 25112 -> 25165 (0.21%)
fills in affected programs: 6819 -> 6872 (0.78%)
helped: 11
HURT: 16
Total CPU time (seconds): 2469.68 -> 2360.22 (-4.43%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This is slightly less convenient in some places but it will make it much
easier when we want to start adding nodes dynamically.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The old code was arranged by the type of interference being added. It
would set up payload registers and then add payload interference for all
VGRFs. It would set up MRFs and add MRF interference for all VGRFs.
This commit re-arranges things to be organized differently. It first
creates and sets up all RA nodes and then groups interference into two
new categories: live range and instruction interference. Once all the
RA nodes have been set up, it walks the list of VGRFs and sets up their
live range interference and then walks the list of instructions and sets
up instruction interference. This new arrangement will be advantageous
for a future patch but, at the moment, it cuts 2% off the run-time of
shader-db on my laptop.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311224 -> 15311224 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355544739 -> 355544739 (0.00%)
cycles in affected programs: 0 -> 0
helped: 0
HURT: 0
Total CPU time (seconds): 2523.45 -> 2469.68 (-2.13%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The only use of the MRF hack these days is for spilling and there we
don't need the precise MRF usage information. If we're spilling then we
know pretty well how many MRFs are going to be used. It is possible if
the only things that are spilled have fewer SIMD channels than the
dispatch width of the shader that this may be more MRFs than needed.
That's a risk we're willing to takd.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311224 (<.01%)
instructions in affected programs: 16664 -> 16788 (0.74%)
helped: 1
HURT: 5
total cycles in shared programs: 355543197 -> 355544739 (<.01%)
cycles in affected programs: 731864 -> 733406 (0.21%)
helped: 3
HURT: 6
The hurt shaders are all SIMD32 compute shaders where we reserve enough
space for a 32-wide spill/fill but don't need it.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This accomplishes two things. First, it makes interfaces which are
really private to RA private to RA. Second, it gives us a place to
store some common stuff as we go through the algorithm.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
There's no reason why we need to use the calculated payload_node_count
value which is just first_non_payload_grf aligned up. The grf_used
value will be aligned up to 16 anyway (which is a much bigger alignment)
before being handed off to hardware.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We only have one node per VGRF so this was adding way too much
interference. No idea how we didn't catch this before.
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311100 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355468050 -> 355543197 (0.02%)
cycles in affected programs: 2472492 -> 2547639 (3.04%)
helped: 17
HURT: 20
Fixes: 014edff0d2 "intel/fs: Add interference between SENDS sources"
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We want to be able to call ra_allocate() and, when it fails, mutate the
graph and try again rather than re-building the graph from scratch.
This commit moves all the scratch bits except the final register
allocation (which is really an out value not scratch) into sub-structs
named "tmp" to make it clear which things are scratch. It also adds
bits to the ra_select() initialization loop to initialize things (since
we can't trust rzalloc anymore) and copy q_test and forced_reg over.
Reviewed-by: Eric Anholt <eric@anholt.net>
Unfortunately, we can't quite follow the standard C conventions for
these because ralloc doesn't know the sizes of pointers.
Reviewed-by: Eric Anholt <eric@anholt.net>
In the last phase of the schedule and RA loop, the RA call is redundant
if we spill. Immediately afterwards, we're going to see that we
couldn't allocate without spilling and call back into RA and tell it to
go ahead and spill. We've known about it for a while but we've always
brushed over it on the theory that, if you're going to spill, you'll be
calling RA a bunch anyway and what does one extra RA hurt? As it turns
out, it hurts more than you'd expect. Because the RA interference graph
gets sparser with each spill and the RA algorithm is more efficient on
sparser graphs, the RA call that we're duplicating is actually the most
expensive call in the RA-and-spill loop.
There's another extra RA call we do that's a bit harder to see which
this also removes. If we try to compile a shader that isn't the minimum
dispatch width and it fails to allocate without spilling we call fail()
to set an error but then go ahead and do the first spilling RA pass and
only after that's complete do we detect the fail and bail out. By
making minimum dispatch widths part of the spill condition, we side-step
this problem.
Getting rid of these extra spills takes the compile time of a nasty
Aztec Ruins shader from about 28 seconds to about 26 seconds on my
laptop. It also makes shader-db 1.5% faster
Shader-db results on Kaby Lake:
total instructions in shared programs: 15311100 -> 15311100 (0.00%)
instructions in affected programs: 0 -> 0
helped: 0
HURT: 0
total cycles in shared programs: 355468050 -> 355468050 (0.00%)
cycles in affected programs: 0 -> 0
helped: 0
HURT: 0
Total CPU time (seconds): 2524.31 -> 2486.63 (-1.49%)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>