We can use a union in si_shader_key::mono.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Heaven LDS usage for LS+HS is below. The masks are "outputs_written"
for LS and HS. Note that 32K is the maximum size.
Before:
heaven_x64: ls=1f1 tcs=1f1, lds=32K
heaven_x64: ls=31 tcs=31, lds=24K
heaven_x64: ls=71 tcs=71, lds=28K
After:
heaven_x64: ls=3f tcs=3f, lds=24K
heaven_x64: ls=7 tcs=7, lds=13K
heaven_x64: ls=f tcs=f, lds=17K
All other apps have a similar decrease in LDS usage, because
the "outputs_written" masks are similar. Also, most apps don't write
POSITION in these shader stages, so there is room for improvement.
(tight per-component input/output packing might help even more)
It's unknown whether this improves performance.
Tested-by: Edmondo Tommasina <edmondo.tommasina@gmail.com>
Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
LS is merged into TCS. If there is no TCS, LS is merged into fixed-func
TCS. The problem is the fixed-func TCS was ignored by scratch update
functions, so LS didn't have the scratch buffer set up.
Note that Mesa 17.1 doesn't have merged shaders.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
If we enqueue too many jobs and destroy the GL context, it may take
several seconds before the jobs finish. Just drop them instead.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
What we care about is whether PrimID is used while tessellation is
enabled; whether it's used in TCS/TES or further down the pipeline is
irrelevant.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
This builds on commit 0549ea15ec ("radeonsi: fix primitive ID in
fragment shader when using tessellation").
Fixes piglit
arb_tessellation_shader/execution/gs-primitiveid-instanced.shader_test
Cc: 17.1 <mesa-stable@lists.freedesktop.org>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
By keeping track of fewer generics, everything can fit into 64 bits.
Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
OpenGL uses at most 32 generic outputs/inputs in any stage, and they always
have a shader IO index and therefore fit into the outputs_written/
inputs_read/kill_outputs fields.
However, Nine uses semantic indices more liberally. We support that
in VS-PS pipelines, except that the optimization of killing outputs
must be skipped.
Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Make it a bit clearer that the index spaces are logically seperate by
having them defined in different functions.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
In a VS->TCS->TES->PS pipeline, the primitive ID is read from TES exports,
so it is as if TES were using the primitive ID.
Specifically, this fixes a bug where the primitive ID is not reset at
the start of a new instance.
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Picked from a different branch. When we stop using the scratch patching,
this function will not be called.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This removes s_load_dword latency for tess rings.
We need just 1 SGPR for the address if we use 64K alignment. The final asm
for recreating the descriptor is:
// s2 is (address >> 16)
s_mov_b32 s3, 0
s_lshl_b64 s[4:5], s[2:3], 16
s_mov_b32 s6, -1
s_mov_b32 s7, 0x27fac
v2: bitcast the descriptor type from v2i64 to v4i32
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The use of PrimID in the pixel shader is too rare to deserve such
a sizable support code.
The initial idea of the VS epilog was to move the clipping code there and
remove it based on states, but optimized variants are now used to do that
and are easier to support, so the VS epilog has turned out to be not so
useful.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The 2nd shader of merged shaders should take a reference of the 1st shader.
The next commit will do that.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This code can be shared by radv, we bump the max to
VARYING_SLOT_MAX here, but that shouldn't have too
much fallout.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Signed-off-by: Dave Airlie <airlied@redhat.com>
This will allow to hash additional data into the cache keys or even
change the hashing algorithm easily, should we decide to do so.
v2: don't try to compute key (and crash) if cache is disabled
Signed-off-by: Grazvydas Ignotas <notasas@gmail.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
threaded gallium can't use pipe_context's LLVM target machine, because
create_shader_selector can be called from a non-driver thread.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
pipe_mutex_unlock() was made unnecessary with fd33a6bcd7.
Replaced using:
find ./src -type f -exec sed -i -- \
's:pipe_mutex_unlock(\([^)]*\)):mtx_unlock(\&\1):g' {} \;
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
replace pipe_mutex_lock() was made unnecessary with fd33a6bcd7.
Replaced using:
find ./src -type f -exec sed -i -- \
's:pipe_mutex_lock(\([^)]*\)):mtx_lock(\&\1):g' {} \;
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
pipe_mutex_destroy() was made unnecessary with fd33a6bcd7.
Replace was done with:
find ./src -type f -exec sed -i -- \
's:pipe_mutex_destroy(\([^)]*\)):mtx_destroy(\&\1):g' {} \;
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
pipe_mutex_init() was made unnecessary with fd33a6bcd7.
Replace was done using:
find ./src -type f -exec sed -i -- \
's:pipe_mutex_init(\([^)]*\)):(void) mtx_init(\&\1, mtx_plain):g' {} \;
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
V2:
- when loading from disk cache also binary insert into memory cache.
- check that the binary loaded from disk is the correct size. If not
delete the cache item and skip loading from cache.
V3:
- remove unrequired variable
Reviewed-by: Grigori Goronzy <greg@chown.ath.cx>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
R600_DEBUG=mono has had no effect since:
commit 1fabb29717
Author: Marek Olšák <marek.olsak@amd.com>
Date: Tue Feb 14 22:08:32 2017 +0100
radeonsi: have separate LS and ES main shader parts in the shader selector
Also, this assertion was failing:
si_state_shaders.c:1307: si_shader_select_with_key: Assertion
`!shader->is_optimized' failed.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
We were unconditionally storing these outputs, sometimes even one component
at a time, but apps never read them in TES.
Move the TESSINNER/OUTER buffer stores into the TCS epilog where we can
easily disable them on demand.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The same PS epilog workaround as for 8-bit integer formats is required,
since the CB doesn't do clamping.
Fixes GL45-CTS.gtf32.GL3Tests.packed_pixels.packed_pixels*.
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
The update frequency is very low.
Difference: Only account for the size when allocating a new one and when
starting a new IB, and check for NULL. (v3)
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The perf difference is very small: 0.99% -> 0.40% for the time spent
in si_get_ia_multi_vgt_param when si_draw_vbo is 20%. Pretty much nothing.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
If the shader selector is created with a different context than
the shader variant, we should use the calling context's target machine
for the shader variant.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99419
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This seems to fix the GPU hangs caused by:
commit ed3190b3f3
Author: Marek Olšák <marek.olsak@amd.com>
Date: Sun Nov 13 18:41:43 2016 +0100
radeonsi: don't export ClipVertex and ClipDistance[] if clipping is disabled
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99219
Tested-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
We can hardcode all of the fields for swizzling in the geometry shader.
The advantage is that we use fewer descriptor slots and we no longer have to
update any of the (ring) descriptors when the geometry shader changes.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
A past commit added the ability to compile "optimized" shader variants
asynchronously (not stalling the app).
This commit builds upon that and adds what is basically a runtime shader
linker. If a VS output isn't used by the currently-bound PS, a new VS
compilation is started without that output. The new shader variant
is used when it's ready.
All apps using separate shader objects I've seen had unused VS outputs.
Eliminating unused/useless VS outputs also eliminates the corresponding
vertex attribute loads.
Tested-by: Edmondo Tommasina <edmondo.tommasina@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This is the first user of optimized monolithic shader variants.
Cull distances can't be disabled by states.
Tested-by: Edmondo Tommasina <edmondo.tommasina@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
key->part.*: prolog and epilog flags only
key->as_{ls,es}: special flags
key->mono.*: flags for monolithic compilation only
Tested-by: Edmondo Tommasina <edmondo.tommasina@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The hardware always treats the alpha channel as unsigned, so add a shader
workaround. This is rare enough that we'll just build a monolithic vertex
shader.
The SINT case cannot actually happen in OpenGL, but I've included it for
completeness since it's just a mix of the other cases.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Fixes GL45-CTS.geometry_shader.adjacency.adjacency_indiced_triangle_strip and
others.
This leaves the case of triangle strips with adjacency and primitive restarts
open. It seems that the only thing that cares about that is a piglit test.
Fixing this efficiently would be really involved, and I don't want to use the
hammer of degrading to software handling of indices because there may well
be software that uses this draw mode (without caring about the precise
rotation of triangles).
v2:
- skip the GS prolog entirely if workaround is not needed
- only check for TES (TES is always non-null when tessellation is used)
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
The copy shader only depends on the selector. This change avoids creating
separate code paths for monolithic vs. non-monolithic geometry shaders.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
These constant value VS PARAM exports:
- 0,0,0,0
- 0,0,0,1
- 1,1,1,0
- 1,1,1,1
can be loaded into PS inputs using the DEFAULT_VAL field, and the VS exports
can be removed from the IR to save export & parameter memory.
After LLVM optimizations, analyze the IR to see which exports are equal to
the ones listed above (or undef) and remove them if they are.
Targeted use cases:
- All DX9 eON ports always clear 10 VS outputs to 0.0 even if most of them
are unused by PS (such as Witcher 2 below).
- VS output arrays with unused elements that the GLSL compiler can't
eliminate (such as Batman below).
The shader-db deltas are quite interesting:
(not from upstream si-report.py, it won't be upstreamed)
PERCENTAGE DELTAS Shaders PARAM exports (affected only)
batman_arkham_origins 589 -67.17 %
bioshock-infinite 1769 -0.47 %
dirt-showdown 548 -2.68 %
dota2 1747 -3.36 %
f1-2015 776 -4.94 %
left_4_dead_2 1762 -0.07 %
metro_2033_redux 2670 -0.43 %
portal 474 -0.22 %
talos_principle 324 -3.63 %
warsow 176 -2.20 %
witcher2 1040 -73.78 %
----------------------------------------
All affected 991 -65.37 % ... 9681 -> 3353
----------------------------------------
Total 26725 -10.82 % ... 58490 -> 52162
v2: treat Undef as both 0 and 1
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com> (v1)
Tested-by: Edmondo Tommasina <edmondo.tommasina@gmail.com> (v1)
The table was copied from the Vulkan driver. The comment lines are as long
as the table for cosmetic reasons.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This is a serious performance fix. Discovered by luck.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94354
Cc: 12.0 <mesa-stable@lists.freedesktop.org>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
radeonsi no longer supports pixel shaders without interpolation optimizations,
which led to assertion failures in si_shader_ps when running shader-db.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Heaven and Valley write gl_SampleMask and not Z.
Use 16_ABGR instead of 32_ABGR if Z isn't written.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This cuts down the overhead of si_dump_shader when ddebug is capturing
shader logs, which is done for every draw call unconditionally (that's
quite a lot of work for a draw call).
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Since commit d938b8c, the sample locations are no longer set unconditionally,
so we need to set the atom to dirty on all chips, not just Polaris.
Cc: 12.0 <mesa-stable@lists.freedesktop.org>
Adds a second optional cleanup callback, called after the fence is
signaled. This is needed if, for example, the queue has the last
reference to the object that embeds the util_queue_fence. In this
case we cannot drop the ref in the main callback, since that would
result in the fence being destroyed before it is signaled.
Signed-off-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Otherwise, shader dumps can become interleaved and unusable.
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
We only have to stay single-threaded when debug output must be synchronous.
This yields better parallelism in shader-db runs for me.
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Unlike SC, the small primitive filter does not automatically use center
locations in 1xAA mode, so this is needed to avoid artifacts caused by
the small primitive filter discarding triangles that it shouldn't.
As a side effect of how the effective number of samples is now calculated,
this patch also avoids submitting the sample locations for line/poly smoothing
when they're not really needed.
Cc: 12.0 <mesa-stable@lists.freedesktop.org>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Main shader parts and geometry shaders are compiled asynchronously
by util_queue. si_create_shader_selector doesn't wait and returns.
si_draw_vbo(si_shader_select) waits for completion.
This has the best effect when shaders are compiled at app-loading time.
It doesn't help much for shaders compiled on demand, even though
VS+PS compilation should take as much as time as the bigger one of the two.
If an app creates more shaders, at most 4 threads will be used to compile
them.
Debug output disables this for shader stats to be printed in the correct
order.
(We could go even further and build variants asynchronously too, then emit
draw calls without waiting and emit incomplete shader states, then force IB
chaining to give the compiler more time, then sync the compilation at the IB
flush and patch the IB with correct shader states. This is great for
compilation before draw calls, but there are some difficulties such as
scratch and tess states requiring the compiler output, and an on-disk shader
cache will likely be a much better and simpler solution.)
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
to allow multiple shaders to be compiled simultaneously.
ALso, shader-db can again use all 4 cores.
v2: Remove the pipe_mutex_unlock call in the error path.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com> (v1)
The function interface is ready to be used by util_queue.
Also, si_shader_select_with_key can no longer accept si_context.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Getting LLVM IRs of hanging shaders have never been easier.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Handle the bc_optimize SGPR bit if both CENTER and CENTROID are enabled.
This should increase the PS launch rate for big primitives with MSAA.
Based on discussion with SPI guys.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This should increase the PS launch rate for shaders using at least 2 pairs
of perspective (i,j) and same for linear.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This yields a small performance improvement in Unigine Heaven.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
The R_028B50_VGT_TESS_DISTRIBUTION value is copied from
amdgpu-pro. Smaller values in the ACCUM fields seem to
decrease the performance advantage from this patch, higher
values don't seem to matter.
v2: Add distribution mode field enums.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
This allows running the TES on different CU's than the
TCS which results in performance improvements.
v2: Only write the control word from one invocation.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
We need to copy the VS outputs to memory. I decided to do this
using a shader key, as the value depends on other shaders.
I also switch the fixed function TCS over to monolithic, as
otherwisze many of the user SGPR's need to be passed to the
epilog, which increases register pressure, or complexity to
avoid that. The main body of the fixed function TCS is not
that interesting to precompile anyway, since we do it on
demand and it is very small.
v2: Use u_bit_scan64.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
The buffer is quite large, but should only be allocated if the
application uses tessellation. Most non-games don't.
v2: - Use the correct register for SI.
- Add define for block size.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
d3d 9 needs COLOR0 to be 1.0 on all channels when
undefined. 0.0 for the others is fine.
GL behaviour is undefined.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Experiments with framebuffer-no-attachments type draw calls have shown that
NULL exports stall terribly unless we ensure that export memory is allocated
by the SPI.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
- use an enum
- use a unique slot number regardless of the shader stage
(the per-stage slots will go away for RW buffers)
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
shader->config is not updated for compute kernels.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Tom Stellard <thomas.stellard@amd.com>
This fixes GS piglit failures after adding SI_PARAM_SHADER_BUFFERS,
which bumped NUM_USER_SGPRS and uncovered this bug on SI.
If this was fixed in LLVM, these workarounds wouldn't be needed.
LLVM would have to look at the calling convention to know how many SGPR
inputs are declared, and add VCC and the scratch wave offset (which is
enabled even if we spill SGPRs but not VGPRs, oh well).
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This allows compiling the main shader part as ES or LS.
If we get the correct hint, non-separable GLSL shaders no longer have to be
compiled as VS first, followed by LS or ES compiled on demand.
The result is that fewer shaders are compiled by piglit, but it doesn't
improve piglit running time.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This can increase perf for shaders that kill pixels (kill, alpha-test,
alpha-to-coverage).
v2: add comments
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Still disabled.
Only prologs & epilogs are compiled in draw calls, but each variant of those
is compiled only once per process.
VS is always compiled as hw VS.
TES is always compiled as hw VS.
LS and ES stages are always compiled on demand.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This fixes FP16 conversion instructions for VI, which has 16-bit floats,
but not SI & CI, which can't disable denorms for those instructions.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
It was partly a state and partly emulated by shader code, but since we want
to do this in a fragment shader prolog, we need to put it into the shader
key, which will be used to generate the prolog.
This also removes the spi_ps_input states and moves the registers
to the PS state.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
BCOLOR inputs were immediately after COLOR inputs. Thus, all following inputs
were offset by 1 if color_two_side was enabled, and not offset if it was not
enabled, which is a variation that's problematic if we want to have 1 variant
per shader and the variant doesn't care about color_two_side (that should be
handled by other bytecode attached at the beginning).
Instead, move BCOLOR inputs after all other inputs, so BCOLOR0 is at location
"num_inputs" if it's present. BCOLOR1 is next.
This also allows removing si_shader::nparam and
si_shader::ps_input_param_offset, which are useless now.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This fixes it.
States which also need to be taken into account:
- SPI color formats - each down-conversion format supports only a limited set
of SPI formats
- whether MSAA resolving and logic op are enabled
These need special handling:
- blending
- disabled channels
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The addition of spi_shader_col_format killed all color outputs
in precompiled shaders.
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com> (v1)
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com> (v1)
v2: also set the alpha func (trivial)
We now have an explicit parameter that contains the same information, and
this will allow us to get rid of is_gs_copy_shader in the si_shader struct.
Reviewed-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Specifically, when the API switches from using a GS to not using a GS and then
back to using the same GS again, we do not have to re-send all the GS state,
but we do have to send VGT_GS_MODE. So make VGT_GS_MODE consistently be a part
of the VS state.
This fixes a rendering bug in Dolphin, but surely other applications are
affected as well.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93648
Cc: "11.0 11.1" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
When a fragment shader is used that has no outputs but does conditional
discard (KILL_IF), all fragments are killed without this patch.
By comparing various register settings, my conclusion is that the exec mask
is either not properly forwarded to the DB by NULL exports or ends up being
unused, at least when there is _only_ a NULL export (the ISA documentation
claims that NULL exports can be used to override a previously exported exec
mask).
Of the various approaches I have tried to work around the problem, this one
seems to be the least invasive one.
v2: take discard by alpha test into account as well
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93761
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
because not using SPI_SHADER_32_ABGR doubles fill rate.
We should also get optimal performance if alpha isn't needed or blending
isn't enabled.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This does change the behavior slightly:
If a shader writes COLOR[i] and that color buffer isn't bound,
the shader will export MRT_NULL instead and discard the IR tree that
calculates the output. The only exception is alpha-to-coverage, which
requires an alpha export.
v2: - update a comment about 16BPC
- account for MRTZ when when fixing alpha-test/kill
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
I'm not sure about the consequences of this bug, but it's definitely
dangerous.
This applies to SI, CIK, VI.
Cc: 11.0 11.1 <mesa-stable@lists.freedesktop.org>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
There will be 1 config per variant, which will be a union of configs
from {prolog, main, epilog}. For now, just add the structure.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This doesn't fix a known bug, but better safe than sorry.
Also, simplify the expression in si_shader.c.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
because the API pixel shader binary will not emulate alpha test one day,
so the KILL_ENABLE bit must be determined elsewhere.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This will allow us to send shader debug info.
Reviewed-by: Edward O'Callaghan <eocallaghan@alterapraxis.com> (v1)
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
This changes the count slightly (because of si_generate_gs_copy_shader), but
this is only relevant for the driver-specific num-compilations query. It sets
the stage for the next commit.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Use NULL tests of the form `if (ptr)' or `if (!ptr)'.
They do not depend on the definition of the symbol NULL.
Further, they provide the opportunity for the accidental
assignment, are clear and succinct.
Signed-off-by: Edward O'Callaghan <eocallaghan@alterapraxis.com>
Signed-off-by: Marek Olšák <marek.olsak@amd.com>
This reduces the shader key for ES.
Use a fixed attrib location based on (semantic name, index).
The ESGS item size is determined by the physical index of the highest ES
output, so it's almost always larger than before, but I think that
shouldn't matter as long as the ESGS ring buffer is large enough.
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
I discovered that increasing the ESGS ring size fixes GS hangs on Tonga,
so let's do it properly.
There is now a separate init_config_gs_rings state that is not immutable,
because GS rings are resized when needed.
This also saves some memory. Most apps won't need more than 1MB
per ring per shader engine.
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The "current" shader pointer is moved from the CSO to the context, so that
the CSO is mostly immutable.
The only drawback is that the "current" pointer isn't saved when unbinding
a shader and it must be looked up when the shader is bound again.
This is also a prerequisite for multithreaded shader compilation.
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>