Changes:
- disallow NGG culling for GS, fast launch for tess using template args
(GS can't do NGG culling, tess can't do fast launch)
- skip checking current_rast_prim with tessellation
(bake the condition into ngg_cull_vert_threshold)
- use only 1 vertex count threshold for enabling NGG shader culling
to simplify it. I think it doesn't have a big impact. The threshold
computation depends on more parameters than just fast launch.
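
A minimal sketch of the single-threshold check described above. Only the name ngg_cull_vert_threshold comes from this message; the struct, the helper names, and the use of UINT_MAX as a "never cull" value are assumptions made for illustration, not the driver's actual code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-shader state. */
struct shader_sel_sketch {
   unsigned ngg_cull_vert_threshold; /* UINT_MAX = never cull (assumption) */
};

/* Assumed setup step: fold "can this shader ever cull?" into the
 * threshold so that the draw path does not re-check the condition. */
static void init_cull_threshold(struct shader_sel_sketch *sel, bool can_cull,
                                unsigned min_verts)
{
   assert(sel != NULL);
   sel->ngg_cull_vert_threshold = can_cull ? min_verts : UINT_MAX;
}

/* Draw-time check: a single comparison against one threshold. */
static bool use_ngg_culling(const struct shader_sel_sketch *sel,
                            unsigned total_vertex_count)
{
   return total_vertex_count > sel->ngg_cull_vert_threshold;
}
```

With this layout, a shader that can never cull simply never passes the comparison, whatever the draw size.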
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8434>
It didn't do anything useful. GS doesn't use the other user SGPRs.
If we decrease the number of user SGPRs we declare for the GS prolog,
we can remove gfx9_prev_is_vs.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8344>
LLVM expects that exec != 0 when entering loops and generates this code
that becomes an infinite loop if exec == 0:
BB5_1:
vcc_lo = (inverted terminating condition)
s_and_b32 vcc_lo, exec_lo, vcc_lo
s_cbranch_vccnz BB5_3 // jump if vcc != 0 (break statement)
// ... loop body ...
s_branch BB5_1
BB5_3:
For non-monolithic VS before TCS, VS before GS, and TES before GS,
we set exec = (enabled thread mask), which is 0 for HS-only and GS-only
waves, causing the infinite loop described above.
Fix it as follows:
- set exec = ~0 at the beginning
- wrap the whole shader (LS and ES) in a conditional block, so that HS-only
and GS-only waves jump over it and never enter such a loop
The TES before GS hang can be reproduced by gfxbench:
testfw_app --gfx egl -w 1920 -h 1080 --gl_api gles -t gl_tess
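
The hang and the fix can be modeled on the CPU. This is only a sketch of the loop-header pattern from the listing above for a 32-lane wave: the function names, the iteration cap, and the way the condition bits become set are assumptions for illustration, and the code does not come from the driver or from LLVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Models the loop header. The condition bits are assumed to become set in
 * all lanes once the iteration count reaches iters_until_cond. max_iters
 * is a safety cap so that the simulation itself always terminates.
 * Returns true if the loop leaves through the break branch. */
static bool loop_exits(uint32_t exec, unsigned iters_until_cond,
                       unsigned max_iters)
{
   assert(max_iters > 0);
   for (unsigned i = 0; i < max_iters; i++) {
      uint32_t vcc = i >= iters_until_cond ? UINT32_MAX : 0;
      vcc &= exec;   /* s_and_b32 vcc_lo, exec_lo, vcc_lo */
      if (vcc != 0)  /* s_cbranch_vccnz BB5_3 */
         return true;
      /* ... loop body ..., then s_branch BB5_1 */
   }
   return false; /* cap reached: real hardware would spin forever */
}

/* Models the fix: waves with no enabled threads jump over the whole
 * first stage, so the loop only runs with a non-zero exec. */
static bool run_first_stage(uint32_t enabled_mask, unsigned iters_until_cond,
                            unsigned max_iters)
{
   if (enabled_mask == 0)
      return true; /* HS-only or GS-only wave: stage skipped */
   return loop_exits(enabled_mask, iters_until_cond, max_iters);
}
```

Calling loop_exits() with exec == 0 never takes the break branch, which is the hang; run_first_stage() never reaches the loop in that case.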
Fixes: 68d6d097f1 - radeonsi/gfx9: add GFX9 and VEGA10 enums
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8344>
Enable VRS 2x2 coarse shading when flat shading is used, following the
idea and guidance given by Marek.
The is_flat_shading variable in struct si_shader_info is set
based on the data from the gather_intrinsic_info() function
and struct si_state_rasterizer. If is_flat_shading
is set, si_emit_db_render_state() enables VRS 2x2
shading in hardware.
v2: Fix review comments from Pierre-Eric. Code optimizations.
v3: Fix indentation style issue.
v4: Fix review comments from Marek. Fix a logic issue pointed out
by Marek where the info->is_flat_shading variable could be corrupted,
plus other code cleanup.
v5: Make the code compact as suggested by Pierre-Eric.
v6: Fix new review comments from Marek.
v7: use info->uses_interp_color variable fix from Marek.
v8: Fix coding style comment from Marek.
v9: Add uses_fbfetch_output check as suggested by Marek.
Signed-off-by: Yogesh Mohan Marimuthu <yogesh.mohanmarimuthu@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8161>
Fixes:
- Sample shading now uses per-sample interpolation for colors if colors
are the only inputs. (this is the only case that was broken)
Optimizations:
- BC_OPTIMIZE (barycentric optimization) is now enabled with MSAA if colors
are qualified with both center and centroid. (BC_OPTIMIZE means that
the hardware skips initializing centroid (i,j) if they are equal to
center (i,j))
- If MSAA is disabled and at least 2 out of (center, centroid, sample) are
used by all inputs, now including colors, center is forced for all inputs.
- If INTERP_MODE_COLOR is not used and the legacy GL shade model is flat,
the shader variant for flat shading is not generated.
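
The MSAA-disabled rule from the list above can be sketched as a small helper. The enum values and the function name are hypothetical and only illustrate the rule as worded here; the driver's real logic handles more state than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit flags for the interpolation locations used by inputs. */
enum {
   LOC_CENTER = 1u << 0,
   LOC_CENTROID = 1u << 1,
   LOC_SAMPLE = 1u << 2,
};

/* Returns the set of locations to interpolate at. Without MSAA, all three
 * locations are the same point, so if at least two of them are used, they
 * are collapsed into center. */
static unsigned resolve_interp_locations(unsigned used, bool msaa_enabled)
{
   assert((used & ~(LOC_CENTER | LOC_CENTROID | LOC_SAMPLE)) == 0);

   unsigned count = !!(used & LOC_CENTER) + !!(used & LOC_CENTROID) +
                    !!(used & LOC_SAMPLE);

   if (!msaa_enabled && count >= 2)
      return LOC_CENTER;
   return used;
}
```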
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8225>
This increases performance for indexed triangle strips by up to 100%.
In practice, it's limited by memory bandwidth and compute power,
so a 256-bit memory bus and a lot of CUs are recommended.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7681>
It can only be done if a TCS input is accessed without indirect indexing and
with gl_InvocationID as the vertex index, and the number of VS and TCS threads
is the same.
This eliminates LDS stores and loads for VS->TCS IO, reducing shader lifetime
and LDS traffic.
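
The three conditions above can be written as one predicate. This is a sketch only: the struct, its fields, and the function name are hypothetical and do not mirror the driver's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of how the TCS reads one of its inputs. */
struct tcs_input_access_sketch {
   bool indirect_indexing;              /* input index is not a constant */
   bool vertex_index_is_invocation_id;  /* vertex index is gl_InvocationID */
};

/* True if the input meets all conditions for bypassing LDS: direct
 * indexing, gl_InvocationID as the vertex index, and equal VS and TCS
 * thread counts. */
static bool can_bypass_lds(const struct tcs_input_access_sketch *access,
                           unsigned num_vs_threads, unsigned num_tcs_threads)
{
   assert(access != NULL);
   return !access->indirect_indexing &&
          access->vertex_index_is_invocation_id &&
          num_vs_threads == num_tcs_threads;
}
```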
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7623>
Instead of:
if (VS) {
VS;
}
if (TCS) {
TCS;
}
Do this if the number of threads is the same in VS and TCS:
exec = enabled_threads;
VS;
TCS;
Skipping declare_vb_descriptor_input_sgprs is needed to match the VS return
values.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7623>
This improves performance for uber shaders.
It must be enabled using the new driconf option.
The driver compiles the specialized shaders in another thread without stalls,
same as all other optimizations.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7057>
If we store the position into LDS after we know the new thread ID,
we don't need to remember the old thread ID.
The culling code only needs W, X/W, Y/W, so we have to keep those.
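
As an illustration of the three values kept for culling, here is a sketch that derives them from a clip-space position. The struct and function names are hypothetical, and the sketch assumes W is non-zero; it says nothing about how the driver lays the data out in LDS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the values the culling code needs. */
struct cull_pos_sketch {
   float w;
   float x_div_w;
   float y_div_w;
};

/* clip_pos holds (X, Y, Z, W). Z is not needed and is dropped. */
static struct cull_pos_sketch pack_cull_pos(const float clip_pos[4])
{
   assert(clip_pos != NULL);
   assert(clip_pos[3] != 0.0f); /* assumption of this sketch */

   struct cull_pos_sketch out;
   out.w = clip_pos[3];
   out.x_div_w = clip_pos[0] / clip_pos[3];
   out.y_div_w = clip_pos[1] / clip_pos[3];
   return out;
}
```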
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7172>
Add a vertex count threshold into si_shader_selector to simplify
the draw_vbo code.
The new option is supposed to be used in 00-mesa-defaults.conf and should be
tweaked for best performance, unlike the AMD_DEBUG experimental options.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6948>
This removes type conversions from 16 bits to 32 bits in the main function
and then back to 16 bits in the epilog.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6622>
Fixed-func shaders can contain the output, because their generator
doesn't take the current primitive type into account.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6620>
Input and output info is gathered from intrinsics. nir_variables are
ignored (and we'll remove them anyway).
This is a prerequisite for ACO, but also makes the IR prettier.
The ac_nir_to_llvm change has to be in this commit.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6445>
Only non-indexed triangle lists and strips are supported. This increases
performance if there is something to cull.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
The value is not changed. I just use a different way to compute it.
The value will vary with NGG culling.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
This decreases memory usage, because serialized NIR is more compact.
The main shader part is compiled from nir_shader.
Monolithic shader variants are compiled from nir_binary.
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
We need two different values of the register, one for NGG and one for
legacy, in order to fix edge flags for the legacy pipeline.
Passing the ngg flag to emit_clip_regs would be too complicated,
so CONTEXT_REG_RMW is used for partial register updates.
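
The arithmetic of a partial register update can be sketched as follows. The helper only models the masked read-modify-write on the CPU; it is not the packet encoding, and the function name and the masks in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Replaces only the bits selected by mask and leaves the other bits of the
 * register untouched. value must not contain bits outside mask. */
static uint32_t reg_rmw(uint32_t old_value, uint32_t mask, uint32_t value)
{
   assert((value & ~mask) == 0);
   return (old_value & ~mask) | (value & mask);
}
```

Two code paths that own disjoint masks can then update the same register independently, for example reg_rmw(reg, 0xF0, 0x50) followed by reg_rmw(reg, 0x0F, 0x03), without either path knowing the other's bits.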
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Lowering PS inputs can eliminate some of them, which messes up
persp/linear barycentric coord usage info.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Legacy GS has to use Wave64, so TES before GS has to use Wave64 too.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
as_ngg is required by Wave32.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
This can decrease LDS and/or memory usage for shader outputs when geometry
shaders or tessellation is used.
Only PS inputs support higher indices and those aren't eliminated by
kill_outputs.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Acked-by: Dave Airlie <airlied@redhat.com>
- don't pass it via a parameter if it can be derived from other parameters
- set shader_type for ac_rtld_open
- use enum pipe_shader_type instead of unsigned
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Acked-by: Dave Airlie <airlied@redhat.com>
We need to tell PA to accept edge flags generated by the input assembler,
because decomposed primitives shouldn't draw inner edges.
Acked-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
With NGG, the VGT_GS_OUT_PRIM_TYPE can change without a shader change.
The VS_STATE is required for both streamout and culling from a vertex
shader without pre-compiling outprim-specific variants.
We could consider compiling specialized variants in the future. We
could also consider compiling the NGG logic as an epilog.
Acked-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>