Right now if the shader indirects on some large constant array, we see NIR
load_consts (usually from the const file) of its contents into general
registers, then indirection on the GPRs. This often results in register
allocation failures, as it's easy to go beyond the ~256 dwords of
registers per invocation.
By moving the large constants to a UBO, we can load an arbitrary number of
them. They also can be theoretically moved to the constant reg file (~2k
dwords), though you're unlikely to hit this path without an indirect load
on your large constant, and we don't yet let UBO indirect loads get moved
to constant regs.
This possibly won't work out right if we have 16-bit load_constants, but
without other MRs in flight we won't see 16-bit temps to be lowered to
this.
This allows 2 kerbal-space-program shaders to compile that previously
would fail, and fixes the new dEQP-VK and -GLES2 tests I wrote that
dynamically index a 40-element temporary array of float/vec2/vec3/vec4
with constant element initializers.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/2789
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5810>
With indirect load_uniform, we can only encode 10b of constant base
offset. This pass detects problematic cases and peels out the high
bits of the base offset.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7612>
Clip & cull distances, which are compact arrays, exposed a lot of holes
because they can take up multiple slots and partially overlap.
I wanted to eliminate our dependence on knowing the layout of the
variables, as this can get complicated with things like partially
overlapping arrays, which can happen with ARB_enhanced_layouts or with
clip/cull distance arrays. This means no longer changing the layout
based on whether the i/o is part of an array or not, and no longer
matching producer <-> consumer based on the variables. At the end of the
day we have to match things based on the user-specified location, so for
simplicity this switches the entire i/o handling to be based off the
user location rather than the driver location. This means that the
primitive map may be a little bigger, but it reduces the complexity
because we never have to build a table mapping user location to driver
location, and it reduces the amount of work done at link time in the SSO
case. It also brings us closer to what the other drivers do.
While here, I also fixed the handling of component qualifiers, which was
another thing broken with clip/cull distances.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6959>
The next step is to hook this into pscreen->finalize_nir() so it can
come before the state tracker's shader-caching.
Unfortunately we still need to do lower_io after mesa/st, so that is
split out into a post-finalize pass.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5372>
Since binning pass variants share the same const_state with their
draw-pass counterpart, we should re-use the draw-pass variant's ubo
range analysis. So split the two functions of the existing pass
into two parts.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5526>
For shader-cache, we want to not have anything important in `ir3_shader`.
And to have shader variants with lower const size limits (to properly
handle cross-stage limits), we also want variants to be able to have
their own const_state.
But we still need binning pass shaders to align with their draw pass
counterpart so that the same const emit can be used for both passes.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5508>
It seems a lot of the lowerings being run the second time were
unnecessary. In addition, when const_state is moved to the variant,
then it will become impossible to know ahead of time whether a variant
needs additional optimizing, which means that ir3_key_lowers_nir() needs
to go away. The new approach should have the same effect, since it skips
running lowerings that are unnecessary and then skips the opt loop if no
optimizations made progress, but it will work better when we move
ir3_nir_analyze_ubo_ranges() to be after variant creation.
The one maybe controversial thing I did is to make
nir_opt_algebraic_late() always happen during variant lowering. I wanted
to avoid code duplication, and it seems to me that we should push the
_late variants as far back as possible so that later opt_algebraic runs
don't miss out on optimization opportunities.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5508>
It saves addressing math, but may cause multiple loads to be done and
bcseled due to NIR not giving us good address alignment information
currently. I don't have any workloads I know of using non-const-uploaded
UBOs, so I don't have perf numbers for it
This makes us match the GLES blob's behavior, and turnip (other than being
bindful).
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4858>
We mostly got away with replacing a store_output with a store_var, but
for complex types like structs, that doesn't work. Once the IO has
been lowered from vars to intrinsic, we've lost the deref chains and
can't properly shadow the outputs.
This commits moves the GS lowering up so we do it before the output
variables get lowered to store_output. This way the pass works much
like nir_lower_io_to_temporaries() and cleanly shadows the outputs.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4562>
This pass lowers per-vertex input intrinsics to load_shared_ir3. This
was open coded in the TCS and GS lowering passes before - this way we
can share it. Furthermore, we'll need to run the rest of the GS
lowering earlier (before lowering IO) so we need to split off this
part that operates on the IO intrinsics first.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4562>
This commit updates tu6_emit_vpc to selectively emit GS-specifc
configuration. Most of this is repurposed from fd6_program.c.
This also refactors `link_geometry_stages` to ir3_nir_lower_tess.c
so it can be shared between fd and tu.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4436>
VS and TCS pass varyings the same way as VS and GS does. TCS then
writes entire patch to a system memory BO and TES eventually reads
back from the BO once the TE starts generating vertices. TES outputs
vertices the same way as VS and GS, except when there's a GS as well,
in which case TES passes varyings to GS same way the VS would.
In addition, the TCS needs a little bit of control flow massaging so
that it only runs for valid invocations needs a couple of unknown
instructions to synchronize with the TE.
Signed-off-by: Kristian H. Kristensen <hoegsberg@google.com>
Acked-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Rob Clark <robdclark@gmail.com>
The pass should run once at the end of shader compilation, for a4xx
onwards. It iterates texture sampling instructions and mark those
eligibile for pre-dispatch by changing the tex op from 'tex' to
'tex_prefetch'. An instruction is eligibile if:
* The coordinate is a vector where all its components come from a
shader input.
* The order of the components match exactly that of the input (no
swizzles).
* The instruction is in the 'main' function, and in the outer
most-block.
The first two restrictions were arrived to empirically, so more
testing could tighten or loosen it.
The 3rd restriction is there to allow moving the instructions
eligible for pre-dispatch to the beginning of the shader, so
that we don't block the registers holding the result for too
long.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
This introduces two new lowering passes. One to lower VS to explicit
outputs using STLW and one to lower GS to load input using LDLW and
implement the GS specific functionality.
Signed-off-by: Kristian H. Kristensen <hoegsberg@google.com>
We only ever return the shader we were passed in (but internally
modified).
Reviewed-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Currently, ir3 backend compiler is lowering integer multiplication from:
dst = a * b
to:
dst = (al * bl) + (ah * bl << 16) + (al * bh << 16)
by emitting this code:
mull.u tmp0, a, b ; mul low, i.e. al * bl
madsh.m16 tmp1, a, b, tmp0 ; mul-add shift high mix, i.e. ah * bl << 16
madsh.m16 dst, b, a, tmp1 ; i.e. al * bh << 16
which at that point has very low chances of being optimized.
This patch adds a new nir_algebraic.AlgebraicPass to performs this
lowering during NIR algebraic optimization passes, giving it a better
chance for optimizing the resulting code.
Reviewed-by: Eric Anholt <eric@anholt.net>
For a6xx, we construct/emit a single VS const state used for both
binning pass and draw pass. So far we were mostly getting lucky that
there were not (obvious) mismatches between the const_state (like
different lowered immediates) between the binning and draw pass
VS ir3_shader_variant.
And I guess this situation will come up more as GS and tess is added
into the equation.
Since really everything about the const state is not specific to the
variant, move this. The main exception is lowered immediates, but these
are the last to appear in the layout, and it doesn't hurt for each new
shader variant to just append any immed's it lowers to the end of the
immediate state.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Next patch moves const_state to ir3_shader, before the compile context
is created. So move the code around in prep to call it earlier.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Combine the offsets of differenet parts of the constant space with (what
was formerly known as) ir3_driver_const_layout. Bunch of churn, but no
functional change.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Calculates i,j at specified offset within a pixel. A new load_size_ir3
intrinsic is used in conjunction with fddx/fddy to translate the offset
into primitive space and adjust the i,j from load_barycentric_pixel
accordingly.
Signed-off-by: Rob Clark <robdclark@chromium.org>
This commit turns on the gallium cap and adds a pass to lower the
load_ubo intrinsics for block 0 back to load_uniform intrinsics and
adjust the backend where the cap switches units from vec4s to dwords.
As we stop using ir3_glsl_type_size() for uniform layout, this also
corrects an issue where we would allocate a vec4 slot for samplers in
uniforms, fixing:
dEQP-GLES3.functional.shaders.struct.uniform.sampler_array_fragment
dEQP-GLES3.functional.shaders.struct.uniform.sampler_array_vertex
dEQP-GLES3.functional.shaders.struct.uniform.sampler_nested_fragment
dEQP-GLES2.functional.shaders.struct.uniform.sampler_nested_vertex
dEQP-GLES2.functional.shaders.struct.uniform.sampler_nested_fragment
Signed-off-by: Kristian H. Kristensen <hoegsberg@chromium.org>
Reviewed-by: Rob Clark <robdclark@gmail.com>
This NIR->NIR pass implements offset computations that are currently
done on the IR3 backend compiler, to give NIR a better chance of
optimizing them.
For now, it supports lowering the dword-offset computation for SSBO
instructions. It will take an SSBO intrinsic and replace it with the
new ir3-specific version that adds an extra source. That source will
hold the SSA value resulting from inserting a division by 4 (an SHR op)
of the original byte-offset source already provided by NIR in one of
the intrinsic sources.
Note that on a6xx the original byte-offset is not needed, so we could
potentially replace that source instead of adding a new one. But to
keep things simple and consistent we always add the new source and
a6xx will just ignore the original one.
Reviewed-by: Rob Clark <robdclark@gmail.com>
Move (most of) the ir3 compiler to src/freedreno/ir3 so that it can be
re-used by some future vulkan driver. The parts that are gallium
specific have been refactored out and remain in the gallium driver.
Getting the move done now so that it can happen before further
refactoring to support a6xx specific instructions.
NOTE also removes ir3_cmdline compiler tool from autotools build since
that was easier than fixing it and I normally use meson build. Waiting
patiently for the day that we can remove *everything* from the autotools
build.
Signed-off-by: Rob Clark <robdclark@gmail.com>
2018-11-27 15:44:02 -05:00
Renamed from src/gallium/drivers/freedreno/ir3/ir3_nir.h (Browse further)