Only strip leading underscores if there's also a trailing @
Fixes shared-glapi symbol check for x64
Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12881>
Now that they're no longer ralloc'd, we have to be much more careful
about indirects. We have to make sure every time a source or
destination is overwritten, its indirect (if any) is freed. We also
have to choose a memory ownership convention for the rewrite functions.
Assuming that they will be called with the source from some other
instruction, we choose to always make a copy of the indirect (if any).
It's the responsibility of the caller to ensure its copy of the indirect
is freed.
Unfortunately, all this extra logic is going to make
nir_instr_rewrite/move_src/dest more expensive because they now have
all the logic of nir_src/dest_copy instead of a simple struct
assignment. Fortunately, the vast majority of rewrite calls are done by
nir_ssa_def_rewrite_uses which is an SSA-only fast-path.
Fixes: 879a569884 "nir: Switch from ralloc to malloc for NIR instructions."
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12884>
PSIZ output is only needed when:
1. There is a next stage and it reads it.
2. Primitive topology is point list, in the last vertex pipeline stage.
Zink always adds this output in its vertex (and other) shaders,
because it helps Zink avoid recompiling shader variants.
However, this has a performance impact for RADV because
it needs a scalar memory load. That becomes noticeable
at high primitive rates.
The Fossil stats are unremarkable because our DB doesn't include any
shaders from Zink or D9VK, but there are a few affected shaders.
Note that there may be an increase in LDS use in some GS. This is
because with PSIZ removed the ES per-vertex LDS size is smaller, so
we can squeeze more GS threads in the same workgroup.
Fossil DB stats on Sienna Cichlid:
Totals from 14 (0.01% of 128647) affected shaders:
CodeSize: 119884 -> 119732 (-0.13%)
LDS: 235008 -> 228864 (-2.61%); split: -2.83%, +0.22%
Instrs: 23076 -> 23048 (-0.12%)
Latency: 71667 -> 71625 (-0.06%)
InvThroughput: 19155 -> 18870 (-1.49%)
Copies: 1586 -> 1572 (-0.88%)
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-By: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10725>
The fails will be addressed later.
This adds a fail in GLSL compiler that is due to a workaround
that fails when fp16 constants are lowered
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11816>
Otherwise the inf translations don't seem to work, and the VK CTS
fails
Fixes VK CTS dEQP-VK.spirv_assembly.instruction.graphics.float16.arithmetic*
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11816>
Some VK CTS tests are topping this out around 76, increase it to 80 for now.
Fixes:
dEQP-VK.spirv_assembly.instruction.graphics.float16.opvectorshuffle.*44*
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11816>
This is an initial patch that is needed for OpenCL and Vulkan
support for proper 16-bit floats.
This doesn't enable the cap bit yet
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11816>
this should be significantly more performant for the majority of cases
since it's rare that shaders have multiple variants outside of unit tests,
so now there can just be a list of shaders being iterated instead where the
first entry is the last used
Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12842>
Get comfy.
llvmpipe coroutines have a stack frame. This is created by hooking
in malloc and coro.alloc and coro.size intrinsics.
LLVM has an CoroElide pass that is meant to allow that stack frame
to be done as an alloca in the caller instead of using the malloc path.
The CoroElide pass relies on the coroutine being inlined (fixed that).
The CoroElide pass relies on there being a direct connect between
coro.destroy(i8 *arg) and arg = coro.begin(id). However due to the
way the compute shaders are launched, there is no way to ensure that
link. Fixing the CoroElide pass seems quite difficult, I considered
having a force CoroElide always flag to make it dtrt, however I'm not
sure how ugly that would end up.
My first attempt tried to preallocate the stacks at a fixed size,
this turned out to be naive as the stack frame size was not sized
like I expected. Instead the first coro to run allocs enough for
everyone, so avoid the massive amounts of small allocations.
This remove coro malloc from a lot of profiles and shaves another 30s
or so from OpenCL ./conversions/test_conversions uchar_uin
(from 4.40m to just under 4m on my ryzen 7 1800x)
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12432>
This helps reduced the mtx lock/unlock overheads for the threadpool
if the work evenly distributes across the number of threads.
The CL CTS conversions tests really hit this, and this takes maybe 10-20s
off a 5min test run.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12432>
the mask can't entirely be calculated based on the integer parameters,
as it's possible for some of the "bind" slots to actually be unbinds,
so remove bits as necessary to fix this
also add some debug asserts to ensure I don't break this again for the
tenth time
Fixes: 6dd02a5139 ("zink: stop using util_set_vertex_buffers_mask()")
Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12871>
Although the index has to be dynamically uniform, if we don't ever
execute a few lanes then we'll have 0, so it important to read the
ssbo index from the first active lane.
Just loop over them all.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12689>