The Gallium Nine state tracker now works on iris.
Also tested with GALLIUM_HUD and Star Wars: Knights of the Old
Republic on WINE (GL_ATI_fragment_shader).
Signed-off-by: Andre Heider <a.heider@gmail.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Fixes a leak:
==7576== 320 (48 direct, 272 indirect) bytes in 1 blocks are definitely lost in loss record 26 of 26
==7576== at 0x4C2EE3B: malloc (vg_replace_malloc.c:309)
==7576== by 0x53EF0E4: ralloc_size (ralloc.c:119)
==7576== by 0x53EF0C2: ralloc_context (ralloc.c:113)
==7576== by 0x5471F64: nir_split_per_member_structs (nir_split_per_member_structs.c:176)
==7576== by 0x51288CF: anv_shader_compile_to_nir (anv_pipeline.c:216)
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Fixes leaks from anv_device_upload_nir:
==7345== 8,192 bytes in 2 blocks are definitely lost in loss record 24 of 24
==7345== at 0x4C2ED78: malloc (vg_replace_malloc.c:308)
==7345== by 0x4C31393: realloc (vg_replace_malloc.c:836)
==7345== by 0x54E0848: grow_to_fit (blob.c:67)
==7345== by 0x54E0BE5: blob_reserve_bytes (blob.c:166)
==7345== by 0x54E0C7C: blob_reserve_intptr (blob.c:186)
==7345== by 0x54704A7: nir_serialize (nir_serialize.c:1091)
==7345== by 0x512F97D: anv_device_upload_nir (anv_pipeline_cache.c:756)
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Use anv_gem_munmap for unmap when softpin in use, this corresponds to
anv_gem_mmap used in anv_block_pool_expand_range. This fixes valgrind
errors seen for each pool when softpin is in use:
==25581== 262,144 bytes in 1 blocks are definitely lost in loss record 31 of 31
==25581== at 0x50E77E8: anv_gem_mmap (anv_gem.c:96)
==25581== by 0x50EEE2B: anv_block_pool_expand_range (anv_allocator.c:543)
==25581== by 0x50EEB51: anv_block_pool_init (anv_allocator.c:477)
==25581== by 0x50EF7EF: anv_state_pool_init (anv_allocator.c:920)
==25581== by 0x510B8EB: anv_CreateDevice (anv_device.c:2031)
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
The NIR and TGSI paths are currently intertwined which makes it
not only hard to follow but also makes it hard to take advantage
of the differences in IR.
Here we take the first step to splitting that path apart. With
this we take the opportunity to no longer call the GLSL IR
optimisation passes after the final lowering calls for NIR. We
can instead just use the NIR passes which can produce better code
and should also result in faster compile times.
The speed-up can be measured in some dolphin uber shaders due to
no longer calling lower_if_to_cond_assign() for example
dolphin/ubershaders/120.shader_test goes from ~1.63 -> ~1.53
seconds on my machine.
There are some code changes as a result of not calling
lower_if_to_cond_assign(), this is because it flattens ifs that
contain UBOs where as NIR's peephole select doesn't. This is
were most of the regressions in Max Waves happens with shader-db.
shader-db results (VEGA):
Totals from affected shaders:
SGPRS: 2349056 -> 2349640 (0.02 %)
VGPRS: 1322160 -> 1323300 (0.09 %)
Spilled SGPRs: 21190 -> 21527 (1.59 %)
Spilled VGPRs: 99 -> 99 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 72 -> 72 (0.00 %) dwords per thread
Code Size: 57260904 -> 57270932 (0.02 %) bytes
Compile Time: 1107186 -> 1022942 (-7.61 %) milliseconds
LDS: 786 -> 786 (0.00 %) blocks
Max Waves: 391932 -> 391619 (-0.08 %)
Wait states: 0 -> 0 (0.00 %)
Reviewed-by: Eric Anholt <eric@anholt.net>
glsl_to_nir() is still missing support for converting certain
functions to NIR, so for those we use the GLSL IR optimisations
to remove the functions.
Reviewed-by: Eric Anholt <eric@anholt.net>
v2: reimplement layer chain info getters (Eric)
v3: make it compile.. (Lionel)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
This will be used by the overlay instead of system installed
validation layers helpers.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Eric Engestrom <eric.engestrom@intel.com>
If Vertex Shader uses EdgeFlag the hardware request that it is setup
as the last VERTEX_ELEMENT_STATE. If SGVS are add at draw time we
need to also reconfigure the last 3DSTATE_VF_INSTANCING so its
VertexElementIndex points to the new Vertex Element that contains
the EdgeFlag.
So if draw parameters or edgeflag are not used the CSO generated at
iris_create_vertex_element is sent directly in the batches. But if
edge flag is used we adjust last VERTEX_ELEMENT_STATE and
last 3DSTATE_VF_INSTANCING using their alternative edge flag version
we generate at iris_create_vertex_element and store at the CSO.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
You usually want to go find the highest pressure and figure out why you
couldn't spill or what pattern led to a bunch of pressure leading to that
point.
Fixes: 58bcebd987 ("spirv: Allow [i/u]mulExtended to use new nir opcode")
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Now that we have a loop unrolling cost function and loop unrolling isn't
going to kill us the moment we have a 64-bit op in a loop, we can go
ahead and move 64-bit lowering later. This gives us the opportunity to
do more optimizations and actually let the full optimizer run even on
64-bit ops rather than hoping one round of opt_algebraic will fix
everything. This substantially reduces both fp64 shader compile times
and the resulting code size.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Now that we have a loop unrolling cost function and loop unrolling isn't
going to kill us the moment we have a 64-bit op in a loop, we can go
ahead and move 64-bit lowering later. This gives us the opportunity to
do more optimizations and actually let the full optimizer run even on
64-bit ops rather than hoping one round of opt_algebraic will fix
everything. This substantially reduces both fp64 shader compile times
and the resulting code size. On the vs-isnan-dvec test from piglit:
Before this commit:
1684.63s user 17.29s system 99% cpu 28:28.24 total
101479 instructions. 0 loops. 802452 cycles. 79:369 spills:fills.
Peak memory usage (according to massif): 1.435 GB
After this commit:
179.64s user 7.75s system 99% cpu 3:07.92 total
57316 instructions. 0 loops. 459287 cycles. 0:0 spills:fills.
Peak memory usage (according to massif): 531.0 MB
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Instead of trusting the caller to already have created a softfp64
function shader and added all its functions to our shader, we simply
take the softfp64 shader as an argument and do the function inlining
ouselves. This means that there's no more nasty functions lying around
that the caller needs to worry about cleaning up.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This pulls the guts of function inlining into a builder helper so that
it can be used elsewhere. The rest of the infrastructure is still
needed for most inlining cases to ensure that everything gets inlined
and only ever once. However, there are use-cases where you just want to
inline one little thing. This new helper also has a neat trick where it
can seamlessly inline a function from one nir_shader into another.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This doesn't really change anything as the functions will all get
inlined anyway. However it does let us do a bit of the work earlier and
in a common place.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Even though this is technically a step in the function inlining process
as laid out in nir_inline_functions.c, it's not really needed. We
already have constant initializers lowered here and no new ones are
added by appending the softfp64 functions.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Instead of looking the devinfo directly, look at the lowering options we
provided to NIR. This is more accurate as it's now checking for "do we
need full software lowering" rather than a hardware bit.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The lowering we do for 64-bit instructions can cause a single NIR ALU
instruction to blow up into hundreds or thousands of instructions
potentially with control flow. If loop unrolling isn't aware of this,
it can unroll a loop 20 times which contains a nir_op_fsqrt which we
then lower to a full software implementation based on integer math.
Those 20 invocations suddenly get a lot more expensive than NIR loop
unrolling currently expects. By giving it an approximate estimate
function, we can prevent loop unrolling from going to town when it
shouldn't.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
We already have one internally for int64 but we don't have a similar one
for doubles so we'll have to make one.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
This is set to True only for numeric conversion opcodes.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
In the old code, we would generate the exact same instruction for
extract_u8(some_u64, 0) and extract_u8(some_u64, 1). The mask-a-word
trick only works for even numbered bytes.
This fixes the (new) piglit test
tests/spec/arb_gpu_shader_int64/execution/fs-ushr-and-mask.shader_test.
v2: Use a SHR instead of an AND. This saves an instruction compared to
using two moves. Suggested by Jason.
Fixes: 6ac2d16901 ("i965/fs: Fix extract_i8/u8 to a 64-bit destination")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The parameter is never used, and it's not part of a common interface
idiom. Remove it.
src/intel/compiler/brw_interpolation_map.c: In function ‘brw_setup_vue_interpolation’:
src/intel/compiler/brw_interpolation_map.c:62:59: warning: unused parameter ‘devinfo’ [-Wunused-parameter]
const struct gen_device_info *devinfo)
^~~~~~~
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
In 61e009d2c4 we changed the number of components in the
vulkan_resource_index intrinsic and forgot the update Radv's code for
it.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 61e009d2c4 ("spirv: Use the same types for resource indices as pointers")
Reviewed-by: Samuel Pitoiset samuel.pitoiset@gmail.com
Not complete, mostly just adding things as I encounter them in CTS. But
not getting far enough yet to hit most of the OpenCL.std instructions.
Anyway, this is better than nothing and covers the most common builtins.
v2: add hadd proof from Jason
move some of the lowering into opt_algebraic and create new nir opcodes
simplify nextafter lowering
fix normalize lowering for inf
rework upsample to use nir_pack_bits
add missing files to build systems
v3: split lines of iadd/sub_sat expressions
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
v2: use formula with fewer operations
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
v2: add assert in else clause
make local group intrinsics 32 bit wide
v3: always use 32 bit constant for local_size
v4: add comment by Jason
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
we define it inside 'include/c99_math.h' so it is safe to use.
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The idea is that for repeated use of the same uniform, we could avoid
loading it on each consumer. The results look pretty good.
total instructions in shared programs: 6413571 -> 6521464 (1.68%)
total threads in shared programs: 154214 -> 154000 (-0.14%)
total uniforms in shared programs: 2393604 -> 2119629 (-11.45%)
total spills in shared programs: 4960 -> 4984 (0.48%)
total fills in shared programs: 6350 -> 6418 (1.07%)
Once we do scheduling at the NIR level, the register pressure (and thus
also instructions) issues we see here will drop back down.