Commit Graph

418 Commits

Author SHA1 Message Date
Jason Ekstrand f2dc0f2872 nir: Drop imov/fmov in favor of one mov instruction
The difference between imov and fmov has been a constant source of
confusion in NIR for years.  No one really knows why we have two or when
to use one vs. the other.  The real reason is that they do different
things in the presence of source and destination modifiers.  However,
without modifiers (which many back-ends don't have), they are identical.
Now that we've reworked nir_lower_to_source_mods to leave one abs/neg
instruction in place rather than replacing them with imov or fmov
instructions, we don't need two different instructions at all anymore.

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
Acked-by: Rob Clark <robdclark@chromium.org>
2019-05-24 08:38:11 -05:00
Jonathan Marek d0bff89159 nir: allow specifying a set of opcodes in lower_alu_to_scalar
This can be used by both etnaviv and freedreno/a2xx as they are both vec4
architectures with some instructions being scalar-only.

Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
2019-05-10 15:10:41 +00:00
Ian Romanick 1f1007a4ed nir: Initialize lower_flrp_progress everywhere
I don't know why I thought NIR_PASS always set the progress variable.
Derp.

Fixes: d41cdef2a5 ("nir: Use the flrp lowering pass instead of nir_opt_algebraic")
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Coverity CID: 1444996
Coverity CID: 1444995
Coverity CID: 1444994
Coverity CID: 1444993
Coverity CID: 1444991
Coverity CID: 1444989
2019-05-09 10:03:51 -07:00
Ian Romanick d41cdef2a5 nir: Use the flrp lowering pass instead of nir_opt_algebraic
I tried to be very careful while updating all the various drivers, but I
don't have any of that hardware for testing. :(

i965 is the only platform that sets always_precise = true, and it is
only set true for fragment shaders.  Gen4 and Gen5 both set lower_flrp32
only for vertex shaders.  For fragment shaders, nir_op_flrp is lowered
during code generation as a(1-c)+bc.  On all other platforms 64-bit
nir_op_flrp and on Gen11 32-bit nir_op_flrp are lowered using the old
nir_opt_algebraic method.

No changes on any other Intel platforms.

v2: Add panfrost changes.

Iron Lake and GM45 had similar results. (Iron Lake shown)
total cycles in shared programs: 188647754 -> 188647748 (<.01%)
cycles in affected programs: 5096 -> 5090 (-0.12%)
helped: 3
HURT: 0
helped stats (abs) min: 2 max: 2 x̄: 2.00 x̃: 2
helped stats (rel) min: 0.12% max: 0.12% x̄: 0.12% x̃: 0.12%

Reviewed-by: Matt Turner <mattst88@gmail.com>
2019-05-06 22:52:29 -07:00
Christian Gmeiner 4e110eca42 nir: nir_shader_compiler_options: drop native_integers
Driver which do not support native integers should use a lowering
pass to go from integers to floats.

Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2019-05-07 07:35:52 +02:00
Eric Engestrom 7ca8ba199f delete autotools .gitignore files
One special case, `src/util/xmlpool/.gitignore` is not entirely deleted,
as `xmlpool.pot` still gets generated (eg. by `ninja xmlpool-pot`).

Signed-off-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
2019-04-29 21:17:19 +00:00
Eric Anholt fb0611df3d v3d: Fix detection of TMU write sequences in register spilling.
We can't use the QPU functions to detect this until register allocation is
done and we've moved inst->dst into inst->qpu.

Fixes bad TMU sequences from register spilling in
KHR-GLES31.core.compute_shader.shared-max.
2019-04-26 12:42:30 -07:00
Eric Anholt 18894a5e5a v3d: Fix detection of the last ldtmu before a new TMU op.
We were looking at the start instruction, instead of scanning through the
list of following instructions to find any more ldtmus.
2019-04-26 12:42:30 -07:00
Eric Anholt 575caab895 v3d: Re-add support for memory_barrier_shared.
Looks like I lost it in a rebase conflict resolution.  We'd hit the
unknown intrinsic assertion in
KHR-GLES31.core.compute_shader.shared-struct.

Fixes: 6b1c659825 ("v3d: Add Compute Shader compilation support.")
2019-04-26 12:42:30 -07:00
Eric Anholt 4358904c06 v3d: Add a note about i/o indirection for future performance work. 2019-04-26 12:42:30 -07:00
Eric Anholt 24587ae8ae v3d: Assert that we do request the normal texturing return data.
An unused tex should be DCEed, but if it wasn't we'd run into trouble with
not doing a TMUWT.
2019-04-26 12:42:30 -07:00
Eric Anholt 12f6c34806 v3d: Fix atomic cmpxchg in shaders on hardware.
In what might be my first case of finding a divergence between hardware
and simpenrose for v3d 4.x, it seems that despite what the spec claims,
you actually need specific values in the TYPE field for atomic ops.

Fixes dEQP-GLES31.functional.*.compswap.*
2019-04-18 13:24:55 -07:00
Eric Anholt 1ce143ca19 v3d: Fix an invalid reuse of flags generation from before a thrsw.
Noticed while debugging the last GLES 3.1 failure, though it doesn't seem
to affect that bug.
2019-04-18 13:24:55 -07:00
Eric Anholt 697e2e1f26 v3d: Always set up the qregs for CSD payload.
We were failing to set up payload[1] for use by LocalInvocationIndex/ID
and shared variable accesses if gl_WorkGroupID/gl_GlobalInvocationID
wasn't used (possibly because you only have one workgroup).  You're always
going to use payload[1], and payload[0] is common enough and we have DCE
in the backend to clean it up if it happens to not be used.
2019-04-16 12:10:39 -07:00
Eric Anholt 1bc71e8b65 v3d: Only look up the 3rd texture gather offset for non-arrays.
Fixes assertion failures in the CTS since Karol's cleanup when NIR started
noticing that we were reading an invalid component.

Fixes: 5450f1c9fb ("v3d: prefer using nir_src_comp_as_int over nir_src_as_const_value")
2019-04-16 12:07:59 -07:00
Dylan Baker 95aefc94a9 Delete autotools
Acked-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
Acked-by: Marek Olšák <marek.olsak@amd.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Acked-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Acked-by: Matt Turner <mattst88@gmail.com>
2019-04-15 13:44:29 -07:00
Karol Herbst 14531d676b nir: make nir_const_value scalar
v2: remove & operator in a couple of memsets
    add some memsets
v3: fixup lima

Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> (v2)
2019-04-14 22:25:56 +02:00
Eric Anholt dc402be73e v3d: Use the new lower_to_scratch implementation for indirects on temps.
We can use the same register spilling infrastructure for our loads/stores
of indirect access of temp variables, instead of doing an if ladder.

Cuts 50% of instructions and max-temps from 2 KSP shaders in shader-db.
Also causes several other KSP shaders with large bodies and large loop
counts to not be force-unrolled.

The change was originally motivated by NOLTIS slightly modifying register
pressure in piglit temp mat4 array read/write tests, triggering register
allocation failures.
2019-04-12 16:16:58 -07:00
Eric Anholt 8a2d91e124 v3d: Detect the correct number of QPUs and use it to fix the spill size.
We were missing a * 4 even if the particular hardware matched our
assumption.
2019-04-12 15:59:31 -07:00
Eric Anholt 11ba8a46e4 v3d: Add missing dumping for the spill offset/size uniforms. 2019-04-12 15:59:31 -07:00
Eric Anholt 42cf57f186 v3d: Add missing base offset to CS shared memory accesses.
This code is so touchy, trying to emit the minimum amount of address math.
Some day we'll move it all to NIR, I hope.
2019-04-12 15:59:31 -07:00
Eric Anholt 6b1c659825 v3d: Add Compute Shader compilation support.
While waiting for the CSD UABI to get reviewed, I keep having to rebase
the CS patch.  Just land the compiler side for now to keep it from
diverging.

For now this covers just GLES 3.1 compute shaders, not CL kernels.
2019-04-12 15:59:31 -07:00
Eric Anholt 1e0a72ce09 v3d: Replace the old shader-db env var output with the ARB_debug_output.
We're using ARB_debug_output for the main shader-db, but I had this env
var left around from the shader-db-2 support (vc4 apitrace-based).  Keep
the env var around since it's nice sometimes to get the stats on a shader
you're optimizing without having to do a shader-db run, but drop the old
formatting that's not useful and keeps tricking me when I go to add
another measurement to the shader-db output.
2019-04-12 15:59:31 -07:00
Eric Anholt b02dbaa8ce v3d: Include the number of max temps used in the shader-db output.
This gives us finer-grained feedback on how we're doing on register
pressure than "did we trigger a new shader to spill or drop thread count?"
2019-04-12 15:59:24 -07:00
Eric Anholt 89b7df552b v3d: Add and use a define for the number of channels in a QPU invocation.
A shader invocation always executes 16 channels together, so we often end
up multiplying things by this magic 16 number.  Give it a name.
2019-04-12 15:58:28 -07:00
Timothy Arceri 035759b61b nir/i965/freedreno/vc4: add a bindless bool to type size functions
This required to calculate sizes correctly when we have bindless
samplers/images.

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2019-04-12 09:02:59 +02:00
Eric Anholt 8f065596d2 v3d: Add an optimization pass for redundant flags updates.
Our exec masking introduces lots of redundant flags updates, and even
without that there will be cases where NIR comparisons on the same sources
for different reasons may generate the same comparison instruction before
the selection.

total instructions in shared programs: 6492930 -> 6460934 (-0.49%)
total uniforms in shared programs: 2117460 -> 2115106 (-0.11%)
total spills in shared programs: 4983 -> 4987 (0.08%)
total fills in shared programs: 6408 -> 6416 (0.12%)
2019-04-11 09:24:02 -07:00
Jason Ekstrand 6279074de1 nir: Get rid of global registers
We have a pass to lower global registers to locals and many drivers
dutifully call it.  However, no one ever creates a global register ever
so it's all dead code.  It's time we bury it.

Acked-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2019-04-09 00:29:36 -05:00
Karol Herbst 5450f1c9fb v3d: prefer using nir_src_comp_as_int over nir_src_as_const_value
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
2019-04-07 15:13:36 +02:00
Eric Anholt 276d22c52d v3d: Add some more new packets for V3D 4.x.
The T/G shader references and common state will be needed for GLES 3.2.
2019-04-04 17:30:35 -07:00
Eric Anholt 62360e92ec v3d: Bump the maximum texture size to 4k for V3D 4.x.
4.1 and 4.2 both have the same 16k limit, but it I'm seeing GPU hangs in
the CTS at 8k and 16k.  4k at least lets us get one 4k display working.

Cc: mesa-stable@lists.freedesktop.org
2019-04-04 17:30:35 -07:00
Eric Anholt bfed0a7099 v3d: Remove some dead members of struct v3d_compile.
These are more vc4 leftovers.
2019-03-21 14:20:50 -07:00
Eric Anholt 16f2770eb4 v3d: Upload all of UBO[0] if any indirect load occurs.
The idea was that we could skip uploading the constant-indexed uniform
data and just upload the uniforms that are variably-indexed.  However,
since the VS bin and render shaders may have a different set of uniforms
used, this meant that we had to upload the UBO for each of them.  The
first case is generally a fairly small impact (usually the uniform array
is the most space, other than a couple of FSes in shader-db), while the
second is a larger impact: 3DMMES2 was uploading 38k/frame of uniforms
instead of 18k.

Given that the optimization is of dubious value, has a big downside, and
is quite a bit of code, just drop it.  No change in shader-db.  No change
on 3DMMES2 (n=15).
2019-03-21 14:20:50 -07:00
Eric Anholt 320e96bace v3d: Move constant offsets to UBO addresses into the main uniform stream.
We'd end up with the constant offset in the uniform stream anyway, since
they're bigger than small immediates.  Avoids the extra uniforms and adds
in the shader in favor of just adding once on the CPU.

shader-db:
total instructions in shared programs: 6496865 -> 6494851 (-0.03%)
total uniforms in shared programs: 2119511 -> 2117243 (-0.11%)
2019-03-21 14:20:50 -07:00
Eric Anholt c36d2793ec v3d: Rename v3d_tmu_config_data to v3d_unit_data.
I want to reuse this for encoding small constant UBO/SSBO offsets into the
uniform stream to reduce the extra uniform loads and adds for the small
constant offsets.
2019-03-21 14:20:50 -07:00
Eric Anholt 0c874c18cd v3d: Fix leak of the mem_ctx after the DAG refactor.
Noticed while trying to get a CTS run again.

Fixes: 33886474d6 ("v3d: Use the DAG datastructure for QPU instruction scheduling.")
2019-03-12 16:15:40 -07:00
Eric Anholt 33886474d6 v3d: Use the DAG datastructure for QPU instruction scheduling.
Just a small code reduction from shared infrastructure.
2019-03-11 13:14:32 -07:00
Eric Anholt 7c01ddbf7f v3d: Reuse list_for_each_entry_rev(). 2019-03-11 13:14:32 -07:00
Eric Anholt c4d2da1f14 v3d: Include a count of register pressure in the RA failure dumps.
You usually want to go find the highest pressure and figure out why you
couldn't spill or what pattern led to a bunch of pressure leading to that
point.
2019-03-06 14:13:45 -08:00
Eric Anholt 5c655c47db v3d: Drop the V3D 3.x vpm read dead code elimination.
We now have NIR dead code eliminating our VPM reads, so this shouldn't be
necessary.
2019-03-05 12:57:39 -08:00
Eric Anholt e8ee1f8eaf v3d: Eliminate the TLB and TLBU files.
We can just use the magic register file like we do for other magic waddrs.
2019-03-05 12:57:39 -08:00
Eric Anholt 110f14d4b4 v3d: Use ldunif instructions for uniforms.
The idea is that for repeated use of the same uniform, we could avoid
loading it on each consumer.  The results look pretty good.

total instructions in shared programs: 6413571 -> 6521464 (1.68%)
total threads in shared programs: 154214 -> 154000 (-0.14%)
total uniforms in shared programs: 2393604 -> 2119629 (-11.45%)
total spills in shared programs: 4960 -> 4984 (0.48%)
total fills in shared programs: 6350 -> 6418 (1.07%)

Once we do scheduling at the NIR level, the register pressure (and thus
also instructions) issues we see here will drop back down.
2019-03-05 12:57:39 -08:00
Eric Anholt 4036fce8fd v3d: Add support for register-allocating a ldunif to a QFILE_TEMP.
On V3D 4.x, we can use ldunifrf to load uniforms to any register, and this
will let us schedule the ldunif wherever we want in the program.
2019-03-05 12:57:39 -08:00
Eric Anholt 70df388219 v3d: Drop the old class bits splitting up the accumulators.
This seems to be left over from vc4, and I don't use them any more.
2019-03-05 12:57:39 -08:00
Eric Anholt dff1fc04e0 v3d: Add support for vir-to-qpu of ldunif instructions to a temp.
We can load a uniform to any register, so add support for non-ALU
instructions with sig.ldunif to a temp.
2019-03-05 12:57:39 -08:00
Eric Anholt 4739181a16 v3d: Switch implicit uniforms over to being any qinst->uniform != ~0.
I'm not sure why I didn't do this before -- it's clearly much simpler to
add dumping of the extra thing than to have it as another implicit source.
2019-03-05 12:57:39 -08:00
Eric Anholt 1e98f02d88 v3d: Do uniform rematerialization spilling before dropping threadcount
This feels like the right tradeoff for threads vs uniforms, particularly
given that we often have very short thread segments right now:

total instructions in shared programs: 6411504 -> 6413571 (0.03%)
total threads in shared programs: 153946 -> 154214 (0.17%)
total uniforms in shared programs: 2387665 -> 2393604 (0.25%)
2019-03-05 12:57:39 -08:00
Eric Anholt 060979a380 v3d: Fix temporary leaks of temp_registers and when spilling.
On each iteration of successfully spilling a reg, we'd allocate another
copy of temp_registers, and when decrementing thread conut we'd allocate
another copy of the graph.  These all got cleaned up on freeing the
compile.
2019-03-05 12:57:39 -08:00
Eric Anholt 2780a99ff8 v3d: Move the stores for fixed function VS output reads into NIR.
This lets us emit the VPM_WRITEs directly from
nir_intrinsic_store_output() (useful once NIR scheduling is in place so
that we can reduce register pressure), and lets future NIR scheduling
schedule the math to generate them.  Even in the meantime, it looks like
this lets NIR DCE some more code and make better decisions.

total instructions in shared programs: 6429246 -> 6412976 (-0.25%)
total threads in shared programs: 153924 -> 153934 (<.01%)
total loops in shared programs: 486 -> 483 (-0.62%)
total uniforms in shared programs: 2385436 -> 2388195 (0.12%)

Acked-by: Ian Romanick <ian.d.romanick@intel.com> (nir)
2019-03-05 10:59:40 -08:00
Eric Anholt a9dd227a47 v3d: Translate f2i(fround_even) as FTOIN.
This appears to be just what the opcode does.  Needed for equivalence when
moving FF VPM stores into NIR.
2019-03-05 10:59:40 -08:00