Commit Graph

656 Commits

Author SHA1 Message Date
Francisco Jerez acf98ff933 intel/fs: Teach instruction scheduler about GRF bank conflict cycles.
This should allow the post-RA scheduler to do a slightly better job at
hiding latency in presence of instructions incurring bank conflicts.
The main purpuse of this patch is not to improve performance though,
but to get conflict cycles to show up in shader-db statistics in order
to make sure that regressions in the bank conflict mitigation pass
don't go unnoticed.

Acked-by: Matt Turner <mattst88@gmail.com>
2017-12-07 15:56:49 -08:00
Francisco Jerez af2c320190 intel/fs: Implement GRF bank conflict mitigation pass.
Unnecessary GRF bank conflicts increase the issue time of ternary
instructions (the overwhelmingly most common of which is MAD) by
roughly 50%, leading to reduced ALU throughput.  This pass attempts to
minimize the number of bank conflicts by rearranging the layout of the
GRF space post-register allocation.  It's in general not possible to
eliminate all of them without introducing extra copies, which are
typically more expensive than the bank conflict itself.

In a shader-db run on SKL this helps roughly 46k shaders:

   total conflicts in shared programs: 1008981 -> 600461 (-40.49%)
   conflicts in affected programs: 816222 -> 407702 (-50.05%)
   helped: 46234
   HURT: 72

The running time of shader-db itself on SKL seems to be increased by
roughly 2.52%±1.13% with n=20 due to the additional work done by the
compiler back-end.

On earlier generations the pass is somewhat less effective in relative
terms because the hardware incurs a bank conflict anytime the last two
sources of the instruction are duplicate (e.g. while trying to square
a value using MAD), which is impossible to avoid without introducing
copies.  E.g. for a shader-db run on SNB:

   total conflicts in shared programs: 944636 -> 623185 (-34.03%)
   conflicts in affected programs: 853258 -> 531807 (-37.67%)
   helped: 31052
   HURT: 19

And on BDW:

   total conflicts in shared programs: 1418393 -> 987539 (-30.38%)
   conflicts in affected programs: 1179787 -> 748933 (-36.52%)
   helped: 47592
   HURT: 70

On SKL GT4e this improves performance of GpuTest Volplosion by 3.64%
±0.33% with n=16.

NOTE: This patch intentionally disregards some i965 coding conventions
      for the sake of reviewability.  This is addressed by the next
      squash patch which introduces an amount of (for the most part
      boring) boilerplate that might distract reviewers from the
      non-trivial algorithmic details of the pass.

The following patch is squashed in:

SQUASH: intel/fs/bank_conflicts: Roll back to the nineties.

Acked-by: Matt Turner <mattst88@gmail.com>
2017-12-07 15:56:06 -08:00
Jose Maria Casanova Crespo a1e257a5bf i965/fs: Use untyped_surface_read for 16-bit load_ssbo
SSBO loads were using byte_scattered read messages as they allow
reading 16-bit size components. byte_scattered messages can only
operate one component at a time so we needed to emit as many messages
as components.

But for vec2 and vec4 of 16-bit, being multiple of 32-bit we can use the
untyped_surface_read message to read pairs of 16-bit components using only
one message. Once each pair is read it is unshuffled to return the proper
16-bit components. vec3 case is assimilated to vec4 but the 4th component
is ignored.

16-bit scalars are read using one byte_scattered_read message.

v2: Removed use of stride = 2 on sources (Jason Ekstrand)
    Rework optimization using unshuffle 16 reads (Chema Casanova)
v3: Use W and D types insead of HF and F in shuffle to avoid rounding
    erros (Jason Ekstrand)
    Use untyped_surface_read for 16-bit vec3. (Jason Ekstrand)
v4: Use subscript insead of chaging type and stride  (Jason Ekstrand)

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo ce2e572c4c i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg
Currently, we use byte-scattered write messages for storing 16-bit
into an SSBO. This is because untyped surface messages have a fixed
32-bit size.

This patch optimizes these 16-bit writes by combining 2 values (e.g,
two consecutive components aligned with 32-bits) into a 32-bit register,
packing the two 16-bit words.

16-bit single component values will continue to use byte-scattered
write messages. The same will happens when the first consecutive
component is not aligned 32-bits.

This optimization reduces the number of SEND messages used for storing
16-bit values potentially by 2 or 4, which cuts down execution time
significantly because byte-scattered writes are an expensive
operation as they only write a component for message.

v2: Removed use of stride = 2 on sources (Jason Ekstrand)
    Rework optimization using shuffle 16 write and enable writes
    of 16bit vec4 with only one message of 32-bits. (Chema Casanova)
v3: - Fix coding style (Eduardo Lima)
    - Reorganize code to avoid duplication. (Jason Ekstrand)
    - Include new comments to explain the length calculations to
      fix alignment issues of components. (Jason Ekstrand)
    - Fix issues with writemask yz with 16-bit writes. (Jason Ektrand)
v4: (Jason Ekstrand)
    - Reorganize 64-bit ssbo-writes to avoid using slots_per_component.
    - Comment about why suffle is needed when using byte_scattered_write.

Signed-off-by: Eduardo Lima <elima@igalia.com>
Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Jason Ekstrand 3282309f74 i965/fs: Enables 16-bit load_ubo with sampler
load_ubo is using 32-bit loads as uniforms surfaces have a 32-bit
surface format defined. So when reading 16-bit components with the
sampler we need to unshuffle two 16-bit components from each 32-bit
component.

Using the sampler avoids the use of the byte_scattered_read message
that needs one message for each component and is supposed to be
slower.

v2: (Jason Ekstrand)
    - Simplify component selection and unshuffling for different bitsizes
    - Remove SKL optimization of reading only two 32-bit components when
      reading 16-bits types.

Reviewed-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo 3db31c0b06 i965/fs: Helpers for un/shuffle 16-bit pairs in 32-bit components
This helpers are used to load/store 16-bit types from/to 32-bit
components.

The functions shuffle_32bit_load_result_to_16bit_data and
shuffle_16bit_data_for_32bit_write are implemented in a similar
way than the analogous functions for handling 64-bit types.

v1: Explain need of temporary in shuffle operations. (Jason Ekstrand)

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo fa4a9d63bb i965/fs: Use byte scattered read for 16-bit load_ssbo
Used to enable 16-bit reads at do_untyped_vector_read, that is used on
the following intrinsics:

   * nir_intrinsic_load_shared
   * nir_intrinsic_load_ssbo

v2: Removed use of stride = 2 on 16-bit sources (Jason Ekstrand)

v3: - Add bitsize to scattered read operation (Jason Ekstrand)
    - Remove implementation of 16-bit UBO read from this patch.
    - Avoid assertion at opt_algebraic caused by ADD of two IMM with
      offset with BRW_REGISTER_TYPE_UD type found on matrix tests.
      (Jose Maria Casanova)
v4: (Jason Ekstrand)
    - Put if case for 16-bits at the beginning of the if ladder.
    - Use type_sz(dest.type) * 8 as bit_size parameter for scattered read.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo c57a3f200d i965/fs: Add byte scattered read message and fs support
v2: Fix alignment style (Topi Pohjolainen)
    (Jason Ekstrand)
    - Enable bit_size parameter to scattered messages to enable different
      bitsizes byte/word/dword.
    - Remove use of brw_send_indirect_scattered_message in favor of
      brw_send_indirect_surface_message.
    - Move scattered messages to surface messages namespace.
    - Assert align1 for scattered messages and assume Gen8+.
    - Inline brw_set_dp_byte_scattered_read.

v3: (Jason Ekstrand)
    - Use renamed brw_byte_scattered_data_element_from_bit_size method
    - Assert scattered read for Gen8+ and Haswell.
    - Use conditional expresion at components_read.
    - Include comment about params for scattered opcodes.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Alejandro Piñeiro a4031bdfa9 i965/fs: Predicate byte scattered writes if needed
While on Untyped Surface messages the bits of the execution mask are
ANDed with the corresponding bits of the Pixel/Sample Mask, that is
not the case for byte scattered writes. That is needed to avoid ssbo
stores writing on helper invocations. So when that can affect, we load
the sample mask, and predicate the send message.

Note: the need for this patch was tested with a custom test. Right now
the 16 bit storage CTS tests doesnt need this path in order to get a
full pass.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Alejandro Piñeiro 96f1926aab i965/fs: Use byte_scattered_write on 16-bit store_ssbo
We need to rely on byte scattered writes as untyped writes are 32-bit
size. We could try to keep using 32-bit messages when we have two or
four 16-bit elements, but for simplicity sake, we use the same message
for any component number. We revisit this aproach in the follwing
patches.

v2: Removed use of stride = 2 on 16-bit sources (Jason Ekstrand)

v3: (Jason Ekstrand)
    - Include bit_size to scattered write message and remove namespace
    - specific for scattered messages.
    - Move comment to proper place.
    - Squashed with i965/fs: Adjust type_size/type_slots on store_ssbo.
    (Jose Maria Casanova)
    - Take into account that get_nir_src returns now WORD types for
      16-bit sources instead of DWORD.
v4: (Jason Ekstrand)
    - Rename lenght variable to num_components.
    - Include assertions before emit_untyped_write.
    - Remove type_slot in favor of num_slot and first_slot.

Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo f1a9936ee1 i965/fs: Add byte scattered write message and fs support
v2: (Jason Ekstrand)
    - Enable bit_size parameter to scattered messages to enable different
      bitsizes byte/word/dword.
    - Remove use of brw_send_indirect_scattered_message in favor of
      brw_send_indirect_surface_message.
    - Move scattered messages to surface messages namespace.
    - Assert align1 for scattered messages and assume Gen8+.
    - Inline brw_set_dp_byte_scattered_write.
v3: - Remove leftover newline (Topi Pohjolainen)
    - Rename brw_data_size to brw_scattered_data_element and use
      defines instead of an enum (Jason Ekstrand)
    - Assert scattered write for Gen8+ and Haswell (Jason Ekstrand)

Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Alejandro Piñeiro d038deaa40 i965/fs: Add remove_extra_rounding_modes optimization
Although from SPIR-V point of view, rounding modes are attached to the
operation/destination, on i965 it is a status, so we don't need to
explicitly set the rounding mode if the one we want is already set.

Taking into account that the default mode is RTE, one possible
optimization would be optimize out the first RTE set for each
block. For in order to work, we would need to take into account block
interrelationships. At this point, it is not worth to complicate the
optimization for such small gain.

v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
    with the rounding mode (Curro)
v3: Reset optimization for every block. (Jason Ekstrand)

Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Alejandro Piñeiro 82fa4d45e7 i965/fs: Enable rounding mode on f2f16 ops
By default we don't set the rounding mode. We only set
round-to-near-even or round-to-zero mode if explicitly set from nir.

v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
    with the rounding mode (Curro)

v3: Use new helper brw_rnd_mode_from_nir_op  (Jason Ekstrand)

Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Alejandro Piñeiro d6cd14f213 i965/fs: Define new shader opcode to set rounding modes
Although it is possible to emit them directly as AND/OR on brw_fs_nir,
having a specific opcode makes it easier to remove duplicate settings
later.

v2: (Curro)
  - Set thread control to 'switch' when using the control register
  - Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
    with the rounding mode.
  - Avoid magic numbers setting rounding mode field at control register.
v3: (Curro)
  - Remove redundant and add missing whitespace lines.
  - Match printing instruction to IR opcode "rnd_mode"

v4: (Topi Pohjolainen)
  - Fix code style.

Signed-off-by:  Alejandro Piñeiro <apinheiro@igalia.com>
Signed-off-by:  Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo ac8d4734f6 i965: Add support for control register
Control register cr0 in i965 can be used to change the rounding modes
in 32-bit to 16-bit floating-point conversions.

From intel Skylake PRM, vol 07, section "Register and Tegister Regions",
 subsection "Control Register" (page 754):

"Subregister cr0.0:ud contains normal operation control fields such as the
 floating-point mode ... "

Floating-point Rounding mode is changed at bits 5:4 of cr0.0:

"Rounding Mode. This field specifies the FPU rounding mode. It is
initialized by Thread Dispatch."
  00b = Round to Nearest or Even (RTNE)
  01b = Round Up, toward +inf (RU)
  10b = Round Down, toward -inf (RD)
  11b = Round Toward Zero (RTZ)"

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Alejandro Piñeiro 5d5ee507fb i965/fs: Handle 32-bit to 16-bit conversions
Conversions to 16-bit need having aligment between the 16-bit
and 32-bit types. So the conversion operations unpack 16-bit types
to with an stride=2 and then applies a MOV with the conversion.

v2 (Jason Ekstrand):
  - Avoid the general use of stride=2 for 16-bit register types.

v3 (Topi Pohjolainen)
  - Code style fix
   (Jason Ekstrand)
  - Now nir_op_f2f16 was renamed to nir_op_f2f16_undef
    because conversion to f16 with undefined rounding is explicit

Signed-off-by: Eduardo Lima <elima@igalia.com>
Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com>
Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Alejandro Piñeiro a05b6f25bf i965/fs: Remove BRW_REGISTER_TYPE_HF assert at get_exec_type
Note that we don't remove the assert at i965/vec4. At this point half
float support is only for the scalar backend.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Jose Maria Casanova Crespo 75a88d8567 i965: Support for 16-bit base types in helper functions
v2: Fixed calculation of scalar size for 16-bit types. (Jason Ekstrand)
v3: Fix coding style (Topi Pohjolainen)

Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Signed-off-by: Eduardo Lima <elima@igalia.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Alejandro Piñeiro 2d28ca7000 i965/vec4: Handle 16-bit types at type_size_xvec4
These types have similar vec4 sizes as their 32-bit counterparts.

The vec4 backend doesn't support 16-bit types and probably never will,
but this method is called by the scalar backend at
fs_visitor::nir_setup_outputs(), so we still need to provide valid vec4
sizes for 16-bit types. In the future, something different should be
implemented to avoid this dependency.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-12-06 08:57:18 +01:00
Rafael Antognolli 2919adffe9 intel/compiler: Implement WaClearTDRRegBeforeEOTForNonPS.
The bspec describes:

   "WA: Clear tdr register before send EOT in all non-PS shader kernels

   mov(8) tdr0:ud 0x0:ud {NoMask}"

Signed-off-by: Rafael Antognolli <rafael.antognolli@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-12-01 11:27:27 -08:00
Iago Toral Quiroga 8620f7ebbc i965/vec4: use a temp register to compute offsets for pull loads
64-bit pull loads are implemented by emitting 2 separate
32-bit pull load messages, where the second message loads from
an offset at +16B.

That addition of 16B to the original offset should not alter the
original offset register used as source for the pull load instruction
though, since the compiler might use that same offset register in other
instructions (for example, for other pull loads in the shader code
that take that same offset as reference).

If the pull load is 32-bit then we only need to emit one message and
we don't need to do offset calculations, but in that case the optimizer
should be able to drop the redundant MOV.

Fixes the following test on Haswell:
KHR-GL45.gpu_shader_fp64.fp64.max_uniform_components

Reviewed-by: Matt Turner <mattst88@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103007
2017-11-30 07:57:53 +01:00
Iago Toral Quiroga f1873956db i965/vec4: fix splitting of interleaved attributes
When we split an instruction that reads an uniform value
(vstride 0) we need to respect the vstride on the second
half of the instruction (that is, the second half should
read the same region as the first).

We were doing this already, but we didn't account for
stages that have interleaved input attributes which also
have a vstride of 0 and need the same treatment.

Fixes the following on Haswell:
KHR-GL45.enhanced_layouts.varying_locations
KHR-GL45.enhanced_layouts.varying_array_locations
KHR-GL45.enhanced_layouts.varying_structure_locations

Reviewed-by: Matt Turner <mattst88@gmail.com>
Acked-by: Andres Gomez <agomez@igalia.com>
2017-11-24 09:24:06 +01:00
Matt Turner beaea7abfa i965/fs: Check ADD/MAD with immediates in satprop unit test
The gen had to be changed from 4 to 6 so that we could test MAD, which
is new on Gen6.

mad_imm_float_neg_mov_sat tests the case fixed by the previous commit.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
2017-11-21 10:13:07 -08:00
Matt Turner a05af1f7b8 i965/fs: Handle negating immediates on MADs when propagating saturates
MADs don't take immediate sources, but we allow them in the IR since it
simplifies a lot of things. I neglected to consider that case.

Fixes: 4009a9ead4 ("i965/fs: Allow saturate propagation to propagate
                      negations into MADs.")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103616
Reported-and-Tested-by: Ruslan Kabatsayev <b7.10110111@gmail.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
2017-11-21 10:13:07 -08:00
Tapani Pälli 6236ffeb83 intel: fix disasm_info memory leaks
Fixes: 4f82b17287 ("i965: Rewrite disassembly annotation code")
Cc: Matt Turner <mattst88@gmail.com>
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-11-21 08:36:43 +02:00
Jason Ekstrand 1eab327ba7 i965: Stop including brw_cfg.h in brw_disasm_info.h
The brw_disasm_info header is included by certain tools in order to get
shader assembly from binaries so it's a semi-external header.  Including
brw_cfg.h also pulls in brw_shader.h so you end up getting quite a bit
of our back-end compiler internals.  Instead, make the couple of forward
declarations we need and make the header more stand-alone.  This fixes
the meson build.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 4f82b17287
2017-11-17 21:51:16 -08:00
Andres Gomez 1866f7aee5 i965: Correct disasm_info usage in eu_validate test
Fixes: 4f82b17287 ("i965: Rewrite disassembly annotation code")

Cc: Matt Turner <mattst88@gmail.com>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-11-18 03:07:06 +02:00
Matt Turner 821ec473a8 i965: Rename intel_asm_annotation -> brw_disasm_info
It was the only file named intel_* in the compiler.

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-11-17 12:14:38 -08:00
Matt Turner 4f82b17287 i965: Rewrite disassembly annotation code
The old code used an array to store each "instruction group" (the new,
better name than the old overloaded "annotation"), and required a
memmove() to shift elements over in the array when we needed to split a
group so that we could add an error message. This was confusing and
difficult to get right, not the least of which was  because the array
has a tail sentinel not included in .ann_count.

Instead use a linked list, a data structure made for efficient
insertion.

Acked-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-11-17 12:14:38 -08:00
Matt Turner f80e97346b i965: Simplify annotation_insert_error()
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-11-17 12:14:38 -08:00
Matt Turner f4276ef7ef i965: Move common code out of #ifdef
I'm going to change the call in a later patch and with the difference in
indentation level it wasn't immediately obvious that the calls were
identical.

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-11-17 12:14:38 -08:00
Kenneth Graunke e48cc01be9 intel: Drop mtypes.h include from brw_compiler.h.
This isn't necessary and causes trouble for a project I'm working on.
2017-11-15 09:37:32 -08:00
Kenneth Graunke ff964916dc i965: Use nir_lower_atomics_to_ssbos and delete ABO compiler code.
We use the same hardware mechanism for both atomic counters and SSBO
atomics, so there's really no benefit to maintaining separate code to
handle each case.  Instead, we can just use Rob's shiny new NIR pass to
convert atomic_uints to SSBOs, and delete piles of code.

The ssbo_start section of the binding table becomes a combined ABO and
SSBO section, with ABOs first, then SSBOs.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-11-15 09:37:32 -08:00
Matt Turner a31d038208 Revert "intel/fs: Use a pure vertical stride for large register strides"
This reverts commit e8c9e65185.

With the actual bug fixed (by commit 6ac2d16901), this is not
necessary. I'm doubtful of its correctness in any case.
2017-11-14 11:24:08 -08:00
Matt Turner 6ac2d16901 i965/fs: Fix extract_i8/u8 to a 64-bit destination
The MOV instruction can extract bytes to words/double words, and
words/double words to quadwords, but not byte to quadwords.

For unsigned byte to quadword, we can read them as words and AND off the
high byte and extract to quadword in one instruction. For signed bytes,
we need to first sign extend to word and the sign extend that word to a
quadword.

Fixes the following test on CHV, BXT, and GLK:
   KHR-GL46.shader_ballot_tests.ShaderBallotBitmasks
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103628
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-11-14 10:56:18 -08:00
Matt Turner cfcfa0b9cd i965/fs: Split all 32->64-bit MOVs on CHV, BXT, GLK
Fixes the following tests on CHV, BXT, and GLK:
    KHR-GL46.shader_ballot_tests.ShaderBallotFunctionBallot
    dEQP-VK.spirv_assembly.instruction.compute.uconvert.uint32_to_int64
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103115
2017-11-14 10:56:18 -08:00
Jason Ekstrand 951a5dc4cc intel/nir: Use the correct indirect lowering masks in link_shaders
Previously, if we were linking a vec4 VS with a SIMD8/16 FS, we wouldn't
lower indirects on the fragment shader which is wrong.  Instead of using
a single indirect mask, take advantage of our new little helper.

Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-08 20:10:04 -08:00
Jason Ekstrand 3e63cf893f intel/nir: Break the linking code into a helper in brw_nir.c
Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-08 14:09:51 -08:00
Jason Ekstrand 7364f080f9 intel/nir: Add a helper for getting the NoIndirect mask
Reviewed-by: Timothy Arceri <tarceri at itsqueeze.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-08 14:09:49 -08:00
Jason Ekstrand d002950e54 intel/fs/nir: Return Q types from brw_reg_type_for_bit_size
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
2017-11-07 10:41:24 -08:00
Jason Ekstrand dee58ecd2e intel/fs/nir: Use Q immediates for load_const on gen8+
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
2017-11-07 10:41:24 -08:00
Jason Ekstrand 9bb34892bf intel/fs/nir: Setup immediates based on type in i2b and f2b
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
2017-11-07 10:41:24 -08:00
Jason Ekstrand 1cb210f4bc intel/reg: Add helpers for 64-bit integer immediates
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
2017-11-07 10:41:24 -08:00
Jason Ekstrand ab9220edd6 nir,intel/compiler: Use a fixed subgroup size
The GL_ARB_shader_ballot spec says that gl_SubGroupSizeARB is declared
as a uniform.  This means that it cannot change across an invocation
such as a draw call or a compute dispatch.  For compute shaders, we're
ok because we only ever use one dispatch size.  For fragment, however,
the hardware dynamically chooses between SIMD8 and SIMD16 which violates
the spec.  Instead, let's just pick a subgroup size based on the shader
stage.  The fixed size we choose for compute shaders is a bit higher
than strictly needed but there's no real harm in that.  The advantage is
that, if they do anything interesting with the value, NIR will see it as
an immediate and can optimize better.

Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand a026458020 nir/lower_subgroups: Lower ballot intrinsics to the specified bit size
Ballot intrinsics return a bitfield of subgroups.  In GLSL and some
SPIR-V extensions, they return a uint64_t.  In SPV_KHR_shader_ballot,
they return a uvec4.  Also, some back-ends would rather pass around
32-bit values because it's easier than messing with 64-bit all the time.
To solve this mess, we make nir_lower_subgroups take a new parameter
called ballot_bit_size and it lowers whichever thing it gets in from the
source language (uint64_t or uvec4) to a scalar with the specified
number of bits.  This replaces a chunk of the old lowering code.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 28da82f978 nir: Add a new subgroups lowering pass
This commit pulls nir_lower_read_invocations_to_scalar along with most
of the guts of nir_opt_intrinsics (which mostly does subgroup lowering)
into a new nir_lower_subgroups pass.  There are various other bits of
subgroup lowering that we're going to want to do so it makes a bit more
sense to keep it all together in one pass.  We also move it in i965 to
happen after nir_lower_system_values to ensure that because we want to
handle the subgroup mask system value intrinsics here.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 1ca3a94427 intel/fs: Don't use automatic exec size inference
The automatic exec size inference can accidentally mess things up if
we're not careful.  For instance, if we have

add(4)    g38.2<4>D    g38.1<8,2,4>D    g38.2<8,2,4>D

then the destination register will end up having a width of 2 with a
horizontal stride of 4 and a vertical stride of 8.  The EU emit code
sees the width of 2 and decides that we really wanted an exec size of 2
which doesn't do what we wanted.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand dc4cf11dfc intel/fs: Explicitly set EXECUTE_1 where needed
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand ab378734f5 intel/eu: Explicitly set EXECUTE_1 where needed
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 8280560705 intel/eu: Make automatic exec sizes a configurable option
We have had a feature in codegen for some time that tries to
automatically infer the execution size of an instruction from the width
of its destination.  For things such as fixed function GS, clipper, and
SF programs, this is very useful because they tend to have lots of
hand-rolled register setup and trying to specify the exec size all the
time would be prohibitive.  For things that come from a higher-level IR,
however, it's easier to just set the right size all the time and the
automatic exec sizes can, in fact, cause problems.  This commit makes it
optional while enabling it by default.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 7a82ad54bb intel/fs: Rework zero-length URB write handling
Originally we tried to handle this case based on slots_valid.  However,
there are a number of ways that this can go wrong.  For one, we throw
away any trailing slots which either aren't written or are set to
VARYING_SLOT_PAD.  Second, even if PSIZ is a valid slot, we may not
actually write anything there.  Between the lot of these, it was
possible to end up in a case where we tried to do a regular URB write
but ended up with a length of 1 which is invalid.  This commit moves it
to the end and makes it based on a new boolean flag urb_written.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand 6132992cdb intel/compiler/fs: Set up subgroup invocation as a system value
Subgroup invocation is computed using a vector immediate and some
dispatch-aware arithmetic.  Unfortunately, due to the vector arithmetic,
and the fact that it's frequently read 16-wide, it's not something that
can easily be CSEd by the back-end compiler.  There are a few different
possible approaches to this problem:

 1) Emit the code to calculate the subgroup invocation on-the-fly and
    trust NIR to do the CSE.  This is what we were doing.

 2) Add a back-end instruction for the subgroup ID.  This has the
    advantage of helping the back-end compiler with CSE but has the
    downside of very poor scheduling for the calculation because it has
    to be emitted in the back-end.

 3) Emit the calculation at the top of the program and re-use the
    result.  This gets rid of the CSE problem but comes at the cost of
    an extra live register.

This commit switches us from 1) to 3).  We choose to store the subgroup
invocation values as a W type to reduce the impact of the extra live
register.  Trusting NIR and using 1) was fine but we're soon going to
want to use the subgroup invocation value for other things in the
back-end compiler and this makes it much easier to do without having to
worry about CSE problems.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 295605c930 intel/cs: Push subgroup ID instead of base thread ID
We're going to want subgroup ID for SPIR-V subgroups eventually anyway.
We really only want to push one and calculate the other from it.  It
makes a bit more sense to push the subgroup ID because it's simpler to
calculate and because it's a real API thing.  The only advantage to
pushing the base thread ID is to avoid a single SHL in the shader.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 6411defdcd intel/cs: Re-run final NIR optimizations for each SIMD size
With the advent of SPIR-V subgroup operations, compute shaders will have
to be slightly different depending on the SIMD size at which they
execute.  In order to allow us to do dispatch-width specific things in
NIR, we re-run the final NIR stages for each sIMD width.

One side-effect of this change is that we start rallocing fs_visitors
which means we need DECLARE_RALLOC_CXX_OPERATORS.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 4e79a77cdc intel/compiler: Move the destructor from vec4_visitor to backend_shader
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 16ada419d7 i965/fs: Get rid of the early return in brw_compile_cs
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 80ddfab2f5 intel/cs: Rework the way thread local ID is handled
Previously, brw_nir_lower_intrinsics added the param and then emitted a
load_uniform intrinsic to load it directly.  This commit switches things
over to use a specific NIR intrinsic for the thread id.  The one thing I
don't like about this approach is that we have to copy thread_local_id
over to the new visitor in import_uniforms.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 25f7453c9e intel/fs: Mark 64-bit values as being contiguous
This isn't often a problem , when we're in a compute shader, we must
push the thread local ID so we decrement the amount of available push
space by 1 and it's no longer even and 64-bit data can, in theory, span
it.  By marking those uniforms contiguous, we ensure that they never get
split in half between push and pull constants.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand c4c8cba705 intel/cs: Ignore runtime_check_aads_emit for CS
It's only set on gen4-5 which clearly don't support compute shaders.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand d4de813d86 intel/cs: Stop setting dispatch_grf_start_reg
Nothing ever reads it for compute shaders because it's always 1.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand b1a9cdede4 intel/cs: Drop max_dispatch_width checks from compile_cs
The only things that adjust fs_visitor::max_dispatch_width are render
target writes which don't happen in compute shaders so they're
pointless.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 1077981eb5 intel/fs: Remove min_dispatch_width from fs_visitor
It's 8 for everything except compute shaders.  For compute shaders,
there's no need to duplicate the computation and it's just a possible
source of error.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand b299ded02e intel/fs: use pull constant locations to check for first compile of a shader
Before, we bailing in assign_constant_locations based on the minimum
dispatch size.  The more direct thing to do is simply to check for
whether or not we have constant locations and bail if we do.  For
nir_setup_uniforms, it's completely safe to do it multiple times because
we just copy a value from the NIR shader.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 103081c9a9 intel/fs: Retype dest to match value in read[First]Invocation
This is what we really wanted all along.  Always retyping to D works
because that's what get_nir_src() always gives us, at least for 32-bit
types.  The SPIR-V variants of these operations accept arbitrary types
and we need this if we're going to handle 64 or 16-bit values.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand ebaee9da4a intel/fs: Uniformize the index in readInvocation
The index is any value provided by the shader and this can be called in
non-uniform control flow so we can't just take component 0.  Found by
inspection.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand b67230de63 intel/fs: Protect opt_algebraic from OOB BROADCAST indices
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand aa4ff4b98c i965/fs/nir: Don't stomp 64-bit values to D in get_nir_src
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand ec8c6649f1 i965/fs/nir: Minor refactor of store_output
Stop retyping the output of shuffle_64bit_data_for_32bit_write.  It's
always BRW_REGISTER_TYPE_D which is perfectly fine for writing out.
Also, when we change get_nir_src to return something with a 64-bit type
for 64-bit values, the retyping will not be at all what we want.  Also,
retyping the output based on src.type before we whack it back to 32 bits
is a problem because the output is always 32 bits.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 030d2b5016 i965/fs: Return a fs_reg from shuffle_64bit_data_for_32bit_write
All callers of this function allocate a fs_reg expressly to pass into
it.  It's much easier if we just let the helper allocate the register.
While we're here, we switch it to doing the MOVs with an integer type so
that we don't accidentally canonicalize floats on half of a double.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 6197a6b7ac i965/fs/nir: Simplify 64-bit store_output
The swizzles weren't doing any good because swiz is just XYZW.  Also, we
were emitting an extra set of MOVs because shuffle_64bit_data_for_32bit
already does a MOV for us.  Finally, the temporary was only ever used
inside the inner loop so there's no need for it to actually be an array.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 18fde36ced intel/fs: Use the original destination region for int MUL lowering
Some hardware (CHV, BXT) have special restrictions on register regions
when doing integer multiplication.  We want to respect those when we
lower to DxW multiplication.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand d54f8ec744 intel/fs: Fix integer multiplication lowering for src/dst hazards
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand fd1bcccc2d intel/fs: Fix MOV_INDIRECT for 64-bit values on little-core
The same workaround we need for 64-bit values on little core also takes
care of the Ivy Bridge problem and does so a bit more efficiently so we
can drop that code while we're here.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand 6041a31e77 intel/eu: Fix broadcast instruction for 64-bit values on little-core
We're not using broadcast for any 32-bit types right now since we mostly
use it for emit_uniformize on 32-bit buffer indices.  However, SPIR-V
subgroups are going to need it for 64-bit so let's make it work.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 10e4feed39 intel/eu/reg: Add a subscript() helper
This is similar to the identically named fs_reg helper.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand 068beb41d8 intel/eu: Just modify the offset in brw_broadcast
This means we have to drop const from a variable but it also means that
100% of the code which deals with the offset limit is in one place.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand e3bcc86133 intel/compiler: Add some restrictions to MOV_INDIRECT and BROADCAST
These restrictions effectively already existed due to the way we use
indirect sources but weren't being directly enforced.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-11-07 10:37:52 -08:00
Jason Ekstrand 1b8ef49f48 intel/fs: Use a pair of 1-wide MOVs instead of SEL for any/all
For some reason, the any/all predicates don't work properly with SIMD32.
In particular, it appears that a SEL with a QtrCtrl of 2H doesn't read
the correct subset of the flag register and you end up getting garbage
in the second half.  Work around this by using a pair of 1-wide MOVs and
scattering the result.  This fixes the any/all instructions for SIMD32.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand 1f41663007 intel/fs: Use an explicit D type for vote any/all/eq intrinsics
The any/all intrinsics return a boolean value so D or UD is the correct
type.  Unfortunately, get_nir_dest has the annoying behavior of
returnning a float type by default.  This causes format conversion which
gives us -1.0f or 0.0f in the register.  If the consumer of the result
does an integer comparison to zero, it will give you the right boolean
value but if we do something more clever based on the 0/~0 assumption
for booleans, this will give the wrong value.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand 6c00240bc6 intel/fs: Don't stomp f0.1 in SIMD16 ballot
In fragment shaders f0.1 is used for discards so doing ballot after a
discard can potentially cause the discard to not happen.  However, we
don't support SIMD32 fragment shaders yet so this isn't a problem.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand def013a863 intel/fs: Use ANY/ALL32 predicates in SIMD32
We have ANY/ALL32 predicates and, for the most part, they work just
fine.  (See the next commit for more details.)  Also, due to the way
that flag registers are handled in hardware, instruction splitting is
able to split the CMP correctly.  Specifically, that hardware looks at
the execution group and knows to shift it's flag usage up correctly so a
2H instruction will write to f0.1 instead of f0.0.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand 0d905597fe intel/fs: Be more explicit about our placement of [un]zip
Before, we were careful to place the zip after the last of the split
instructions but did unzip on-demand.  This changes things so that the
unzips go before all of the split instructions and the unzip comes
explicitly after all the split instructions.  As a side-effect of this
change, we now emit the split instruction from highest SIMD group to
lowest instead of low to high.  We could have kept the old behavior, but
it shouldn't matter and this made the code easier.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand fcd4adb9d0 intel/fs: Pass builders instead of blocks into emit_[un]zip
This makes it far more explicit where we're inserting the instructions
rather than the magic "before and after" stuff that the emit_[un]zip
helpers did based on block and inst.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand e8c9e65185 intel/fs: Use a pure vertical stride for large register strides
Register strides higher than 4 are uncommon but they can happen.  For
instance, if you have a 64-bit extract_u8 operation, we turn that into
UB -> UQ MOV with a source stride of 8.  Our previous calculation would
try to generate a stride of <32;8,8>:ub which is invalid because the
maximum horizontal stride is 4.  To solve this problem, we instead use a
stride of <8;1,0>.  As noted in the comment, this does not work as a
destination but that's ok as very few things actually generate that
stride.

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-07 10:37:52 -08:00
Jason Ekstrand 172e8e42c4 intel/fs: Don't allocate a param array for zero push constants
Thanks to the ralloc invariant of "any pointer returned from ralloc can
be used as a context", calling ralloc_size with a size of zero will
cause it to allocate at least a header.  If we don't have any push
constants, then NULL is perfectly acceptable (and even preferred).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
2017-11-02 09:55:21 -07:00
Jason Ekstrand 7b4387519c intel/fs: Alloc pull constants off mem_ctx
It doesn't actually matter since the only user of push constants, i965,
ralloc_steals it back to NULL but it's more consistent and probably
fixes memory leaks in some error cases.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Cc: mesa-stable@lists.freedesktop.org
2017-11-02 09:55:21 -07:00
Jordan Justen f082d7f64f intel/compiler: Add functions to get prog_data and prog_key sizes for a stage
v2:
 * Return unsigned instead of size_t. (Ken)

Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-31 23:36:54 -07:00
Jordan Justen 05b1193361 intel/compiler: Add union types for prog_data and prog_key stages
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-10-31 23:36:54 -07:00
Jordan Justen 3dcbc5cdaa intel/compiler: Remove final_program_size from brw_compile_*
The caller can now use brw_stage_prog_data::program_size which is set
by the brw_compile_* functions.

Cc: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-10-31 23:36:54 -07:00
Carl Worth 540636045f intel/compiler: add new field for storing program size
This will be used by the on disk shader cache.

v2:
 * Set in brw_compile_* rather than brw_codegen_*. (Jason)

Signed-off-by: Timothy Arceri <timothy.arceri@collabora.com>
[jordan.l.justen@intel.com: Only add to brw_stage_prog_data]
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-10-31 23:36:54 -07:00
Eric Engestrom ceaad79f85 i965: remove unused variable
Fixes: 2c873060d3 "i965: Delete unused
       brw_vs_prog_data::nr_attributes field."
Cc: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Reviewed-by: Eduardo Lima Mitev <elima@igalia.com>
2017-10-30 16:32:05 +00:00
Ian Romanick 6403efbe74 glsl: Remove ir_binop_greater and ir_binop_lequal expressions
NIR does not have these instructions.  TGSI and Mesa IR both implement
them using < and >=, repsectively.  Removing them deletes a bunch of
code and means I don't have to add code to the SPIR-V generator for
them.

v2: Rebase on 2+ years of change... and fix a major bug added in the
rebase.

   text	   data	    bss	    dec	    hex	filename
8255291	 268856	 294072	8818219	 868e2b	32-bit i965_dri.so before
8254235	 268856	 294072	8817163	 868a0b	32-bit i965_dri.so after
7815339	 345592	 420592	8581523	 82f193	64-bit i965_dri.so before
7813995	 345560	 420592	8580147	 82ec33	64-bit i965_dri.so after

Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-10-30 09:27:09 -07:00
Tapani Pälli 446c5726ec i965: fix blorp stage_prog_data->param leak
Patch uses mem_ctx for allocation to ensure param array gets freed
later.

==6164== 48 bytes in 1 blocks are definitely lost in loss record 61 of 193
==6164==    at 0x4C2EB6B: malloc (vg_replace_malloc.c:299)
==6164==    by 0x12E31C6C: ralloc_size (ralloc.c:121)
==6164==    by 0x130189F1: fs_visitor::assign_constant_locations() (brw_fs.cpp:2095)
==6164==    by 0x13022D32: fs_visitor::optimize() (brw_fs.cpp:5715)
==6164==    by 0x13024D5A: fs_visitor::run_fs(bool, bool) (brw_fs.cpp:6229)
==6164==    by 0x1302549A: brw_compile_fs (brw_fs.cpp:6570)
==6164==    by 0x130C4B07: blorp_compile_fs (blorp.c:194)
==6164==    by 0x130D384B: blorp_params_get_clear_kernel (blorp_clear.c:79)
==6164==    by 0x130D3C56: blorp_fast_clear (blorp_clear.c:332)
==6164==    by 0x12EFA439: do_single_blorp_clear (brw_blorp.c:1261)
==6164==    by 0x12EFC4AF: brw_blorp_clear_color (brw_blorp.c:1326)
==6164==    by 0x12EFF72B: brw_clear (brw_clear.c:297)

Fixes: 8d90e28839 ("intel/compiler: Allocate pull_param in assign_constant_locations")
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable@lists.freedesktop.org
2017-10-30 08:19:37 +02:00
Kenneth Graunke d1b392d060 i965: Delete brw_wm_prog_key::drawable_height.
This has been unused since we switched to nir_lower_wpos_ytransform.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-10-29 20:52:02 -07:00
Topi Pohjolainen 97e01adfd5 intel/compiler/gen9: Pixel shader header only workaround
Fixes intermittent GPU hangs on Broxton with an Intel internal
test case.

There are plenty of similar fragment shaders in piglit that do
not use any varyings and any uniforms. According to the
documentation special timing is needed between pipeline stages.
Apparently we just don't hit that with piglit. Even with the
failing test case one doesn't always get the hang.

Moreover, according to the error states the hang happens
significantly later than the execution of the problematic shader.
There are multiple render cycles (primitive submissions) in between.
I've also seen error states where the ACTHD points outside the
batch. Almost as if the hardware writes somewhere that gets used
later on. That would also explain why piglit doesn't suffer from
this - most tests kick off one render cycle and any corruption
is left unseen.

v2 (Ken): Instead of enabling push constants, enable one of the
          inputs (PSIZ).
v3 (Ken, Jason): Use LAYER instead making vulkan emit_3dstate_sbe()
                 happy.

Cc: "17.3 17.2" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-10-28 10:07:29 +03:00
Kenneth Graunke 2c873060d3 i965: Delete unused brw_vs_prog_data::nr_attributes field.
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-10-27 02:53:38 -07:00
Kevin Rogovin 75d10e4c84 intel/compiler: brw_validate_instructions to take const void* instead of void*
The disassembler does not (and should not) be modifying the data.

Signed-off-by: Kevin Rogovin <kevin.rogovin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-26 10:43:48 -07:00
Jason Ekstrand d24311b7b5 intel/compiler: Call nir_lower_system_values in brw_preprocess_nir
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2017-10-25 16:14:09 -07:00
Jason Ekstrand 562b8d458c intel/eu: Use EXECUTE_1 for JMPI
The PRM says "The execution size must be 1."  In 73137997e2, the
execution size was set to 1 when it should have been BRW_EXECUTE_1
(which maps to 0).  Later, in dc2d3a7f5c, JMPI was used for
line AA on gen6 and earlier and we started manually stomping the
exeution size to BRW_EXECUTE_1 in the generator.  This commit fixes the
original bug and makes brw_JMPI just do the right thing.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Fixes: 73137997e2
2017-10-25 16:14:09 -07:00
Alejandro Piñeiro 4723933b8e i965/fs: Add brw_reg_type_from_bit_size utility method
Returns the brw_type for a given ssa.bit_size, and a reference type.
So if bit_size is 64, and the reference type is BRW_REGISTER_TYPE_F,
it returns BRW_REGISTER_TYPE_DF. The same applies if bit_size is 32
and reference type is BRW_REGISTER_TYPE_HF it returns BRW_REGISTER_TYPE_F

v2 (Jason Ekstrand):
 - Use better unreachable() messages
 - Add Q types

Signed-off-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>
Signed-off-by: Alejandro Piñeiro <apinheiro@igalia.com
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-10-25 16:14:09 -07:00
Jason Ekstrand 99778e7f9f i965/fs/nir: Use the nir_src_bit_size helper
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2017-10-25 16:14:09 -07:00
Jason Ekstrand fa6e74e33e intel/fs: Handle flag read/write aliasing in needs_src_copy
In order to implement the ballot intrinsic, we do a MOV from flag
register to some GRF.  If that GRF is used in a SEL, cmod propagation
helpfully changes it into a MOV from the flag register with a cmod.
This is perfectly valid but when lower_simd_width comes along, it simply
splits into two instructions which both have conditional modifiers.
This is a problem since we're reading the flag register.  This commit
makes us check whether or not flags_written() overlaps with the flag
values that we are reading via the instruction source and, if we have
any interference, will force us to emit a copy of the source.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
2017-10-25 16:14:09 -07:00
Jordan Justen b35e8c3b86 intel/nir: Zero local index const struct for valgrind & nir_serialize
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-10-25 12:36:21 -07:00
Rob Clark 2207af032b meson: extract out variable for nir_algebraic.py
Also needed in freedreno/ir3.

Signed-off-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
2017-10-24 15:33:40 -04:00
Eric Anholt e91c3540fc i965: Fix memmem compiler warnings.
gcc is throwing this warning in my meson build:

../src/intel/compiler/brw_eu_validate.c:50:11: warning
argument 1 null where non-null expected [-Wnonnull]
    return memmem(haystack.str, haystack.len,
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                  needle.str, needle.len) != NULL;
                                  ~~~~~~~~~~~~~~~~~~~~~~~

The first check for CONTAINS has a NULL error_msg.str and 0 len.  The
glibc implementation will exit without looking at any haystack bytes if
haystack.len < needle.len, so this was safe, but silence the warning
anyway by guarding against implementation variablility.

Fixes: 122ef3799d ("i965: Only insert error message if not already present")
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-10-24 10:51:18 -07:00
Matt Turner 9cd60fce9c i965/fs: Use align1 mode on ternary instructions on Gen10+
Align1 mode offers some nice features over align16, like access to more
data types and the ability to use a 16-bit immediate. This patch does
not start using any new features. It just emits ternary instructions in
align1 mode.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:17 -07:00
Matt Turner 8c16c9c677 i965: Add align1 ternary instruction emission support
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:17 -07:00
Matt Turner f11fa5ac6c i965: Add align1 ternary instruction disassembler support
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:17 -07:00
Matt Turner 6c7fc9b73a i965: Add align1 ternary instruction-word support
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:17 -07:00
Matt Turner 3b2c868848 i965: Add align1 ternary instruction support to conversion functions
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:17 -07:00
Matt Turner 281e8b8f27 i965: Add align1 ternary instruction field encodings
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:17 -07:00
Matt Turner 5f6ee55e68 i965: Add functions to abstract access to 3src register types
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:17 -07:00
Matt Turner e15dac319b i965: Rename brw_inst's functions that access the 3src register type
Put hw_ in the name so that it's clear these are the hardware encodings.

Similar to commit 9fb8323328 ("i965: Rename brw_inst's functions that
access the register type")

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:16 -07:00
Matt Turner e7f3b82e03 i965: Rename brw_inst 3src functions in preparation for align1
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:16 -07:00
Matt Turner ba50b538af i965: Print subreg in units of type-size on ternary instructions
The instruction word contains SubRegNum[4:2] so it's in units of dwords
(hence the * 4 to get it in terms of bytes). Before this patch, the
subreg would have been wrong for DF arguments.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:16 -07:00
Matt Turner 3f14150e9a i965: Add functions for brw_reg_type <-> hw 3src type
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:16 -07:00
Matt Turner 4c857d1f3b i965: Move brw_reg_type_is_floating_point to brw_reg_type.h
I'm going to call this from brw_inst.h, and I don't want to have to
include all of brw_reg.h.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-20 15:00:16 -07:00
Jason Ekstrand 59fb59ad54 nir: Get rid of nir_shader::stage
It's redundant with nir_shader::info::stage.

Acked-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2017-10-20 12:49:17 -07:00
Samuel Iglesias Gonsálvez 9e515cf381 i965/vec4: remove setting default LOD in the backend
It is already done in NIR.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2017-10-20 08:29:53 +02:00
Samuel Iglesias Gonsálvez c6d7d09bd0 i965/fs: remove setting default LOD in the backend
It is already done in NIR.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2017-10-20 08:29:53 +02:00
Kenneth Graunke 68f69ebdcc i965: Use is_scheduling_barrier instead of schedule_node::is_barrier.
Commit a73116ecc6 tried to make add_barrier_deps()
walk to the next barrier, and stop.  To accomplish that, it added an
is_barrier flag.  Unfortunately, this only works half of the time.

The issue is that add_barrier_deps() walks both backward (to the
previous barrier), and forward (to the next barrier).  It also sets
is_barrier.  Assuming that we're processing instructions in forward
order, this means that is_barrier will be set for previous instructions,
but not future ones.  So we'll never see it, and walk further than we
need to.

dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.23
now compiles its shaders in 3.6 seconds instead of 3.3 minutes.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Pallavi G <pallavi.g@intel.com>
2017-10-19 10:19:20 -07:00
Kenneth Graunke 3d112a7cd4 i965: Move fs_inst::has_side_effects()'s eot check to the parent class.
This eliminates a layer of wrapping, and makes a backend_instruction
sufficient.  The downside is that it exposes 'eot' to the vec4 backend,
which it doesn't need, but can basically happily ignore.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Tested-by: Pallavi G <pallavi.g@intel.com>
2017-10-19 10:19:20 -07:00
Jason Ekstrand 79d403417c intel/cs: Make thread_local_id a regular builtin param
This is a lot more natural than special casing it all over the place.
We still have to do a bit of special-casing in assign_constant_locations
but it's not special-cased quite as bad as it was before.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:31 -07:00
Jason Ekstrand 8d90e28839 intel/compiler: Allocate pull_param in assign_constant_locations
Now that everything is nicely ralloc'd, we can allocate the pull_param
array in assign_constant_locations instead of higher up.  We can also
re-allocate the param array so that it's exactly the needed size.  This
should save us some memory because we're not allocating the total needed
param space for both push and pull.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:31 -07:00
Jason Ekstrand 29737eac98 intel: Allocate prog_data::[pull_]param deeper inside the compiler
Now that we're always growing the param array as-needed, we can
allocate the param array in common code and stop repeating the
allocation everywere.  In order to keep things sane, we ralloc the
[pull_]param array off of the compile context and then steal it back
to a NULL context later.  This doesn't get us all the way to where
prog_data::[pull_]param is purely an out parameter of the back-end
compiler but it gets us a lot closer.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:31 -07:00
Jason Ekstrand 4dfb8b3416 intel/vs: Grow the param array for clip planes
Instead of requiring the caller of brw_compile_vs to figure it out, just
grow the param array on-demand.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:30 -07:00
Jason Ekstrand 6bcc5c0c75 intel/cs: Grow prog_data::param on-demand for thread_local_id_index
Instead of making the caller of brw_compile_cs add something to the
param array for thread_local_id_index, just add it on-demand in
brw_nir_intrinsics and grow the array.  This is now safe to do because
everyone is now using ralloc for prog_data::param.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:30 -07:00
Jason Ekstrand b1d1b7222a intel/compiler: Make brw_nir_lower_intrinsics compute-specific
It's already only ever called from brw_compile_cs and only handles
compute intrinsics.  Let's just make it CS-specific.  We can always
make it handle other stages again later if we want.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:30 -07:00
Jason Ekstrand 2db9470d88 intel/compiler: Add a helper for growing the prog_data::param array
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:30 -07:00
Jason Ekstrand 4efd079aba intel/compiler: Add a flag for pull constant support
The Vulkan driver does not support pull constants.  It simply limits
things such that we can always push everything.  Previously, we were
determining whether or not to push things based on whether or not the
prog_data::pull_param array is non-null.  This is rather hackish and
about to stop working.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:30 -07:00
Jason Ekstrand cfc7ed75eb i965: Store image_param in brw_context instead of prog_data
This burns an extra 10k of memory or so in the case where you don't have
any images.  However, if you have several shaders which use images, this
should be much less memory.  It also gets rid of a part of prog_data
that really has nothing to do with the compiler.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:30 -07:00
Jason Ekstrand 2975e4c56a intel: Rewrite the world of push/pull params
This moves us away to the array of pointers model and onto a model where
each param is represented by a generic uint32_t handle.  We reserve 2^16
of these handles for builtins that get generated by somewhere inside the
compiler and have well-defined meanings.  Generic params have handles
whose meanings are defined by the driver.

The primary downside to this new approach is that it moves a little bit
of the work that we would normally do at compile time to draw time.  On
my laptop this hurts OglBatch6 by no more than 1% and doesn't seem to
have any measurable affect on OglBatch7.  So, while this may come back
to bite us, it doesn't look too bad.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-12 22:39:29 -07:00
Kenneth Graunke 6f5abf3146 i965: Fix output register sizes when multiple variables share a slot.
ARB_enhanced_layouts allows multiple output variables to share the same
location - and these variables may not have the same sizes.  For
example, consider these output variables:

   // consume X/Y/Z components of 6 vectors
   layout(location = 0) out vec3 a[6];

   // consumes W component of the first vector
   layout(location = 0, component = 3) out float b;

Looking at the first declaration, we see that VARYING_SLOT_VAR0 needs 24
components worth of space (vec3 padded out to a vec4, 4 * 6 = 24).  But
looking at the second declaration, we would think that VARYING_SLOT_VAR0
needs only 4 components of space (a single float padded out to a vec4).

nir_setup_outputs() only considered the space requirements of the first
declaration it happened to see, so if 'float b' came first, it would
underallocate the output register space, causing brw_fs_validator.cpp
to assert fail about inst->dst.offset exceeding the register size.

Fixes Piglit's tests/spec/arb_enhanced_layouts/execution/component-layout/
vs-to-fs-array-interleave-single-location.shader_test.

Thanks to Tim Arceri for finding this bug and writing a test!

Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
2017-10-10 17:29:37 -07:00
Kenneth Graunke 03087686ff i965: Don't try to decode types for non-existent src1.
KHR-GL45.shader_ballot_tests.ShaderBallotBitmasks has a MOV that hits
this validation path.  MOVs don't have a src1 file, but calling
brw_inst_src1_type() was tripping on src1.file being BRW_IMMEDIATE_VALUE
and the hw_type being something invalid for immediates.

To work around this, just pretend src1 is src0 if there isn't a src1.

Fixes: 2572c2771d (i965: Validate "Special
       Requirements for Handling Double Precision Data Types")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102680
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
2017-10-10 15:11:35 -07:00
Iago Toral Quiroga 5ec21eb1a0 i965/tes: account for the fact that dvec3/4 inputs take two slots
When computing the total size of the URB for tessellation evaluation
inputs we were not accounting for this, and instead we were always
assuming that each input would take a single vec4 slot, which could
lead to computing a smaller read size than required. Specifically, this
is a problem when the last input is a dvec3/4 such that its XY components
are stored in the the second half of a payload register (which can happen
if the offset for the input in the URB is not 64-bit aligned because
there are 32-bit inputs mixed in) and the ZW components in the
first half of the next, as in this case we would fail to account for the
extra slot required for the ZW components.

Fixes (requires another fix in CTS currently in review):
KHR-GL45.enhanced_layouts.varying_locations
KHR-GL45.enhanced_layouts.varying_array_locations

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-10 08:59:54 +02:00
Jason Ekstrand 7463d50580 intel/compiler: Don't propagate cmod into integer multiplies
No shader-db change on Sky Lake.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
2017-10-05 11:54:49 -07:00
Jason Ekstrand b91ecee04a intel/compiler: Don't cmod propagate into a saturated operation
Shader-db results on Sky Lake:

    total instructions in shared programs: 12954445 -> 12955125 (0.01%)
    instructions in affected programs: 141862 -> 142542 (0.48%)
    helped: 0
    HURT: 626

Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
2017-10-05 11:54:49 -07:00
Matt Turner 2572c2771d i965: Validate "Special Requirements for Handling Double Precision Data Types"
I did not implement:

   CNL's restriction on 64-bit int + align16, because I don't think
   we'll ever use this combination regardless of hardware generation.

   The restriction on immediate DF -> F conversions, because there's no
   reason to ever generate that, and I don't even know how DF -> F
   conversions are supposed to work in Align16 since (1) the dst stride
   must be 1, but (2) the dst stride would have to be 2 for src and dst
   strides to be aligned.
2017-10-04 14:08:54 -07:00
Matt Turner 98298c7e3d i965: Fix and enable forgotten validation test
I seem to have forgotten I still had work to do.
2017-10-04 14:08:54 -07:00
Matt Turner 122ef3799d i965: Only insert error message if not already present
Some restrictions require something like strides to match between src
and dest. For multi-source instructions, I'd rather encapsulate the
logic for not inserting already present errors in ERROR_IF than
open-coding it multiple places.
2017-10-04 14:08:54 -07:00
Matt Turner 5e76cf153c i965: Avoid validation error when src1 is not present
There can be no violation of the restriction that source offsets are
aligned if there is only one source offset.
2017-10-04 14:08:54 -07:00
Matt Turner cacc229ba0 i965: Remove validate_reg()
Replaced by the assembly validator, and in fact gets in the way of
writing tests for the assembly validator.
2017-10-04 14:08:54 -07:00
Matt Turner 678d88bcee i965: Add and use STRIDE and WIDTH macros
You'll notice there were bugs in some of the code being replaced.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-10-04 14:08:54 -07:00
Matt Turner 1fcdb1cbea i965: Add GLK, CFL, CNL to test_eu_validate.c 2017-10-04 14:08:54 -07:00
Matt Turner 6db5ec7deb i965: Fix support for disassembling 64-bit integer immediates
The type suffixes were wrong, and the 16 was missing the 0 prefix.

Fixes: 92f787ff86 ("i965: Add support for disassembling 64-bit integer immediates")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-10-04 14:08:54 -07:00
Matt Turner 7e88f93469 i965/fs: Rewrite fsign64 to skip the float -> double conversion
... without the float -> double conversion. Low power parts have
additional restrictions when it comes to operating on 64-bit types, and
the instruction used to do the conversion violates one of them:
specifically, the restriction that "Source and Destination horizontal
stride must be aligned to the same qword".

Previously we generated a float and then converted, but we can avoid the
conversion by using the same extract-the-sign-bit + or-in-1.0 algorithm
by directly operating on the high four bytes of each double-precision
component in the result.

In SIMD8 and SIMD16 this cuts one instruction from the implementation,
and more importantly that instruction is the one which violated the
regioning restriction.

Along the way I removed some comments that I did not think helped, and
some code about double comparisons which does not seem to be necessary
today.

This prevents validation failures caught by the new EU validation code
added in later patches.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-10-04 14:08:54 -07:00
Matt Turner b541945c20 i965/fs: Unpack count argument to 64-bit shift ops on Atom
64-bit operations on Atom parts have additional restrictions over their
big-core counterparts (validated by later patches).

Specifically, the restriction that "Source and Destination horizontal
stride must be aligned to the same qword" is violated by most shift
operations since NIR uses a 32-bit value as the shift count argument,
and this causes instructions like

   shl(8)          g19<1>Q         g5<4,4,1>Q      g23<4,4,1>UD

where src1 has a 32-bit stride, but the dest and src0 have a 64-bit
stride.

This caused ~4 pixels in the ARB_shader_ballot piglit test
fs-readInvocation-uint.shader_test to be incorrect. Unfortunately no
ARB_gpu_shader_int64 test hit this case because they operate on
uniforms, and their scalar regions are an exception to the restriction.

We work around this by effectively unpacking the shift count, so that we
can read it with a 64-bit stride in the shift instruction. Unfortunately
the unpack (a MOV with a dst stride of 2) is a partial write, and cannot
be copy-propagated or CSE'd.

Bugzilla: https://bugs.freedesktop.org/101984
2017-10-04 14:08:54 -07:00
Matt Turner 2082c32950 i965/fs: Don't apply POW/FDIV workaround on Gen10+
The documentation says it applies only to Gens 8 and 9.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-10-04 14:08:37 -07:00
Matt Turner d407935327 i965: Fix src0 vs src1 typo
A typo caused us to copy src0's reg file to src1 rather than reading
src1's as intended. This caused us to fail to compact instructions like

   mov(8)   g4<1>D    0D              { align1 1Q };

because src1 was set to immediate rather than architecture file. Fixing
this reenables compaction (after the precompact() pass changes the data
types):

   mov(8)   g4<1>UD   0x00000000UD    { align1 1Q compacted };

Fixes: 1cb0a7941b ("i965: Switch to using the logical register types")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-04 14:08:24 -07:00
Lionel Landwerlin d3acc240d0 intel: compiler: vec4: add missing default 0 lod
We set a similar default value for LOD in the fs backend for TXS/TXL.
Without this we end up generating invalid MOV with a null src.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: "17.2 17.1" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-10-03 22:50:46 +01:00
Dylan Baker 7a5a986ddd meson: convert gtest to an internal dependency
In truth gtest is an external dependency that upstream expects you to
"vendor" into your own tree. As such, it makes sense to treat it more
like a dependency than an internal library, and collect it's
requirements together in a dependency object.

v2: - include with -isystem instead of setting compiler args (Eric)

Signed-off-by: Dylan Baker <dylanx.c.baker@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
2017-10-03 10:02:08 -07:00
Iago Toral Quiroga 5e584a9db7 i965: skip reading unused slots at the begining of the URB for the FS
We can start reading the URB at the first offset that contains varyings
that are actually read in the URB. We still need to make sure that we
read at least one varying to honor hardware requirements.

This helps alleviate a problem introduced with 99df02ca26 for
separate shader objects: without separate shader objects we assign
locations sequentially, however, since that commit we have changed the
method for SSO so that the VUE slot assigned depends on the number of
builtin slots plus the location assigned to the varying. This fixed
layout is intended to help SSO programs by avoiding on-the-fly recompiles
when swapping out shaders, however, it also means that if a varying uses
a large location number close to the maximum allowed by the SF/FS units
(31), then the offset introduced by the number of builtin slots can push
the location outside the range and trigger an assertion.

This problem is affecting at least the following CTS tests for
enhanced layouts:

KHR-GL45.enhanced_layouts.varying_array_components
KHR-GL45.enhanced_layouts.varying_array_locations
KHR-GL45.enhanced_layouts.varying_components
KHR-GL45.enhanced_layouts.varying_locations

which use SSO and the the location layout qualifier to select such
location numbers explicitly.

This change helps these tests because for SSO we always have to include
things such as VARYING_SLOT_CLIP_DIST{0,1} even if the fragment shader is
very unlikely to read them, so by doing this we free builtin slots from
the fixed VUE layout and we avoid the tests to crash in this scenario.

Of course, this is not a proper fix, we'd still run into problems if someone
tries to use an explicit max location and read gl_ViewportIndex, gl_LayerID or
gl_CullDistancein in the FS, but that would be a much less common bug and we
can probably wait to see if anyone actually runs into that situation in a real
world scenario before making the decision that more aggresive changes are
required to support this without reverting 99df02ca26.

v2:
- Add a debug message when we skip clip distances (Ilia)
- we also need to account for this when we compute the urb setup
  for the fragment shader stage, so add a compiler util to compute
  the first slot that we need to read from the URB instead of
  replicating the logic in both places.

v3:
- Make the util more generic so it can account for all unused slots
  at the beginning of the URB, that will make it more useful (Ken).
- Drop the debug message, it was not what Ilia was asking for.

Suggested-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-10-02 08:27:13 +02:00
Matt Turner 3cfd6ad01c i965: Normalize types for FBL, FBH, etc
Allows the instructions to be compacted. The documentation claims that
some of these only accept UD types, even though the type doesn't change
the operation performed. Just normalize the types to ensure we get
instruction compaction.

The only functional changes are for FBL and CBIT (always use UD types)
and FBH (always use the same types).

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-09-30 20:18:09 -07:00
Iago Toral Quiroga 47e527bd81 i965/fs: force pull model for 64-bit GS inputs
Triggering the push model when 64-bit inputs are involved is not easy due to
the constrains on the maximum number of registers that we allow for this mode,
however, for GS with 'points' primitive type and just a couple of double
varyings we can trigger this and it just doesn't work because the
implementation is not 64-bit aware at all. For now, let's make sure that we
don't attempt this model whith 64-bit inputs and we always fall back to pull
model for them.

Also, don't enable the VUE handles in the thread payload on the fly when we
find an input for which we need the pull model, this is not safe: if we need
to resort to the pull model we need to account for that when we setup the
thread payload so we compute the first non-payload register properly. If we
didn't do that correctly and we enable it on-the-fly here then we will end up
VUE handles on the first non-payload register which will probably lead to
GPU hangs. Instead, always enable the VUE handles for the pull model so we
can safely use them when needed. The GS is going to resort to pull model
almost in every situation anyway, so this shouldn't make a significant
difference and it makes things easier and safer.

v2: Always enable the VUE handles for pull model, this is easier and safer
    and the GS is going to fallback to pull model almost always anyway (Ken)

v3: Only clamp the URB read length if we are over the maximum reserved for
    push inputs as we were doing in the original code (Ken).

v4: No need to clamp the urb read length if invocations > 1

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-09-29 08:18:25 +02:00
Dylan Baker d1992255bb meson: Add build Intel "anv" vulkan driver
This allows building and installing the Intel "anv" Vulkan driver using
meson and ninja, the driver has been tested against the CTS and has
seems to pass the same series of tests (they both segfault when the CTS
tries to run wayland wsi tests).

There are still a mess of TODO, XXX, and FIXME comments in here. Those
are mostly for meson bugs I'm trying to fix, or for additional things to
implement for other drivers/features.

I have configured all intermediate libraries and optional tools to not
build by default, meaning they will only be built if they're pulled in
as a dependency of a target that will actually be installed) this allows
us to avoid massive if chains, while ensuring that only the bits that
need to be built are.

v2: - enable anv, x11, and wayland by default
    - add configure option to disable valgrind
v3: - fix typo in meson_options (Nicholas)
v4: - Remove dead code (Eric)
    - Remove change to generator that was from v0 (Eric)
    - replace if chain with loop (Eric)
    - Fix typos (Eric)
    - define HAVE_DLOPEN for both libdl and builtin dl cases (Eric)
v5: - rebase on util string buffer implementation

Signed-off-by: Dylan Baker <dylanx.c.baker@intel.com>
Reviewed-by: Eric Anholt <eric@anholt.net> (v4)
2017-09-27 09:12:19 -07:00
Dylan Baker 848da66222 intel: use a flag instead of setting PYTHONPATH
Meson doesn't allow setting environment variables for custom targets, so
we either need to not pass this as an environment variable or use a
shell script to wrap the invocation. The chosen solution has the
advantage of working for both autotools and meson.

v2: - put rules back in top scope (Ken)

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Dylan Baker <dylanx.c.baker@intel.com>
2017-09-27 09:07:28 -07:00
Kenneth Graunke a553eb0fdf i965: Support copy propagating of untyped atomic surface indexes.
In the vec4 backend, SHADER_OPCODE_UNTYPED_ATOMIC's src[1] is the
surface index.  We want to copy propagate so we can use an immediate
message descriptor, rather than an indirect send.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
2017-09-26 15:35:14 -07:00
Kenneth Graunke 66342c997f i965/vec4: Fix swizzles on atomic sources.
Atomic operation sources are scalar values, but we were failing to
select the .x component of the second operand.  For example,

   atomicCounterCompSwapARB(counter, 5u, 10u)

would generate

   mov(8) vgrf4.x:D, 5D
   mov(8) vgrf5.x:D, 10D

   mov(8) vgrf9.x:UD, vgrf4.xyzw:D
   mov(8) vgrf9.y:UD, vgrf5.xyzw:D

which wrongly selects the .y component of vgrf5, so the actual 10u value
would get dead code eliminated.  The swizzle works for the other source,
but both of them ought to be .xxxx.

Fixes the compare and swap CTS tests in:
KHR-GL45.shader_atomic_counter_ops_tests.ShaderAtomicCounterOpsExchangeTestCase

Cc: "17.2 17.1 17.0 13.0" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
2017-09-26 15:35:11 -07:00
Kenneth Graunke a62fe34098 i965/vec4: Actually handle atomic op intrinsics.
Embarassingly, someone enabled the ARB_shader_atomic_counter_ops
extension for Gen7+ but never added the intrinsics to the switch
statement in the vec4 backend, so they just hit an unreachable()
call and died.

Fixes: 40dd45d0c6 (i965: Enable ARB_shader_atomic_counter_ops)
Cc: "17.2 17.1 17.0 13.0" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
2017-09-26 15:35:06 -07:00
Timothy Arceri 49e4248a93 i965/nir: export nir_optimize
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Eduardo Lima Mitev <elima@igalia.com>
2017-09-26 22:37:02 +10:00
Kenneth Graunke c9fbe772ba i965: Handle unwritten PSIZ/VIEWPORT/LAYER outputs in vec4 shaders.
This can occur if the shader is capturing some of the values from the
VUE header for transform feedback, but the shader hasn't written all of
them.

Reviewed-by: Juan A. Suarez Romero <jasuarez@igalia.com>
2017-09-21 09:39:27 -07:00
Jason Ekstrand d496780fb2 intel/eu/validate: Look up types on demand in execution_type()
We are looking up the execution type prior to checking how many sources
we have.  This leads to looking for a type for src1 on MOV instructions
which is bogus.  On BDW+, the src1 register type overlaps with the
64-bit immediate and causes us problems.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
2017-09-12 15:01:00 -07:00
Matt Turner dff75c7175 i965: Drop unnecessary conditional
Clang doesn't realize that 0 and 1 are the only possibilities, a thinks
lots of variables might be uninitialized.

Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Eric Engestrom <eric.engestrom@imgtec.com>
2017-08-29 15:20:57 -07:00
Topi Pohjolainen 5dd072380a intel/compiler: Cast reg types explicitly
Makes coverity happier.

CID: 1416799
Fixes: c1ac1a3d25 (i965: Add a brw_hw_type_to_reg_type() function)

Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-08-28 14:43:39 +03:00
Jason Ekstrand 95f533d922 anv,i965: Move CS shared lowering into anv
Right now, OpenGL uses the GLSL lowering for shared variables and anv
uses NIR to lower them.  For a long time, we've done this weird thing
where we do the NIR lowering unconditionally and then add the SLM sizes
from the two together.  This works because one of them will always be 0
but it's a bit sketchy.  Let's just move the NIR-based lowering into
anv_pipeline and get rid of the sketch.

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2017-08-24 16:34:29 -07:00
Kenneth Graunke 4ffa9f3635 i965: Stop using wm_prog_data->binding_table.render_target_start.
Render target surfaces always start at binding table index 0.
This is required for us to use headerless FB writes, which we
really want to do.  So, we'll never change that.

Given that, it's not necessary to look up a wm_prog_data field
which we already know contains 0.  We can drop the dependency in
brw_renderbuffer_surfaces (Gen4-5)...which was already confusingly
missing from gen6_renderbuffer_surfaces.

Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-08-23 11:55:17 -07:00
Kenneth Graunke 274afad4cd i965: Add a brw_wm_prog_data::has_render_target_reads field.
State upload code should use prog_data rather than poking at shader_info
directly.

Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-08-23 11:55:17 -07:00
Matt Turner d37d9f84ac i965: Mark functions static
Cuts 300 bytes of .text

Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2017-08-21 14:45:44 -07:00
Matt Turner f30902629c i965/vec4: Use 'class' src_reg, rather than 'struct' src_reg
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
2017-08-21 14:45:44 -07:00
Matt Turner a77d5b28ac i965/vec4: Return float from spill_cost_for_type()
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-08-21 14:45:44 -07:00
Matt Turner a98b1a8922 i965: Optimize reading the destination type
brw_hw_type_to_reg_type() needs to know only whether the file is
BRW_IMMEDIATE_VALUE or not, which is not a valid file for the
destination. gcc and clang will evaluate __builtin_strcmp() at compile
time, so we can use it to pass a constant file for the destination.

   text	   data	    bss	    dec	    hex	filename
7816214	 346248	 420496	8582958	 82f72e	i965_dri.so before
7816070	 346248	 420496	8582814	 82f69e	i965_dri.so after

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 91ef949054 i965: Mark brw_hw_type_to_reg_type() as a pure function
text	   data	    bss	    dec	    hex	filename
7816886	 346248	 420496	8583630	 82f9ce	i965_dri.so before
7816214	 346248	 420496	8582958	 82f72e	i965_dri.so after

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner e07fe89035 i965: Hide the register type hardware encodings
So we stop mixing them with the logical enum.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 4fab67a441 i965: Stop using hardware register types directly
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner c746f1c888 i965: Add brw_hw_reg_type_to_letters() and use it in brw_disasm.c
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 6a2471b501 i965: Move brw_reg_type_letters() as well
And add "to_" to the name for consistency with the other functions in
this file.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 1cb0a7941b i965: Switch to using the logical register types
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner cb2cd462b1 i965: Add functions to abstract access to register types
Previously the brw_inst{,_set}_{dst,src0,src1}_reg_type() functions
provided access to the hardware encodings for the register types. We
often mixed these with the logical BRW_REGISTER_TYPE_* enums (which
themselves used to be the hardware format!) with bad results.

With that functionality now available with the hw_ versions (see
previous commit), we now add functions that take the logical
BRW_REGISTER_TYPE_* enums and convert into the hardware format and vice
versa. To do the conversion we also have to provide the file.

Note the asymmetry between the two functions: the new getter reads the
file from the instruction word, and to ensure that is always set the
setter writes both the file and the type.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 9fb8323328 i965: Rename brw_inst's functions that access the register type
Put hw_ in the name so that it's clear these are the hardware encodings.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 3e379af492 i965: Index brw_hw_reg_type_to_size()'s table by logical type
I'll be transitioning everything to use the logical types.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner c1ac1a3d25 i965: Add a brw_hw_type_to_reg_type() function
Will be used in later commits.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner dbe7dd13dd i965: Use a common table to translate logical to hardware types
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner bfcc9aa829 i965: Extract functions dealing with register types to separate file
I'm going to encapsulate all of the logic dealing with register types in
this file.

Rename the parameters for the hardware encodings from type -> hw_type at
the same time.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 890f863da0 i965: Reverse file/type arguments to register type functions
I think of the initial arguments as "state" and the last as the actual
subject.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 92f787ff86 i965: Add support for disassembling 64-bit integer immediates
After the last patch converted things into enums, I helpfully got a
compiler warning about these missing from the switch statement.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner deae25ce37 i965: Use separate enums for register vs immediate types
The hardware encodings often mean different things depending on whether
the source is an immediate.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 8815b9677f i965: Reorder brw_reg_type enum values
These vaguely corresponded to the hardware encodings, but that is purely
historical at this point. Reorder them so we stop making things "almost
work" when mixing enums.

The ordering has been closen so that no enum value is the same as a
compatible hardware encoding.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner ce6b8627d8 i965: Validate destination restrictions with vector immediates
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 1d79c828d8 i965: Don't let raw-move check be tricked by immediate vector types
UB and B type encodings are the same as UV and VF. Noticed when writing
the following patch.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 48aa6ecb87 i965: Only change type of 0.0f to VF if destination stride == 1
The destination stride must be equivalent to a dword if VF is used.

Also, since the only compaction table entires with "i:vf" have the
destination as "r:f" specifically check that the destination is of type
float.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 56a676eed2 i965: Remove CONT/BREAK from instruction compaction test
These cannot be compacted. A similar mistake was fixed in commit
90eaf01616

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 3d661e6062 i965: Test instruction compaction on all supported Gens
Note that there's no point in testing on G45, since its compaction is
the same as Gen5. Same logic applies to Gen7 variants and low-power
parts.

Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 9ff7d9b853 i965: Silence signed/unsigned comparison warning
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner eac89911e5 i965: Move compaction "prepass" into brw_eu_compact.c
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Matt Turner 17641f6388 i965: Mark src inst pointer const in compaction code
Reviewed-by: Scott D Phillips <scott.d.phillips@intel.com>
2017-08-21 14:05:23 -07:00
Iago Toral Quiroga 81615ad444 intel/compiler: properly size attribute wa_flags array for Vulkan
Mesa will map user defined vertex input attributes to slots
starting at VERT_ATTRIB_GENERIC0 which gives us room for only 16
slots (up to GL_VERT_ATTRIB_MAX). This sufficient for GL, where
we expose exactly 16 vertex attributes for user defined inputs, but
in Vulkan we can expose up to 28 (which are also mapped from
VERT_ATTRIB_GENERIC0 onwards) so we need to account for this when
we scope the size of the array of attribute workaround flags
that is used during the brw_vertex_workarounds NIR pass. This
prevents out-of-bounds accesses in that array for NIR shaders
that use more than 16 vertex input attributes.

Fixes:
dEQP-VK.pipeline.vertex_input.max_attributes.*

Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2017-08-11 10:41:44 +02:00
Dave Airlie 271fa3a684 intel/vec4/gs: reset nr_pull_param if DUAL_INSTANCED compile failed.
If dual object compile fails (as seems to happen with virgl a
fair bit, and does piglit even have any tests for it?), we end up
not restarting the pull params, so we call
vec4_visitor::move_uniform_array_access_to_pull_constant
a second time and it runs over the ends of the alloc.

Fixes: tests/spec/glsl-1.50/execution/geometry/max-input-components.shader_test
running inside virgl on ivybridge.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Cc: <mesa-stable@lists.freedesktop.org>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2017-08-03 16:54:08 +10:00
Matt Turner 858f554078 i965: Fix indentation 2017-08-02 16:49:32 -07:00
Kenneth Graunke 30d6bc470a i965: Set lower_vote_trivial in vector_nir_options_gen6 too.
There's a second struct for Gen6+.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-07-21 18:09:01 -07:00
Matt Turner 069bf7c907 i965/fs: Match destination type to size for ballot
No use in taking a 64-bit value when we know the high 32-bits are zero.
2017-07-20 16:56:50 -07:00
Matt Turner 1038d385a9 nir: Reduce destination size of ballot intrinsic when possible
Some hardware, like i965, doesn't support group sizes greater than 32.
In that case, we can reduce the destination size of the ballot
intrinsic, which will simplify our code generation.

Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-07-20 16:56:49 -07:00
Matt Turner 782ef30451 i965/fs: Implement ARB_shader_ballot operations
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-07-20 16:56:49 -07:00
Matt Turner 8238930510 i965/fs: Do not move MOVs writing the flag outside of control flow
The implementation of ballotARB() will start by zeroing the flags
register. So, a doing something like

        if (gl_SubGroupInvocationARB % 2u == 0u) {
                ... = ballotARB(true);
		[...]
        } else {
                ... = ballotARB(true);
		[...]
	}

(like fs-ballot-if-else.shader_test does) would generate identical MOVs
to the same destination (the flag register!), and we definitely do not
want to pull that out of the control flow.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-07-20 16:56:49 -07:00
Francisco Jerez f1b7c47913 i965/fs: Handle explicit flag sources in flags_read()
The implementations of the ARB_shader_ballot intrinsics will explicitly
read the flag as a source register.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-07-20 16:56:49 -07:00
Matt Turner 43ef75b394 nir: Add system values from ARB_shader_ballot
We already had a channel_num system value, which I'm renaming to
subgroup_invocation to match the rest of the new system values.

Note that while ballotARB(true) will return zeros in the high 32-bits on
systems where gl_SubGroupSizeARB <= 32, the gl_SubGroup??MaskARB
variables do not consider whether channels are enabled. See issue (1) of
ARB_shader_ballot.

Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-07-20 16:56:49 -07:00
Matt Turner ee9fa4ac18 i965/fs: Implement ARB_shader_group_vote operations
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-07-20 16:56:49 -07:00
Francisco Jerez 93dc736f4e i965/fs: Handle explicit flag destinations in flags_written()
The implementations of the ARB_shader_group_vote intrinsics will
explicitly write the flag as the destination register.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-07-20 16:56:49 -07:00
Matt Turner 30b72f4126 i965/vec4: Lower ARB_shader_group_vote intrinsics
I don't expect anyone is going to care about using this in vec4 programs
(vertex/tessellation/geometry on Gen6/7), no one has come up with a good
way to implement it much less test it.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-07-20 16:56:49 -07:00
Matt Turner d4c9d6a3b2 nir: Add pass to optimize intrinsics
Specifically, constant fold intrinsics from ARB_shader_group_vote, but I
suspect it'll be useful for other things in the future.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-07-20 16:56:49 -07:00
Kenneth Graunke b2da123801 i965: Use pushed UBO data in the scalar backend.
This actually takes advantage of the newly pushed UBO data, avoiding
pull loads.

Improves performance in GLBenchmark Manhattan 3.1 by:

   HSW: ~1%, BDW/SKL/KBL GT2: 3-4%, SKL GT4: 7-8%, APL: 4-5%.
   (thanks to Eero Tamminen for these numbers)

shader-db results on Skylake, ignoring programs with spill/fill changes:

   total instructions in shared programs: 13963994 -> 13651893 (-2.24%)
   instructions in affected programs: 4250328 -> 3938227 (-7.34%)
   helped: 28527
   HURT: 0

   total cycles in shared programs: 179808608 -> 172535170 (-4.05%)
   cycles in affected programs: 79720410 -> 72446972 (-9.12%)
   helped: 26951
   HURT: 1248

   LOST:   46
   GAINED: 21

Many "Deus Ex: Mankind Divided" shaders which already spilled end up
spill a lot more (about 240 programs hurt, 9 helped).  The cycle
estimator suggests this is still overall a win (-0.23% in cycle counts)
presumably because we trade pull loads for fills.

v2: Drop "PULL" environment variable left in for initial debugging
    (caught by Matt).

Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-07-13 20:18:54 -07:00
Kenneth Graunke c9ef27e77b i965: Factor out push locations.
With UBOs, the answer of "have we decided to push this uniform" gets
a bit more complicated - for one, we have multiple surfaces.  This
patch refactors things so we can add the new code in a single place.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-07-13 20:18:54 -07:00
Kenneth Graunke 4f586cd8f1 i965: Push UBO data, but don't use it just yet.
This patch starts uploading UBO data via 3DSTATE_CONSTANT_* packets,
and updates the compiler to know that there's extra payload data, so
things continue working.  However, it still issues pull loads for all
data.  I wanted to separate the two aspects for greater bisectability.

v2: Update for new intel_bufferobj_buffer parameter.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-07-13 20:18:30 -07:00
Kenneth Graunke 6d28c6e52c i965: Select ranges of UBO data to be uploaded as push constants.
This adds a NIR pass that decides which portions of UBOS we should
upload as push constants, rather than pull constants.

v2: Switch to uint16_t for the UBO block number, because we may
    have a lot of them in Vulkan (suggested by Jason).  Add more
    comments about bitfield trickery (requested by Matt).

v3: Skip vec4 stages for now...I haven't finished wiring up support
    in the vec4 backend, and so pushing the data but not using it
    will just be wasteful.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-07-13 19:56:49 -07:00
Kenneth Graunke 8ec5a4e4a4 i965: Switch to absolute addressing for constant buffer 0.
By default, 3DSTATE_CONSTANT_* Constant Buffer 0 is relative to dynamic
state base address.  This makes it unusable for pushing UBOs.  I'd like
to be able to use all four push buffers.

There is a bit in the INSTPM register (or CS_DEBUG_MODE2 on Skylake)
which controls whether buffer 0 is relative to dynamic state base
address, or simply a normal pointer.  Setting that gives us full
flexibility.

We can't currently write this on Haswell and earlier, and will need
to update the kernel command parser, and then do the whole version
checking song and dance.

Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-07-13 19:56:49 -07:00
Lionel Landwerlin 226fae7849 intel/compiler: no need to check unsigned is >= 0
CID: 1338342
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
2017-07-13 22:50:45 +01:00
Lionel Landwerlin 95c917668c intel/compiler: don't check unsigned is >= 0
CID: 1224468
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
2017-07-13 22:50:38 +01:00
Lionel Landwerlin a25a533458 intel/compiler: remove check unsigned is >= 0
By definition unsigned are always >= 0.

CID: 742212
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
2017-07-13 22:50:29 +01:00
Anuj Phogat 0a56c5f3f1 intel/compiler: Don't use opt_sampler_eot() optimization on gen10+
This optimization has been removed on gen10+.

Signed-off-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
2017-07-12 11:27:31 -07:00
Johnson Lin 165e704719 i965/i915: Add UYVY as the supported format
Trigger the correct sampler options for it. Similar with YUYV

Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
2017-06-30 10:16:26 +01:00
Lionel Landwerlin 030abc6109 intel: compiler/i965: fix is_broxton checks
In 5f2fe9302c is_geminilake was introduced for the differenciate
broxton from geminilake. Unfortunately I failed as verifying that
is_broxton is throughout the code base to mean Gen9lp.

Fixes: 5f2fe9302c ("intel: common: add flag to identify platforms by name")
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-06-20 23:26:42 +01:00
Anuj Phogat f9e31a26d4 i965/cnl: Make URB {VS, GS, HS, DS} sizes non multiple of 3
v1: By Ben Widawsky <benjamin.widawsky@intel.com>
v2: v1 had an assert only for VS. Add the restriction for GS, HS and
    DS as well and make sure the allocated sizes are not multiple of 3.
v3: Move the entry_size checks in to compiler code (Ken)

Signed-off-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-06-09 16:02:59 -07:00
Anuj Phogat 111881abac i965/cnl: Handle gen10 in switch cases across the driver
V2: Start using gen10 functions isl_gen10*(), gen10_blorp_exec()
    gen10_init_atoms() (Jason)
    Remove Vulkan changes. Do them later in a separate patch.

Signed-off-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-06-09 16:02:59 -07:00
Anuj Phogat 30e749c8f1 i965/cnl: Update few assertions
Signed-off-by: Anuj Phogat <anuj.phogat@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-06-09 16:02:59 -07:00
Eric Engestrom 63a8a88ac4 tree-wide: remove trailing backslash
Simple search for a backslash followed by two newlines.
If one of the newlines were to be removed, this would cause issues, so
let's just remove these trailing backslashes.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
2017-06-07 01:18:09 +01:00
Kenneth Graunke 9cd69022d5 i965: Change INTEL_DEBUG=vec4 to INTEL_SCALAR_VS for consistency.
We moved to INTEL_SCALAR_* when we added more than a single stage, but
never went back and converted the VS to work that way.  Be consistent.

Also update the documentation to actually mention these debug variables.

Acked-by: Jason Ekstrand <jason@jlekstrand.net>
2017-06-05 23:32:40 -07:00
Kenneth Graunke fe14a9a501 i965: Drop duplicate shadow variable.
We already initialized this at the top of the function.

Trivial.
2017-06-01 14:28:12 -07:00
Kenneth Graunke 65f5f3c85c i965: Move SOL PSIZ hacks from draw time to link time.
We can just update the gl_transform_feedback_info fields at link time
to make the VUE header fields have the right location and component.
Then we don't need to handle them specially at draw time, which is
expensive.

Reviewed-by: Rafael Antognolli <rafael.antognolli@intel.com>
2017-06-01 00:08:29 -07:00
Kenneth Graunke 1e3880544e i965: Ignore INTEL_SCALAR_* debug variables on Gen10+.
Scalar mode has been default since Broadwell, and vector mode is getting
increasingly unmaintained.  There are a few things that don't even fully
work in vector mode on Skylake, but we've never cared because nobody
uses it.  There's no point in porting it forward to new platforms.

So, just ignore the debug options to force it on.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-05-29 21:40:44 -07:00
Jason Ekstrand 18e18a1863 i965: Move clip program compilation to the compiler
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-05-26 07:58:01 -07:00
Jason Ekstrand 9fb8a8775b i965: Move SF compilation to the compiler
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-05-26 07:58:01 -07:00
Jason Ekstrand 21ba2b4bef intel/compiler: Make brw_disasm take const assembly
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-05-26 07:58:01 -07:00
Samuel Iglesias Gonsálvez e69e5c7006 i965/vec4: load dvec3/4 uniforms first in the push constant buffer
Reorder the uniforms to load first the dvec4-aligned variables in the
push constant buffer and then push the vec4-aligned ones. It takes
into account that the relocated uniforms should be aligned to their
channel size.

This fixes a bug were the dvec3/4 might be loaded one part on a GRF and
the rest in next GRF, so the region parameters to read that could break
the HW rules.

v2:
- Fix broken logic.
- Add a comment to explain what should be needed to optimise the usage
  of the push constant buffer slots, as this patch does not pack the
  uniforms.

v3:
- Implemented the push constant buffer usage optimization.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Cc: "17.1" <mesa-stable@lists.freedesktop.org>
Acked-by: Francisco Jerez <currojerez@riseup.net>
2017-05-18 06:49:54 +02:00
Samuel Iglesias Gonsálvez 8aa6ada838 i965/vec4: fix swizzle and writemask when loading an uniform with constant offset
It was setting XYWZ swizzle and writemask to all uniforms, no matter if they
were a vector or scalar, so this can lead to problems when loading them
to the push constant buffer.

Moreover, 'shift' calculation was designed to calculate the offset in
DWORDS, but it doesn't take into account DFs, so the calculated swizzle
for the later ones was wrong.

The indirect case is not changed because MOV INDIRECT will write
to all components. Added an assert to verify that these uniforms
are aligned.

v2:
- Fix 'shift' calculation (Curro)
- Set both swizzle and writemask.
- Add assert(shift == 0) for the indirect case.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Cc: "17.1" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-05-18 06:49:54 +02:00
Samuel Iglesias Gonsálvez 354f7f2cb9 i965/vec4/gs: restore the uniform values which was overwritten by failed vec4_gs_visitor execution
We are going to add a packing feature to reduce the usage of the push
constant buffer. One of the consequences is that 'nr_params' would be
modified by vec4_visitor's run call, so we need to restore it if one of
them failed before executing the fallback ones. Same thing happens to the
uniforms values that would be reordered afterwards.

Fixes GL45-CTS.arrays_of_arrays_gl.InteractionFunctionCalls2 when
the dvec4 alignment and packing patch is applied.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Cc: "17.1" <mesa-stable@lists.freedesktop.org>
Acked-by: Francisco Jerez <currojerez@riseup.net>
2017-05-18 06:49:28 +02:00
Matt Turner 169e1e26ee i965: Fix test_eu_validate.cpp
Broken by commit a7217e909c ("i965: Pass pointer and end of assembly
to brw_validate_instructions").

Reported-by: Aaron Watry <awatry@gmail.com>
2017-05-16 11:45:07 -07:00
Matt Turner aae2626be8 i965: Add a weak no-op nir_print_instr() symbol
intel_asm_annotation.c is part of libintel_compiler.la, which contains
code for disassembling and validating shaders that we want to call in
aubinator_error_decode.

dump_assembly() calls nir_print_instr() to print annotations, and
although dump_assembly() is not called by aubinator_error_decode (nor is
any function in intel_asm_annotation.c) it causes undefined references
to nir_print_instr().

To work around, provide a no-op weak symbol to resolve against.
2017-05-15 11:43:01 -07:00
Matt Turner d98e82c772 i965: Allow brw_eu_validate to handle compact instructions
This will allow the validator to run on shader programs we find in the
GPU hang error state.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-05-15 11:42:56 -07:00
Matt Turner a7217e909c i965: Pass pointer and end of assembly to brw_validate_instructions
This will allow us to more easily run brw_validate_instructions() on
shader programs we find in GPU hang error states.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
2017-05-15 11:42:47 -07:00
Jason Ekstrand 037ce253b1 i965/vec4: Delete the system value infastructure
The only thing still using it is INVOCATION_ID for geometry shaders.
That's easily enough inlined into the nir_intrinsic_load_invocation_id
handling code.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:08:07 -07:00
Jason Ekstrand 2e9916ea04 i965/vec4: Use NIR to do GS input remapping
We're already doing this in the FS back-end.  This just does the same
thing in the vec4 back-end.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:08:07 -07:00
Jason Ekstrand e31042ab40 i965/fs: Move remapping of gl_PointSize to the NIR level
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:08:06 -07:00
Jason Ekstrand 5b00c3cc05 i965/nir: Inline remap_inputs_with_vue_map
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:08:06 -07:00
Jason Ekstrand 0d5f89cdc3 i965/vec4: Use NIR remapping for VS attributes
The NIR pass already handles remapping system values to attributes for
us so we delete the system value code as part of the conversion.

We also change nir_lower_vs_inputs to take an explicit inputs_read
bitmask and pass in the inputs_read from prog_data instead from pulling
it out of NIR.  This is because the version in prog_data may get
EDGEFLAG added to it on some old platforms.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:08:06 -07:00
Jason Ekstrand 80aa6e9d32 intel/compiler/vs: Move inputs_read handling to generic code
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:08:03 -07:00
Jason Ekstrand d2fe804d18 i965/vec4: Set VERT_BIT_EDGEFLAG based on the VUE map
We also add a nice little comment to make it more clear exactly what
happens with the edge flag copy.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:07:47 -07:00
Jason Ekstrand ca4d192802 i965/fs: Lower gl_VertexID and friends to inputs at the NIR level
NIR calls these system values but they come in from the VF unit as
vertex data.  It's terribly convenient to just be able to treat them as
such in the back-end.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:07:47 -07:00
Jason Ekstrand 24e6fba500 i965/vs: Set uses_vertexid and friends from brw_compile_vs
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:07:47 -07:00
Jason Ekstrand 5e832302dc i965: Move multiply by 4 for VS ATTR setup into the scalar backend.
The vec4 backend will want to count in units of vec4s, not scalar
components.  The simplest solution is to move the multiplication by 4
into the scalar backend.  This also improves consistency with how we
count varyings.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:07:47 -07:00
Jason Ekstrand 36764b6923 i965/nir: Inline remap_vs_attrs
Now that we have nice block iterators, there's no good reason for this
to be off on it's own.  While we're here, we convert to using the NIR
const index getters/setters instead of whacking const_index values
directly.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:07:47 -07:00
Jason Ekstrand b86dba8a0e nir: Embed the shader_info in the nir_shader again
Commit e1af20f18a changed the shader_info
from being embedded into being just a pointer.  The idea was that
sharing the shader_info between NIR and GLSL would be easier if it were
a pointer pointing to the same shader_info struct.  This, however, has
caused a few problems:

 1) There are many things which generate NIR without GLSL.  This means
    we have to support both NIR shaders which come from GLSL and ones
    that don't and need to have an info elsewhere.

 2) The solution to (1) raises all sorts of ownership issues which have
    to be resolved with ralloc_parent checks.

 3) Ever since 00620782c9, we've been
    using nir_gather_info to fill out the final shader_info.  Thanks to
    cloning and the above ownership issues, the nir_shader::info may not
    point back to the gl_shader anymore and so we have to do a copy of
    the shader_info from NIR back to GLSL anyway.

All of these issues go away if we just embed the shader_info in the
nir_shader.  There's a little downside of having to copy it back after
calling nir_gather_info but, as explained above, we have to do that
anyway.

Acked-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-09 15:07:47 -07:00
Lionel Landwerlin 32f14332f5 intel: compiler: prevent integer overflow
CID: 1399477, 1399478 (Integer handling issues)

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-05-09 13:56:17 +01:00
Lionel Landwerlin 85182e490c intel: compiler: remove duplicated code
CID: 1399470: (Control flow issues)

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-05-09 13:56:17 +01:00
Rafael Antognolli 8fa8abef4b i965: Move enums to brw_compiler.h.
These enums live inside struct brw_wm_prog_data, so it makes sense to
keep them in the same header. It also allows to use them without
including brw_eu_defines.h.

Signed-off-by: Rafael Antognolli <rafael.antognolli@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-05-03 16:55:58 -07:00
Samuel Iglesias Gonsálvez f57e234fdd i965/vec4: don't modify regioning parameters to the sources of DF align1 instructions
The regioning parameters are now properly set by convert_to_hw_regs()
and we don't need to fix them in the generator. That latter fix
previously done in the generator was strictly speaking wrong for any
non-identity regions.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Cc: "17.1" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-05-03 15:32:39 +02:00
Samuel Iglesias Gonsálvez aaeb1c99be i965/vec4: fix register width for DF VGRF and UNIFORM
On gen7, the swizzles used in DF align16 instructions works for element
size of 32 bits, so we can address only 2 consecutive DFs. As we assumed that
in the rest of the code and prepare the instructions for this (scalarize_df()),
we need to set it to two again.

However, for DF align1 instructions, a width of 2 is wrong as we are not
reading the data we want. For example, an uniform would have a region of
<0, 2, 1> so it would repeat the first 2 DFs, when we wanted to access
to the first 4.

This patch sets the default one to 4 and then modifies the width of
align16 instruction's DF sources when we translate the logical swizzle
to the physical one.

v2:
- Remove conditional (Curro).

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Cc: "17.1" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-05-03 15:32:39 +02:00
Samuel Iglesias Gonsálvez 7f728bce81 i965/vec4: fix vertical stride to avoid breaking region parameter rule
From IVB PRM, vol4, part3, "General Restrictions on Regioning
Parameters":

  "If ExecSize = Width and HorzStride ≠ 0, VertStride must
   be set to Width * HorzStride."

In next patch, we are going to modify the region parameter for
uniforms and vgrf. For uniforms that are the source of
DF align1 instructions, they will have <0, 4, 1> regioning and
the execsize for those instructions will be 4, so they will break
the regioning rule. This will be the same for VGRF sources where
we use the vstride == 0 exploit.

As we know we are not going to cross the GRF boundary with that
execsize and parameters (not even with the exploit), we just fix
the vstride here.

v2:
- Move is_align1_df() (Curro)
- Refactor exec_size == width calculation (Curro)

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Cc: "17.1" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-05-03 15:32:39 +02:00
Francisco Jerez 58324389be intel/fs: Take into account amount of data read in spilling cost heuristic.
Until now the spilling cost calculation was neglecting the amount of
data read from the register during the spilling cost calculation.
This caused it to make suboptimal decisions in some cases leading to
higher memory bandwidth usage than necessary.

Improves Unigine Heaven performance by ~4% on BDW, reversing an
unintended FPS regression from my previous commit
147e71242c with n=12 and statistical
significance 5%.  In addition SynMark2 OglCSDof performance is
improved by an additional ~5% on SKL, and a Kerbal Space Program
apitrace around the Moho planet I can provide on request improves by
~20%.

Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Plamena Manolova <plamena.manolova@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-04-24 11:01:40 -07:00
Francisco Jerez ecc19e12dc intel/fs: Use regs_written() in spilling cost heuristic for improved accuracy.
This is what we use later on to compute the number of registers that
will actually get spilled to memory, so it's more likely to match
reality than the current open-coded approximation.

Cc: <mesa-stable@lists.freedesktop.org>
Reviewed-by: Plamena Manolova <plamena.manolova@intel.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-04-24 10:59:56 -07:00
Kenneth Graunke 6b10c37b9c i965/vec4: Use reads_accumulator_implicitly(), not MACH checks.
Curro pointed out that I should not just check for MACH, but use
the reads_accumulator_implicitly() helper, which would also prevent
the same bug with MAC and SADA2 (if we ever decide to use them).

Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-24 10:53:49 -07:00
Timothy Arceri 7a7ee40c2d nir/i965: add before ffma algebraic opts
This shuffles constants down in the reverse of what the previous
patch does and applies some simpilifications that may be made
possible from doing so.

Shader-db results BDW:

total instructions in shared programs: 12980814 -> 12977822 (-0.02%)
instructions in affected programs: 281889 -> 278897 (-1.06%)
helped: 1231
HURT: 128

total cycles in shared programs: 246562852 -> 246567288 (0.00%)
cycles in affected programs: 11271524 -> 11275960 (0.04%)
helped: 1630
HURT: 1378

V2: mark float opts as inexact

Reviewed-by: Elie Tournier <elie.tournier@collabora.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
2017-04-24 12:08:14 +10:00
Kenneth Graunke 2faf227ec2 i965/vec4: Avoid reswizzling MACH instructions in opt_register_coalesce().
opt_register_coalesce() was optimizing sequences such as:

   mul(8) acc0:D, attr18.xyyy:D, attr19.xyyy:D
   mach(8) vgrf5.xy:D, attr18.xyyy:D, attr19.xyyy:D
   mov(8) m4.zw:F, vgrf5.xxxy:F

into:

   mul(8) acc0:D, attr18.xyyy:D, attr19.xyyy:D
   mach(8) m4.zw:D, attr18.xxxy:D, attr19.xxxy:D

This doesn't work - if we're going to reswizzle MACH, we'd need to
reswizzle the MUL as well.  Here, the MUL fills the accumulator's .zw
components with attr18.yy * attr19.yy.  But the MACH instruction expects
.z to contain attr18.x * attr19.x.  Bogus results ensue.

No change in shader-db on Haswell.  Prevents regressions in Timothy's
patches to use enhanced layouts for varying packing (which rearrange
code just enough to trigger this pre-existing bug, but were fine
themselves).

Acked-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-04-22 00:01:16 -07:00
Matt Turner 2eeb1b0ad9 i965: Use correct VertStride on align16 instructions.
In commit c35fa7a, we changed the "width" of DF source registers to 2,
which is conceptually fine. Unfortunately a VertStride of 2 is not
allowed by align16 instructions on IVB/BYT, and the regular VertStride
of 4 works fine in any case.

See generated_tests/spec/arb_gpu_shader_fp64/execution/built-in-functions/vs-round-double.shader_test
for example:

cmp.ge.f0(8)    g18<1>DF        g1<0>.xyxyDF    -g8<2>DF        { align16 1Q };
        ERROR: In Align16 mode, only VertStride of 0 or 4 is allowed
cmp.ge.f0(8)    g19<1>DF        g1<0>.xyxyDF    -g9<2>DF        { align16 2N };
        ERROR: In Align16 mode, only VertStride of 0 or 4 is allowed

v2:
- Add spec quote (Curro).
- Change the condition to only BRW_VERTICAL_STRIDE_2 (Curro)

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:09 -07:00
Samuel Iglesias Gonsálvez d8441e2276 i965/vec4/dce: improve track of partial flag register writes
This is required for correctness in presence of multiple 4-wide flag
writes (e.g. 4-wide instructions with a conditional mod set) which
update a different portion of the same 8-bit flag subregister.

Right now we keep track of flag dataflow with 8-bit granularity and
consider flag writes to have killed any previous definition of the
same subregister even if the write was less than 8 channels wide,
which can cause live flag register updates to be dead
code-eliminated incorrectly.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:09 -07:00
Samuel Iglesias Gonsálvez c1fc8fad47 i965/vec4: don't do horizontal stride on some register file types
horiz_offset() shouldn't be doing anything for scalar registers,
because all channels of any SIMD instructions will end up reading or
writing the same component of the register, so shifting the register
offset would be wrong.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
[ Francisco Jerez: Re-implement in terms of is_uniform() for
  simplicity.  Pass argument by const reference.  Clarify commit
  message. ]
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:09 -07:00
Matt Turner 21e8e3a848 i965/vec4: Fix exec size for MOVs {SET,PICK}_{HIGH,LOW}_32BIT.
Otherwise for a pack_double_2x32_split opcode, we emit:

   vec1 64 ssa_135 = pack_double_2x32_split ssa_133, ssa_134
mov(8)          g5<1>UD         g5<4>.xUD                       { align16 1Q compacted };
mov(8)          g7<2>UD         g5<4,4,1>UD                     { align1 1Q };
        ERROR: When the destination spans two registers, the source must span two registers
               (exceptions for scalar source and packed-word to packed-dword expansion)
mov(8)          g8<2>UD         g5.4<4,4,1>UD                   { align1 2N };
        ERROR: The offset from the two source registers must be the same
mov(8)          g5<1>UD         g6<4>.xUD                       { align16 1Q compacted };
mov(8)          g7.1<2>UD       g5<4,4,1>UD                     { align1 1Q };
        ERROR: When the destination spans two registers, the source must span two registers
               (exceptions for scalar source and packed-word to packed-dword expansion)
mov(8)          g8.1<2>UD       g5.4<4,4,1>UD                   { align1 2N };
        ERROR: The offset from the two source registers must be the same

The intention was to emit mov(4)s for the instructions that have ERROR
annotations.

See tests/spec/arb_gpu_shader_fp64/execution/vs-isinf-dvec.shader_test
for example.

v2 (Samuel):
- Instead of setting the exec size to a fixed value, don't double it
(Curro).
- Add PICK_{HIGH,LOW}_32BIT to the condition.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
[ Francisco Jerez: Trivial rebase changes. ]
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:09 -07:00
Samuel Iglesias Gonsálvez f030aaf2fb i965/vec4: use vec4_builder to emit instructions in setup_imm_df()
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
[ Francisco Jerez: Drop useless vec4_visitor dependencies.  Demote to
  static stand-alone function.  Don't write unused components in the
  result.  Use vec4_builder interface for register allocation. ]
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:09 -07:00
Juan A. Suarez Romero a907c91e93 i965/vec4: consider subregister offset in live variables
Take into account offset values less than a full register (32 bytes)
when getting the var from register.

This is required when dealing with an operation that writes half of the
register (like one d2x in IVB/BYT, which uses exec_size == 4).

v2:
- Take in account this offset < 32 in liveness analysis too (Curro)

v3:
- Change formula in var_from_reg() (Curro)
- Remove useless changes (Curro)

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:08 -07:00
Francisco Jerez 92649a3e67 i965/vec4: fix assert to detect SIMD lowered DF instructions in IVB
On IVB, DF instructions have lowered the SIMD width to 4 but the
exec_size will be later doubled. Fix the assert to avoid crashing in
this case.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
[ Francisco Jerez: Simplify assert.  Except for the 'inst->group % 4
  == 0' part the assertion was redundant with the previous assertion. ]
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:08 -07:00
Samuel Iglesias Gonsálvez 6e3265eae5 i965/vec4: split VEC4_OPCODE_FROM_DOUBLE into one opcode per destination's type
This way we can set the destination type as double to all these new opcodes,
avoiding any optimizer's confusion that was happening before.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
[ Francisco Jerez: Drop no_spill workaround originally needed due to
  the bogus destination type of VEC4_OPCODE_FROM_DOUBLE. ]
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:08 -07:00
Samuel Iglesias Gonsálvez 50a5217637 i965/vec4: split d2x conversion and data gathering from one opcode to two explicit ones
When doing a 64-bit to a smaller data type size conversion, the destination should
be aligned to 64-bits. Because of that, we need to gather the data after the
actual conversion.

Until now, these two operations were done by VEC4_OPCODE_FROM_DOUBLE but
now we split them explicitely in two different instructions:
VEC4_OPCODE_FROM_DOUBLE just do the conversion and
VEC4_OPCODE_PICK_LOW_32BIT will gather the data.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:08 -07:00
Juan A. Suarez Romero cfaf14a126 i965/vec4: fix VEC4_OPCODE_FROM_DOUBLE for IVB/BYT
In the generator we must generate slightly different code for
Ivybridge/Baytrail, because of the way the stride works in
this hardware.

v2:
- Use stride and don't need to fix dst (Curro)

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:08 -07:00
Juan A. Suarez Romero be445d3ea3 i965/vec4: keep original type when dealing with null registers
Keep the original type when dealing with null registers. Especially
because we do no want to introduce an implicit conversion between
types that could affect the conditional flags.

This affects especially when the original type is DF, and we are working
on Ivybridge/Baytrail.

v2 (Curro)
- Fix typo.
- Use retype() instead of applying the type directly.
- Remove unneeded retype.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:08 -07:00
Samuel Iglesias Gonsálvez a21dc2b500 i965/vec4: split DF instructions and later double its execsize in IVB/BYT
We need to split DF instructions in two on IVB/BYT as it needs an
execsize 8 to process 4 DF values (one GRF in total).

v2:
- Rename helper and make it static inline function (Matt).
- Fix indention and add braces (Matt).

v3:
- Don't edit IR instruction when doubling exec_size (Curro)
- Add comment into the code (Curro).
- Manage ARF registers like the others (Curro)

v4:
- Add get_exec_type() function and use it to calculate the execution
  size.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
[ Francisco Jerez: Fix bogus 'type != BAD_FILE' check.  Take
  destination type as execution type where there is no valid source.
  Assert-fail if the deduced execution type is byte.  Clarify comment
  in get_lowered_simd_width().  Move SIMD width workaround outside of
  'if (...inst->size_written > REG_SIZE)' conditional block, since the
  problem should be independent of whether the amount of data written
  by the instruction is greater or lower than a GRF.  Drop redundant
  is_ivb_df definition.  Drop bogus inst->exec_size < 8 check.
  Simplify channel group assertion. ]
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:08 -07:00
Samuel Iglesias Gonsálvez a5399e8b1c i965/fs: lower all non-force_writemask_all DF instructions to SIMD4 on IVB/BYT
The hardware applies the same channel enable signals to both halves of
the compressed instruction which will be just wrong under non-uniform
control flow. Fix this by splitting those instructions to SIMD4.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:08 -07:00
Francisco Jerez ebfb703d44 i965/fs: Get 64-bit indirect moves working on IVB.
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
2017-04-14 14:56:08 -07:00
Matt Turner 630b84cdc8 i965: Use source region <1,2,0> when converting to DF.
Doing so allows us to use a single MOV in VEC4_OPCODE_TO_DOUBLE instead
of two.

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
2017-04-14 14:56:08 -07:00
Juan A. Suarez Romero 3198ce3f96 i965/fs: fix lower SIMD width for IVB/BYT's MOV_INDIRECT
According to the IVB and HSW PRMs:

"2.When the destination requires two registers and the sources are
 indirect, the sources must use 1x1 regioning mode."

So for DF instructions the execution size is not limited by the number
of address registers that are available, but by the EU decompression
logic not handling VxH indirect addressing correctly.

This patch limits the SIMD width to 4 in this case.

v2:
- Fix typo (Matt).
- Fix condition (Curro)

v3:
- Add spec quote (Curro)

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:07 -07:00
Juan A. Suarez Romero 571cbd05eb i965/fs: fix dst stride in IVB/BYT type conversions
When converting a DF to 32-bit conversions, we set dst stride to 2,
to fulfill alignment restrictions because the upper Dword of every
Qword will be written with undefined value.

But in IVB/BYT, this is not necessary, as each DF conversion already
writes 2, the first one the real value, and the second one a 0.
That is, IVB/BYT already set stride = 2 implicitly, so we must set it to
1 explicitly to avoid ending up with stride = 4.

v2:
- Fix typo (Matt)

v3:
- Fix stride in the destination's brw_reg, don't modity IR (Curro)

v4:
- Remove 'is_dst' argument of brw_reg_from_fs_reg() (Curro)
- Fix comment (Curro).
- Relax hstride assert (Curro)

Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
[ Francisco Jerez: Minor spelling fixes. ]
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:07 -07:00
Samuel Iglesias Gonsálvez af6fc3a8ea i965/fs: rename lower_d2x to lower_conversions
v2:
- Change the name to lower_conversions.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:07 -07:00
Samuel Iglesias Gonsálvez dee31311eb Revert "i965/fs: Don't emit SEL instructions for type-converting MOVs."
This reverts commit 7dccd38b40.

d2x pass fixes SEL instructions when there is a type conversion
by doing a SEL without type conversion and then convert the result.
This pass also takes into account the non-uniform control flow.

Then, 7dccd38b40 is not needed anymore.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-04-14 14:56:07 -07:00
Samuel Iglesias Gonsálvez aeecc82d05 i965/fs: generalize the legalization d2x pass
Generalize it to lower any unsupported narrower conversion.

v2 (Curro):
- Add supports_type_conversion()
- Reuse existing intruction instead of cloning it.
- Generalize d2x to narrower and equal size conversions.

v3 (Curro):
- Make supports_type_conversion() const and improve it.
- Use foreach_block_and_inst to process added instructions.
- Simplify code.
- Add assert and improve comments.
- Remove redundant mov.
- Remove useless comment.
- Remove saturate == false assert and add support for saturation
  when fixing the conversion.
- Add get_exec_type() function.

v4 (Curro):
- Use get_exec_type() function to get sources' type.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:07 -07:00
Matt Turner 94ffeb7fa2 i965: Use <0,2,1> region for scalar DF sources on IVB/BYT.
On HSW+, scalar DF sources can be accessed using the normal <0,1,0>
region, but on IVB and BYT DF regions must be programmed in terms of
floats. A <0,2,1> region accomplishes this.

v2:
- Apply region <0,2,1> in brw_reg_from_fs_reg() (Curro).

v3:
- Added comment explaining the reason (Curro).

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:07 -07:00
Samuel Iglesias Gonsálvez 82d17615f4 i965/fs: clamp exec_size when an instruction has a scalar DF source
Then the SIMD lowering pass will get rid of any compressed instructions with scalar
source (whether force_writemask_all or not) and we avoid hitting the Gen7 region
decompression bug.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Suggested-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:07 -07:00
Juan A. Suarez Romero 0f1316d4db i965/fs: double regioning parameters and execsize for DF in IVB/BYT
In IVB and BYT, both regioning parameters and execution sizes are measured as
32-bits element size.

So when we have something like:

mov(8) g2<1>DF g3<4,4,1>DF

We are not actually moving 8 doubles (our intention), but 4 doubles.

We need to double the parameters to cope with this issue. However,
horizontal strides don't behave as they're supposed to on IVB
for DF regions, they will cause each 32-bit half of DF sources to be
strided individually, and doubling the value won't make any difference.

v2:
- Use devinfo directly (Matt).
- Use Baytrail instead of Valleview (Matt).
- Use IvyBridge instead of Ivy (Matt)
- Double the exec_size in code emission (Curro)

v3:
- Change hstride doubling by an assert and fix commit log (Curro).
- Substitute remaining compiler->devinfo by devinfo (Curro).

v4:
- Fix comment (Curro).

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:07 -07:00
Juan A. Suarez Romero 79af256388 i965/fs: add helper to retrieve instruction execution type
The execution data size is the biggest type size of any instruction
operand.

We will use it to know if the instruction deals with DF, because in Ivy
we need to double the execution size and regioning parameters.

v2:
- Fix typo in commit log (Matt)
- Use static inline function instead of fs_inst's method (Curro).
- Define the result as a constant (Curro).
- Fix indentation (Matt).
- Add braces to nested control flow (Matt).

v3 (Curro):
- Add get_exec_type() and other auxiliary functions and use them to
  calculate its size.

Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
[ Francisco Jerez: Fix bogus 'type != BAD_FILE' check.  Fix deduced
  execution type for integer vector types.  Take destination type as
  execution type where there is no valid source.  Assert-fail if the
  deduced execution type is byte.  Move into brw_ir_fs.h header for
  consistency with the VEC4 back-end. ]
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:07 -07:00
Matt Turner fd349d29e4 i965: Handle IVB DF differences in the validator.
On IVB/BYT, region parameters and execution size for DF are in terms of
32-bit elements, so they are doubled. For evaluating the validity of an
instruction, we halve them.

v2 (Sam):
- Add comments.

Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
2017-04-14 14:56:07 -07:00
Iago Toral Quiroga fbac8b1f94 i965/disasm: also print nibctrl in IVB for execsize=8
4-wide DF operations where NibCtrl applies require and execsize of 8
in IvyBridge/BayTrail.

v2:
- Refactor NibCtrl printing (Matt)

Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
2017-04-14 14:56:06 -07:00
Francisco Jerez 147e71242c i965/fs: Take into account lower frequency of conditional blocks in spilling cost heuristic.
The individual branches of an if/else/endif construct will be executed
some unknown number of times between 0 and 1 relative to the parent
block.  Use some factor in between as weight while approximating the
cost of spill/fill instructions within a conditional if-else branch.
This favors spilling registers used within conditional branches which
are likely to be executed less frequently than registers used at the
top level.

Improves the framerate of the SynMark2 OglCSDof benchmark by ~1.9x on
my SKL GT4e.  Should have a comparable effect on other platforms.  No
significant regressions.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-04-11 15:28:54 -07:00
Jason Ekstrand 3503b2714b i965/fs: Always provide a default LOD of 0 for TXS and TXL
We already provide a default LOD for textureQueryLevels and texture() on
non-fragment stages.  However, there are more cases where one is needed
such as textureSize(gsampler2DMS*) in SPIR-V.  Instead of trying to list
out all of the cases one at a time, just provide the default for all TXS
and TXL operations.  This fixes a shader validation error in the new
Sascha deferredmultisampling demo which uses textureSize(gsampler2DMS).

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100391
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
Cc: "13.0 17.0" <mesa-stable@lists.freedesktop.org>
2017-04-04 18:33:35 -07:00
Jason Ekstrand 405ef7bb33 intel/vec4: Add some fall through comments
Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-04-03 16:58:35 -07:00
Alejandro Piñeiro 2f8d6bd578 i965: expose BRW_OPCODE_[F32TO16/F16TO32] name on gen8+
Technically those hw operations are only available on gen7, as gen8+
support the conversion on the MOV. But, when using the builder to
implement nir operations (example: nir_op_fquantize2f16), it is not
needed to do the gen check. This check is done later, on the final
emission at brw_F32TO16 (brw_eu_emit), choosing between the MOV or the
specific operation accordingly.

So in the middle, during optimization phases those hw operations can
be around for gen8+ too.

Without this patch, several (at least 95) vulkan-cts quantize tests
crashes when using INTEL_DEBUG=optimizer. For example:
dEQP-VK.spirv_assembly.instruction.graphics.opquantize.too_small_vert

v2: simplify the code using GEN_GE (Ilia Mirkin)
v3: tweak brw_instruction_name instead of changing opcode_descs
    table, that is used for validation (Matt Turner)

Reviewed-by: Matt Turner <mattst88@gmail.com>
2017-03-29 17:34:15 +02:00
Matt Turner 7dccd38b40 i965/fs: Don't emit SEL instructions for type-converting MOVs.
SEL can only convert between a few integer types, which we basically
never do.

Fixes fs/vs-double-uniform-array-direct-indirect-non-uniform-control-flow
Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com>
Acked-by: Francisco Jerez <currojerez@riseup.net>
2017-03-27 10:59:42 -07:00
Iago Toral Quiroga ddb2bb3ed4 anv/pipeline: make FragCoord include sample positions when sample shading
We need to know if sample shading has been requested during shader
compilation since that affects the way fragment coordinates are
computed.

Notice that the semantics of fragment coordinates only depend on
whether sample shading has been requested, not on whether more
than one sample will actually be produced (that is,
minSampleShading and rasterizationSamples do not affect this
behavior).

Because this setting affects the code we generate for the shader, we also
need to include it in the WM prog key. Notice we don't need to alter the
OpenGL code because it doesn't ever use this behavior, so they key's
value is always false (the default).

Fixes:
dEQP-VK.glsl.builtin_var.fragcoord_msaa.*

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-03-24 08:11:53 +01:00
Matt Turner 7499bc7fd7 i965: Replace OPT_V() with OPT().
We want to be able to check the progress of each pass and dump the NIR
for debugging purposes if it changed.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-03-23 14:34:44 -07:00
Matt Turner 1be91bd9d8 i965/fs: Return progress from demote_sample_qualifiers().
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-03-23 14:34:44 -07:00
Matt Turner fd3351246c i965/fs: Return progress from move_interpolation_to_top().
And mark as static at the same time.

Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-03-23 14:34:44 -07:00
Emil Velikov 2438c0a236 intel/compiler: consistently use ifndef guards over pragma once
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Vedran Miletić <vedran@miletic.net>
Acked-by: Juha-Pekka Heikkila <juhapekka.heikkila@gmail.com>
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
2017-03-22 16:55:22 +00:00
Emil Velikov 3b277bae66 i965: make brw_setup_image_uniform_values static
Used only internally.

Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Vedran Miletić <vedran@miletic.net>
Acked-by: Juha-Pekka Heikkila <juhapekka.heikkila@gmail.com>
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
2017-03-22 16:55:21 +00:00
Jason Ekstrand 762a6333f2 nir: Rework conversion opcodes
The NIR story on conversion opcodes is a mess.  We've had way too many
of them, naming is inconsistent, and which ones have explicit sizes was
sort-of random.  This commit re-organizes things and makes them all
consistent:

 - All non-bool conversion opcodes now have the explicit size in the
   destination and are named <src_type>2<dst_type><size>.

 - Integer <-> integer conversion opcodes now only come in i2i and u2u
   forms (i2u and u2i have been removed) since the only difference
   between the different integer conversions is whether or not they
   sign-extend when up-converting.

 - Boolean conversion opcodes all have the explicit size on the bool and
   are named <src_type>2<dst_type>.

Making things consistent also allows nir_type_conversion_op to be moved
to nir_opcodes.c and auto-generated using mako.  This will make adding
int8, int16, and float16 versions much easier when the time comes.

Reviewed-by: Eric Anholt <eric@anholt.net>
2017-03-14 07:36:40 -07:00
Jason Ekstrand 7107b32155 i965/fs: Re-arrange conversion operations
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-03-14 07:36:40 -07:00
Jason Ekstrand bab4610e9c i965/vec4: Get rid of the type parameter from to/from_double
Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
2017-03-14 07:36:40 -07:00
Jason Ekstrand b377be9213 i965/fs: Use num_components from the SSA def in image intrinsics
Reviewed-by: Eric Anholt <eric@anholt.net>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
2017-03-14 07:36:40 -07:00
Iago Toral Quiroga e8eeb759b7 intel: fix compiler build
compiler/brw_vec4_gs_visitor.cpp:744:39: error:
‘GEN7_MAX_GS_OUTPUT_VERTEX_SIZE_BYTES’ was not declared in this scope
           output_vertex_size_bytes <= GEN7_MAX_GS_OUTPUT_VERTEX_SIZE_BYTES);

Fixes: d0d4a5f43b ("i965: split EU defines to brw_eu_defines.h")
Reviewed-by: Emil Velikov <emil.velikov@collabora.com>
2017-03-13 13:09:24 +01:00
Emil Velikov aa09c9552c intel/compiler: whitespace cleanups
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-03-13 11:16:35 +00:00
Emil Velikov bdc5036464 intel/compiler: link all tests again gtest, even test_eu_compact"
At the moment all the tests but test_eu_compact are actual C++ gtests.
To simplify things, we can move the gtest.la to the common TEST_LIBS.
As we're here, we can rename change the test extension [to .cpp] to
avoid using the confusing dummy.cpp.

Add a nice comment in the makefile for posterity.

Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-03-13 11:16:35 +00:00
Jason Ekstrand 700bebb958 i965: Move the back-end compiler to src/intel/compiler
Mostly a dummy git mv with a couple of noticable parts:
 - With the earlier header cleanups, nothing in src/intel depends
files from src/mesa/drivers/dri/i965/
 - Both Autoconf and Android builds are addressed. Thanks to Mauro and
Tapani for the fixups in the latter
 - brw_util.[ch] is not really compiler specific, so it's moved to i965.

v2:
 - move brw_eu_defines.h instead of brw_defines.h
 - remove no-longer applicable includes
 - add missing vulkan/ prefix in the Android build (thanks Tapani)

v3:
 - don't list brw_defines.h in src/intel/Makefile.sources (Jason)
 - rebase on top of the oa patches

[Emil Velikov: commit message, various small fixes througout]
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
2017-03-13 11:16:34 +00:00