Commit Graph

257 Commits

Author SHA1 Message Date
Marek Olšák 43f0a10051 radeonsi: add triple into si_compiler
Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Tested-by: Benedikt Schemmer <ben at besd.de>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2018-04-27 17:56:04 -04:00
Marek Olšák 87eb597758 radeonsi: add struct si_compiler containing LLVMTargetMachineRef
It will contain more variables.

Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Tested-by: Benedikt Schemmer <ben at besd.de>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2018-04-27 17:56:04 -04:00
Marek Olšák a8abbbb172 radeonsi: remove r600_pipe_common.h
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2018-04-27 17:56:04 -04:00
Marek Olšák 4c5efc40f4 radeonsi: update copyrights
Acked-by: Timothy Arceri <tarceri@itsqueeze.com>
2018-04-05 15:34:58 -04:00
Marek Olšák 2be6143032 radeonsi: implement GL_KHR_blend_equation_advanced
MSAA is supported using sample shading. Layered rendering and all texture
targets are also supported.

Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
2018-04-02 13:55:25 -04:00
Marek Olšák 9b7db12815 radeonsi: remove chip_class parameter from si_lower_nir
We can get it from si_screen.

Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
2018-03-08 14:58:16 -05:00
Timothy Arceri 70190a6567 radeonsi/nir: call ac_lower_indirect_derefs()
Fixes piglit tests:
tests/spec/glsl-1.50/execution/variable-indexing/gs-input-array-vec3-index-rd.shader_test
tests/spec/glsl-1.50/execution/geometry/max-input-components.shader_test

Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
2018-03-05 14:09:23 +11:00
Timothy Arceri 561503e3bd radeonsi: add chip class to compiler_ctx_state
This will be used in the following patch.

Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
2018-03-05 14:09:23 +11:00
Marek Olšák 8799eaed99 radeonsi: remove 2 unused user SGPRs from merged TES-GS with 32-bit pointers
The effect of the last 13 commits on user SGPR counts:

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2018-02-26 12:01:19 +01:00
Marek Olšák 3fa7a59d69 radeonsi: make SI_SGPR_VERTEX_BUFFERS the last user SGPR input
so that it can be removed and replaced with inline VBO descriptors,
and the pointer can be packed in unused bits of VBO descriptors.
This also removes the pointer from merged TES-GS where it's useless.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2018-02-26 12:01:08 +01:00
Marek Olšák 2d03c4cac8 radeonsi: move tess ring address into TCS_OUT_LAYOUT, removes 2 TCS user SGPRs
TCS_OUT_LAYOUT has 13 unused bits. That's enough for a 32-bit address
aligned to 512KB. Hey, it's a 13-bit pointer!

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2018-02-24 23:08:29 +01:00
Marek Olšák 190e064e63 radeonsi: move 2nd-shader descriptor pointers into s[0:1]
If 32-bit pointers are supported, both pointers can be moved into s[0:1]
and then ESGS has exactly the same user data SGPR declarations as VS.

If 32-bit pointers are not supported, only one pointer can be moved into
s[0:1]. In that case, the 2nd pointer is moved before TCS constants,
so that the location is the same in HS and GS.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2018-02-24 23:08:29 +01:00
Marek Olšák 931ec80eeb radeonsi: implement 32-bit pointers in user data SGPRs (v2)
User SGPRs changes:
    VS:     14 ->  9
    TCS:    14 -> 10
    TES:    10 ->  6
    GS:      8 ->  4
    GSCOPY:  2 ->  1
    PS:      9 ->  5
    Merged VS-TCS: 24 -> 16
    Merged VS-GS:  18 -> 11
    Merged TES-GS: 18 -> 11

SGPRS: 2170102 -> 2158430 (-0.54 %)
VGPRS: 1645656 -> 1641516 (-0.25 %)
Spilled SGPRs: 9078 -> 8810 (-2.95 %)
Spilled VGPRs: 130 -> 114 (-12.31 %)
Scratch size: 1508 -> 1492 (-1.06 %) dwords per thread
Code Size: 52094872 -> 52692540 (1.15 %) bytes
Max Waves: 371848 -> 372723 (0.24 %)

v2: - the shader cache needs to take address32_hi into account
    - set amdgpu-32bit-address-high-bits

Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com> (v1)
2018-02-17 04:52:17 +01:00
Marek Olšák 148b48646b radeonsi: print shader-db stats for main parts, not final binaries
This is needed to get shader-db stats for LS,HS,ES,GS stages on gfx9.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2018-01-31 03:21:20 +01:00
Marek Olšák c02c9ee550 radeonsi: move max_simd_waves computation into a separate function
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2018-01-31 03:21:20 +01:00
Timothy Arceri 452586b56a radeonsi: add dummy implementation of si_nir_scan_tess_ctrl()
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2018-01-05 11:58:55 +11:00
Samuel Pitoiset 45872a0a6d radeonsi: make use of ac_get_spi_shader_z_format()
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
2017-12-14 22:23:25 +01:00
Nicolai Hähnle 239d2b5809 radeonsi: clarify that si_shader_selector::esgs_itemsize is set for the ES part
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-11-28 09:34:43 +01:00
Nicolai Hähnle df5ebe0c26 radeonsi/gfx9: fix VM fault with fetched instance divisors
We need to account for SGPR locations in merged shaders.

This case is exercised by KHR-GL45.enhanced_layouts.vertex_attrib_locations

Fixes: 79c2e7388c ("radeonsi/gfx9: use SPI_SHADER_USER_DATA_COMMON")
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-11-20 16:26:10 +01:00
Nicolai Hähnle 4f493c79ee radeonsi: use ready fences on all shaders, not just optimized ones
There's a race condition between si_shader_select_with_key and
si_bind_XX_shader:

  Thread 1                         Thread 2
  --------                         --------
  si_shader_select_with_key
    begin compiling the first
    variant
    (guarded by sel->mutex)
                                   si_bind_XX_shader
                                     select first_variant by default
                                     as state->current
                                   si_shader_select_with_key
                                     match state->current and early-out

Since thread 2 never takes sel->mutex, it may go on rendering without a
PM4 for that shader, for example.

The solution taken by this patch is to broaden the scope of
shader->optimized_ready to a fence shader->ready that applies to
all shaders. This does not hurt the fast path (if anything it makes
it faster, because we don't explicitly check is_optimized).

It will also allow reducing the scope of sel->mutex locks, but this is
deferred to a later commit for better bisectability.

Fixes dEQP-EGL.functional.sharing.gles2.multithread.simple.buffers.bufferdata_render

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-11-09 11:37:51 +01:00
Marek Olšák 529cdce799 radeonsi: remove 'Authors:' comments
It's inaccurate. Instead, see the copyright and use "git log" and
"git blame" to know the authorship.

Acked-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-11-02 18:19:03 +01:00
Marek Olšák da0083f123 radeonsi: use postponed KILL only when derivatives are used
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-10-24 14:56:34 +02:00
Marek Olšák 2f4705afde radeonsi: if there's just const buffer 0, set it in place of CONST/SSBO pointer
SI_SGPR_CONST_AND_SHADER_BUFFERS now contains the pointer to const buffer 0
if there is no other buffer there.

Benefits:
- there is no constbuf descriptor upload and shader load

It's assumed that all constant addresses are within bounds. Non-constant
addresses are clamped against the last declared CONST variable.
This only works if the state tracker ensures the bound constant buffer
matches what the shader needs.

Once we get 32-bit pointers, we can only do this for user constant buffers
where the driver is in charge of the upload so that it can guarantee a 32-bit
address.

The real performance benefit might not be measurable.

These apps get 100% theoretical benefit in all shaders (except where noted):
- antichamber
- barman arkham origins
- borderlands 2
- borderlands pre-sequel
- brutal legend
- civilization BE
- CS:GO
- deadcore
- dota 2 -- most shaders
- europa universalis
- grid autosport -- most shaders
- left 4 dead 2
- legend of grimrock
- life is strange
- payday 2
- portal
- rocket league
- serious sam 3 bfe
- talos principle
- team fortress 2
- thea
- unigine heaven
- unigine valley -- also sanctuary and tropics
- wasteland 2
- xcom: enemy unknown & enemy within
- tesseract
- unity (engine)

Changed stats only:
    SGPRS: 2059998 -> 2086238 (1.27 %)
    VGPRS: 1626888 -> 1626904 (0.00 %)
    Spilled SGPRs: 7902 -> 7865 (-0.47 %)
    Code Size: 60924520 -> 60982660 (0.10 %) bytes
    Max Waves: 374539 -> 374526 (-0.00 %)

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-10-17 22:03:03 +02:00
Marek Olšák 6a8401a94e radeonsi: add VS blit shader creation
no users yet

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-10-07 18:26:35 +02:00
Nicolai Hähnle e4af4433fc radeonsi: hard-code pixel center for interpolateAtSample without multisample buffers
The GLSL rules for interpolateAtSample are unfortunate:

   "Returns the value of the input interpolant variable at
    the location of sample number sample. If
    multisample buffers are not available, the input
    variable will be evaluated at the center of the pixel.
    If sample sample does not exist, the position used to
    interpolate the input variable is undefined."

This fix will fallback to monolithic shader compilation when
interpolateAtSample is used without multisampling.

One alternative would be to always upload 16 sample positions,
filling the buffer up with repetition when the actual number of
samples is less, and then ANDing the sample ID with 0xf. However,
that punishes all well-behaving users of interpolateAtSample,
when in reality, only conformance tests should be affected by
the issue.

Fixes
dEQP-GLES31.functional.shaders.multisample_interpolation.interpolate_at_sample.non_multisample_buffer.*

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-09-13 18:25:45 +02:00
Nicolai Hähnle 92c4277990 radeonsi: apply a mask to gl_SampleMaskIn in the PS prolog
gl_SampleMaskIn is supposed to contain set bits only for the samples that
are covered by the current fragment shader invocation, but the VGPR
initialization hardware loads the set of all bits that are covered at the
current pixel.

Fixes various tests in
dEQP-GLES31.functional.shaders.sample_variables.sample_mask_in.*

Cc: mesa-stable@lists.freedesktop.org
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-09-13 18:25:41 +02:00
Marek Olšák 6eade342eb radeonsi: optimize TCS epilog when invocation 0 writes tess factors
This removes the barrier and LDS stores and loads for tess factors
when it's possible. The removal of the barrier seems more important
to me though.

In one shader, it removes 17 * 4 bytes from the shader binary.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-09-11 19:02:02 +02:00
Nicolai Hähnle 45c5c44451 radeonsi/gfx9: proper workaround for LS/HS VGPR initialization bug
When the HS wave is empty, the hardware writes the LS VGPRs starting at
v0 instead of v2. Workaround by shifting them back into place when
necessary. For simplicity, this is always done in the LS prolog.

According to the hardware team, this will be fixed in future chips,
so take that into account already.

Note that this is not a bug fix, as the bug was already worked
around by commit 166823bfd2 ("radeonsi/gfx9: add a temporary workaround
for a tessellation driver bug"). This change merely replaces the
workaround by one that should be better.

v2: add workaround code to shader only when necessary
v3: clarify the prefer_mono comment

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-09-06 10:02:49 +02:00
Samuel Pitoiset 781a13c475 radeonsi: declare new user SGPR indices for bindless samplers/images
A new pair of user SGPR is needed for loading the bindless
descriptors from shaders. Because the descriptors are global for
all stages, there is no need to add separate indices for GFX9.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-08-22 11:34:15 +02:00
Nicolai Hähnle 40697e8678 radeonsi: make si_shader_selector_reference globally visible
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-08-22 09:50:55 +02:00
Nicolai Hähnle b49c2c9fa3 radeonsi/nir: perform lowering of input/output driver locations
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-07-31 14:55:40 +02:00
Nicolai Hähnle 29d7bdd179 radeonsi: scan NIR shaders to obtain required info
v2: set num_instruction to 2, i.e. 1 + END (Marek)

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-07-31 14:55:32 +02:00
Nicolai Hähnle 90b3ba8970 radeonsi: add si_shader_selector::nir
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-07-31 14:55:32 +02:00
Marek Olšák 4a10d6154e radeonsi: move instance divisors into a constant buffer
Shader key size: 107 -> 47

Divisors of 0 and 1 are encoded in the shader key. Greater instance divisors
are loaded from a constant buffer.

The shader code doing the division is huge. Is it something we need to
worry about? Does any app use instance divisors >= 2?

VS prolog disassembly:
    s_load_dwordx4 s[12:15], s[0:1], 0x80  ; C00A0300 00000080
    s_nop 0                                ; BF800000
    s_waitcnt lgkmcnt(0)                   ; BF8C007F
    s_buffer_load_dword s14, s[12:15], 0x4 ; C0220386 00000004
    s_waitcnt lgkmcnt(0)                   ; BF8C007F
    v_cvt_f32_u32_e32 v4, s14              ; 7E080C0E
    v_rcp_iflag_f32_e32 v4, v4             ; 7E084704
    v_mul_f32_e32 v4, 0x4f800000, v4       ; 0A0808FF 4F800000
    v_cvt_u32_f32_e32 v4, v4               ; 7E080F04
    v_mul_hi_u32 v5, v4, s14               ; D2860005 00001D04
    v_mul_lo_i32 v6, v4, s14               ; D2850006 00001D04
    v_cmp_eq_u32_e64 s[12:13], 0, v5       ; D0CA000C 00020A80
    v_sub_i32_e32 v5, vcc, 0, v6           ; 340A0C80
    v_cndmask_b32_e64 v5, v6, v5, s[12:13] ; D1000005 00320B06
    v_mul_hi_u32 v5, v5, v4                ; D2860005 00020905
    v_add_i32_e32 v6, vcc, v5, v4          ; 320C0905
    v_subrev_i32_e32 v4, vcc, v5, v4       ; 36080905
    v_cndmask_b32_e64 v4, v4, v6, s[12:13] ; D1000004 00320D04
    v_mul_hi_u32 v5, v4, v1                ; D2860005 00020304
    v_add_i32_e32 v4, vcc, s8, v0          ; 32080008
    v_mul_lo_i32 v6, v5, s14               ; D2850006 00001D05
    v_add_i32_e32 v7, vcc, 1, v5           ; 320E0A81
    v_cmp_ge_u32_e64 s[12:13], v1, v6      ; D0CE000C 00020D01
    v_sub_i32_e32 v6, vcc, v1, v6          ; 340C0D01
    v_cmp_le_u32_e32 vcc, s14, v6          ; 7D960C0E
    v_cndmask_b32_e64 v8, 0, -1, s[12:13]  ; D1000008 00318280
    v_cndmask_b32_e64 v6, 0, -1, vcc       ; D1000006 01A98280
    v_and_b32_e32 v6, v8, v6               ; 260C0D08
    v_cmp_eq_u32_e32 vcc, 0, v6            ; 7D940C80
    v_cndmask_b32_e32 v6, v7, v5, vcc      ; 000C0B07
    v_add_i32_e32 v5, vcc, -1, v5          ; 320A0AC1
    v_cmp_eq_u32_e32 vcc, 0, v8            ; 7D941080
    v_cndmask_b32_e32 v5, v6, v5, vcc      ; 000A0B06
    v_add_i32_e32 v5, vcc, s9, v5          ; 320A0A09

v2: set prefer_mono for fetched instance divisors

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-27 19:55:09 +02:00
Marek Olšák f9a7e7fe14 radeonsi: use #pragma pack to pack si_shader_key
sizeof(struct si_shader_key):
  Before reverting the 2 commits: 120 bytes
  After reverting the 2 commits: 128 bytes
  With #pragma pack: 107 bytes

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-27 18:45:07 +02:00
Marek Olšák 77d2a98353 Revert "radeonsi: use uint32_t to declare si_shader_key.opt.kill_outputs"
This reverts commit 7b2240ac9c.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-27 18:45:07 +02:00
Marek Olšák dbe45e1180 Revert "radeonsi: remove 8 bytes from si_shader_key with uint32_t ff_tcs_inputs_to_copy"
This reverts commit 6b6fed3a3c.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-27 18:45:07 +02:00
Emil Velikov 1f958c1337 radeonsi: include ac_binary.h for struct ac_shader_binary
The header embeds the struct so it needs the header inclusion instead of
the dummy forward declaration.

Cc: Nicolai Hähnle <nicolai.haehnle@amd.com>
Cc: Marek Olšák <marek.olsak@amd.com>
Cc: Tom Stellard <tstellar@redhat.com>
Fixes: 32206c5e56 ("radeonsi: Add radeon_shader_binary member to struct
si_shader")
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Tested-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
2017-06-17 11:38:02 +01:00
Samuel Pitoiset 2c3a7d5840 radeonsi: track use of bindless samplers/images from tgsi_shader_info
This adds some new helper functions to know if the current draw
call (or dispatch compute) is using bindless samplers/images,
based on TGSI analysis.

Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-06-14 10:04:36 +02:00
Marek Olšák 6b6fed3a3c radeonsi: remove 8 bytes from si_shader_key with uint32_t ff_tcs_inputs_to_copy
The previous patch helps with this.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-12 18:24:37 +02:00
Marek Olšák 7b2240ac9c radeonsi: use uint32_t to declare si_shader_key.opt.kill_outputs
the next patch will benefit from this

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-12 18:24:37 +02:00
Marek Olšák 1621b33d73 radeonsi: remove 8 bytes from si_shader_key by flattening opt.hw_vs
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-12 18:24:37 +02:00
Marek Olšák 2b7fd9df9a radeonsi: precompute some fields for PA_CL_VS_OUT_CNTL in si_shader_selector
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-07 20:17:18 +02:00
Marek Olšák e9409c86e7 radeonsi: remove 8 bytes from si_shader_key
We can use a union in si_shader_key::mono.

Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-07 20:17:06 +02:00
Marek Olšák 6e2c07749b radeonsi: move streamout state update out of si_update_shaders
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-06-07 18:43:42 +02:00
Marek Olšák 53c2ef36da radeonsi: record which descriptor slots are used by shaders
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-05-18 22:15:02 +02:00
Marek Olšák f07c15ef80 radeonsi: merge sampler and image descriptor lists into one
Sampler slots: slot[8], .. slot[39] (ascending)
Image slots: slot[7], .. slot[0] (descending)

Each image occupies 1/2 of each slot, so there are 16 images in total,
therefore the layout is: slot[15], .. slot[0]. (in 1/2 slot increments)

Updating image slot 2n+i (i <= 1) also dirties and re-uploads slot 2n+!i.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-05-18 22:15:02 +02:00
Marek Olšák 5df24c3fa6 radeonsi: merge constant and shader buffers descriptor lists into one
Constant buffers: slot[16], .. slot[31] (ascending)
Shader buffers: slot[15], .. slot[0] (descending)

The idea is that if we have 4 constant buffers and 2 shader buffers, we only
have to upload 6 slots. That optimization is left for a later commit.

Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
2017-05-18 22:15:02 +02:00
Nicolai Hähnle a16ae77185 radeonsi: get rid of secondary input/output word
By keeping track of fewer generics, everything can fit into 64 bits.

Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-05-12 10:46:06 +02:00
Nicolai Hähnle 0dd8aa44b3 radeonsi: reduce the number of generics for shader IO unique indices
This is a high as possible while still allowing to merge the bitfields
with the next commit.

For OpenGL, 32 would be sufficient. Nine apparently uses (much!) higher
indices than. Indices that are out of bound don't hurt for VS-PS
pipelines, except that the VS output kill optimization is not applied.

Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
2017-05-12 10:46:06 +02:00