intel/fs: Opportunistically split SEND message payloads

While we've taken advantage of split-sends in select situations, many
other cases (such as sampler messages, framebuffer writes, and URB
writes) have never received that treatment and continue to use
monolithic send payloads.

This commit introduces a new optimization pass which detects SEND
messages with a single payload, finds an adjacent LOAD_PAYLOAD that
produces that payload, splits it in two, and updates the SEND to use both
of the new smaller payloads.

In the places where we manually used split SENDs, we relied on
knowledge of the message to determine a natural split point: for
example, header and data, or address and value.

In this pass, we instead infer a natural split point by looking at the
source registers.  Often, consecutive LOAD_PAYLOAD sources are already
grouped together in a contiguous block, such as a texture coordinate,
while another piece of data, such as an LOD, comes from elsewhere.  We
look for the point where the source list switches VGRFs, and split it
there.  (If there is a message header, we choose to split there, as it
will naturally come from elsewhere.)

This not only reduces the payload sizes, alleviating register pressure,
but it means that we may be able to eliminate some payload construction
altogether, if we have a contiguous block already and some extra data
being tacked on to one side or the other.
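Schematically (invented IR syntax and register numbers, not actual compiler output), the effect on a sampler message with a contiguous coordinate plus an LOD looks like:

```
Before:  load_payload vgrf10, vgrf4+0, vgrf4+1, vgrf4+2, vgrf7
         send ... vgrf10             (mlen 4)

After:   load_payload vgrf11, vgrf4+0, vgrf4+1, vgrf4+2
         load_payload vgrf12, vgrf7
         send ... vgrf11, vgrf12     (mlen 3, ex_mlen 1)
```

Because vgrf11 is now just a copy of an already-contiguous block, later copy propagation or register coalescing can often eliminate that first LOAD_PAYLOAD entirely.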

shader-db results for Icelake are:

   total instructions in shared programs: 19602513 -> 19369255 (-1.19%)
   instructions in affected programs: 6085404 -> 5852146 (-3.83%)
   helped: 23650 / HURT: 15
   helped stats (abs) min: 1 max: 1344 x̄: 9.87 x̃: 3
   helped stats (rel) min: 0.03% max: 35.71% x̄: 3.78% x̃: 2.15%
   HURT stats (abs)   min: 1 max: 44 x̄: 7.20 x̃: 2
   HURT stats (rel)   min: 1.04% max: 20.00% x̄: 4.13% x̃: 2.00%
   95% mean confidence interval for instructions value: -10.16 -9.55
   95% mean confidence interval for instructions %-change: -3.84% -3.72%
   Instructions are helped.

   total cycles in shared programs: 848180368 -> 842208063 (-0.70%)
   cycles in affected programs: 599931746 -> 593959441 (-1.00%)
   helped: 22114 / HURT: 13053
   helped stats (abs) min: 1 max: 482486 x̄: 580.94 x̃: 22
   helped stats (rel) min: <.01% max: 78.92% x̄: 4.76% x̃: 0.75%
   HURT stats (abs)   min: 1 max: 94022 x̄: 526.67 x̃: 22
   HURT stats (rel)   min: <.01% max: 188.99% x̄: 4.52% x̃: 0.61%
   95% mean confidence interval for cycles value: -222.87 -116.79
   95% mean confidence interval for cycles %-change: -1.44% -1.20%
   Cycles are helped.

   total spills in shared programs: 8387 -> 6569 (-21.68%)
   spills in affected programs: 5110 -> 3292 (-35.58%)
   helped: 359 / HURT: 3

   total fills in shared programs: 11833 -> 8218 (-30.55%)
   fills in affected programs: 8635 -> 5020 (-41.86%)
   helped: 358 / HURT: 3

   LOST:   1 SIMD16 shader, 659 SIMD32 shaders
   GAINED: 65 SIMD16 shaders, 959 SIMD32 shaders

   Total CPU time (seconds): 1505.48 -> 1474.08 (-2.09%)

Examining these results: the few shaders where spills/fills increased
were already spilling significantly, and were only slightly hurt.  The
affected applications were also helped in countless other shaders, and
other shaders stopped spilling altogether or saw 50% reductions.  Many
SIMD16 shaders were gained, and overall we gain more SIMD32 shaders,
though many that sit close to the register pressure line go back and
forth.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17018>
Kenneth Graunke, 2022-06-13 02:21:49 -07:00, committed by Marge Bot
parent a8b93e628a
commit 589b03d02f
2 changed files with 95 additions and 0 deletions

@@ -2888,6 +2888,98 @@ fs_visitor::opt_zero_samples()
   return progress;
}

/**
* Opportunistically split SEND message payloads.
*
* Gfx9+ supports "split" SEND messages, which take two payloads that are
* implicitly concatenated. If we find a SEND message with a single payload,
* we can split that payload in two. This results in smaller contiguous
* register blocks for us to allocate. But it can help beyond that, too.
*
* We try to split a LOAD_PAYLOAD between sources which change registers.
* For example, a sampler message often contains a x/y/z coordinate that may
* already be in a contiguous VGRF, combined with an LOD, shadow comparator,
* or array index, which comes from elsewhere. In this case, the first few
* sources will be different offsets of the same VGRF, then a later source
* will be a different VGRF. So we split there, possibly eliminating the
* payload concatenation altogether.
*/
bool
fs_visitor::opt_split_sends()
{
   if (devinfo->ver < 9)
      return false;

   bool progress = false;

   const fs_live_variables &live = live_analysis.require();

   int next_ip = 0;

   foreach_block_and_inst_safe(block, fs_inst, send, cfg) {
      int ip = next_ip;
      next_ip++;

      if (send->opcode != SHADER_OPCODE_SEND ||
          send->mlen == 1 || send->ex_mlen > 0)
         continue;

      /* Don't split payloads which are also read later. */
      assert(send->src[2].file == VGRF);
      if (live.vgrf_end[send->src[2].nr] > ip)
         continue;

      fs_inst *lp = (fs_inst *) send->prev;

      if (lp->is_head_sentinel() || lp->opcode != SHADER_OPCODE_LOAD_PAYLOAD)
         continue;

      if (lp->dst.file != send->src[2].file || lp->dst.nr != send->src[2].nr)
         continue;

      /* Split either after the header (if present), or when consecutive
       * sources switch from one VGRF to a different one.
       */
      unsigned i = lp->header_size;
      if (lp->header_size == 0) {
         for (i = 1; i < lp->sources; i++) {
            if (lp->src[i].file == BAD_FILE)
               continue;

            if (lp->src[0].file != lp->src[i].file ||
                lp->src[0].nr != lp->src[i].nr)
               break;
         }
      }

      if (i != lp->sources) {
         const fs_builder ibld(this, block, lp);
         fs_inst *lp2 =
            ibld.LOAD_PAYLOAD(lp->dst, &lp->src[i], lp->sources - i, 0);

         lp->resize_sources(i);
         lp->size_written -= lp2->size_written;

         lp->dst = fs_reg(VGRF, alloc.allocate(lp->size_written / REG_SIZE), lp->dst.type);
         lp2->dst = fs_reg(VGRF, alloc.allocate(lp2->size_written / REG_SIZE), lp2->dst.type);

         send->resize_sources(4);
         send->src[2] = lp->dst;
         send->src[3] = lp2->dst;
         send->ex_mlen = lp2->size_written / REG_SIZE;
         send->mlen -= send->ex_mlen;

         progress = true;
      }
   }

   if (progress)
      invalidate_analysis(DEPENDENCY_INSTRUCTIONS | DEPENDENCY_VARIABLES);

   return progress;
}

bool
fs_visitor::opt_register_renaming()
{
@@ -8583,6 +8675,8 @@ fs_visitor::optimize()
   OPT(lower_logical_sends);

   /* After logical SEND lowering. */
   OPT(opt_copy_propagation);
   OPT(opt_split_sends);
   OPT(fixup_nomask_control_flow);

   if (progress) {


@@ -172,6 +172,7 @@ public:
   bool opt_drop_redundant_mov_to_flags();
   bool opt_register_renaming();
   bool opt_bank_conflicts();
   bool opt_split_sends();
   bool register_coalesce();
   bool compute_to_mrf();
   bool eliminate_find_live_channel();