intel/fs: Opportunistically split SEND message payloads

While we've taken advantage of split-sends in select situations, many
other cases (such as sampler messages, framebuffer writes, and URB
writes) have never received that treatment and continue to use
monolithic send payloads.

This commit introduces a new optimization pass which detects SEND
messages with a single payload, finds an adjacent LOAD_PAYLOAD that
produces that payload, splits it in two, and updates the SEND to use both
of the new smaller payloads.

In the places where we manually used split SENDs, we relied on
knowledge of the message to determine a natural split point: for
example, header and data, or address and value.

In this pass, we instead infer a natural split point by looking at the
source registers.  Often, consecutive LOAD_PAYLOAD sources are already
grouped together in a contiguous block, such as a texture coordinate,
while another piece of data, such as an LOD, comes from elsewhere.  We
look for the point where the source list switches VGRFs, and split it
there.  (If there is a message header, we choose to split there, as it
will naturally come from elsewhere.)

This not only reduces the payload sizes, alleviating register pressure,
but it means that we may be able to eliminate some payload construction
altogether, if we have a contiguous block already and some extra data
being tacked on to one side or the other.
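Schematically (invented IR syntax and register numbers, not actual compiler output), the effect on a sampler message with a contiguous coordinate plus an LOD looks like:

```
Before:  load_payload vgrf10, vgrf4+0, vgrf4+1, vgrf4+2, vgrf7
         send ... vgrf10             (mlen 4)

After:   load_payload vgrf11, vgrf4+0, vgrf4+1, vgrf4+2
         load_payload vgrf12, vgrf7
         send ... vgrf11, vgrf12     (mlen 3, ex_mlen 1)
```

Because vgrf11 is now just a copy of an already-contiguous block, later copy propagation or register coalescing can often eliminate that first LOAD_PAYLOAD entirely.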

shader-db results for Icelake are:

   total instructions in shared programs: 19602513 -> 19369255 (-1.19%)
   instructions in affected programs: 6085404 -> 5852146 (-3.83%)
   helped: 23650 / HURT: 15
   helped stats (abs) min: 1 max: 1344 x̄: 9.87 x̃: 3
   helped stats (rel) min: 0.03% max: 35.71% x̄: 3.78% x̃: 2.15%
   HURT stats (abs)   min: 1 max: 44 x̄: 7.20 x̃: 2
   HURT stats (rel)   min: 1.04% max: 20.00% x̄: 4.13% x̃: 2.00%
   95% mean confidence interval for instructions value: -10.16 -9.55
   95% mean confidence interval for instructions %-change: -3.84% -3.72%
   Instructions are helped.

   total cycles in shared programs: 848180368 -> 842208063 (-0.70%)
   cycles in affected programs: 599931746 -> 593959441 (-1.00%)
   helped: 22114 / HURT: 13053
   helped stats (abs) min: 1 max: 482486 x̄: 580.94 x̃: 22
   helped stats (rel) min: <.01% max: 78.92% x̄: 4.76% x̃: 0.75%
   HURT stats (abs)   min: 1 max: 94022 x̄: 526.67 x̃: 22
   HURT stats (rel)   min: <.01% max: 188.99% x̄: 4.52% x̃: 0.61%
   95% mean confidence interval for cycles value: -222.87 -116.79
   95% mean confidence interval for cycles %-change: -1.44% -1.20%
   Cycles are helped.

   total spills in shared programs: 8387 -> 6569 (-21.68%)
   spills in affected programs: 5110 -> 3292 (-35.58%)
   helped: 359 / HURT: 3

   total fills in shared programs: 11833 -> 8218 (-30.55%)
   fills in affected programs: 8635 -> 5020 (-41.86%)
   helped: 358 / HURT: 3

   LOST:   1 SIMD16 shader, 659 SIMD32 shaders
   GAINED: 65 SIMD16 shaders, 959 SIMD32 shaders

   Total CPU time (seconds): 1505.48 -> 1474.08 (-2.09%)

Examining these results: the few shaders where spills/fills increased
were already spilling significantly, and were only slightly hurt.  The
affected applications were also helped in countless other shaders, and
other shaders stopped spilling altogether or saw 50% reductions.  Many
SIMD16 shaders were gained, and overall we gain more SIMD32 shaders,
though many that sit close to the register pressure line go back and
forth.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17018>
Kenneth Graunke, 2022-06-13 02:21:49 -07:00, committed by Marge Bot
parent a8b93e628a
commit 589b03d02f
2 changed files with 95 additions and 0 deletions

@@ -2888,6 +2888,98 @@ fs_visitor::opt_zero_samples()
   return progress;
}

/**
* Opportunistically split SEND message payloads.
*
* Gfx9+ supports "split" SEND messages, which take two payloads that are
* implicitly concatenated. If we find a SEND message with a single payload,
* we can split that payload in two. This results in smaller contiguous
* register blocks for us to allocate. But it can help beyond that, too.
*
* We try to split a LOAD_PAYLOAD between sources which change registers.
* For example, a sampler message often contains a x/y/z coordinate that may
* already be in a contiguous VGRF, combined with an LOD, shadow comparator,
* or array index, which comes from elsewhere. In this case, the first few
* sources will be different offsets of the same VGRF, then a later source
* will be a different VGRF. So we split there, possibly eliminating the
* payload concatenation altogether.
*/
bool
fs_visitor::opt_split_sends()
{
   if (devinfo->ver < 9)
      return false;

   bool progress = false;

   const fs_live_variables &live = live_analysis.require();

   int next_ip = 0;

   foreach_block_and_inst_safe(block, fs_inst, send, cfg) {
      int ip = next_ip;
      next_ip++;

      if (send->opcode != SHADER_OPCODE_SEND ||
          send->mlen == 1 || send->ex_mlen > 0)
         continue;

      /* Don't split payloads which are also read later. */
      assert(send->src[2].file == VGRF);
      if (live.vgrf_end[send->src[2].nr] > ip)
         continue;

      fs_inst *lp = (fs_inst *) send->prev;

      if (lp->is_head_sentinel() || lp->opcode != SHADER_OPCODE_LOAD_PAYLOAD)
         continue;

      if (lp->dst.file != send->src[2].file || lp->dst.nr != send->src[2].nr)
         continue;

      /* Split either after the header (if present), or when consecutive
       * sources switch from one VGRF to a different one.
       */
      unsigned i = lp->header_size;
      if (lp->header_size == 0) {
         for (i = 1; i < lp->sources; i++) {
            if (lp->src[i].file == BAD_FILE)
               continue;

            if (lp->src[0].file != lp->src[i].file ||
                lp->src[0].nr != lp->src[i].nr)
               break;
         }
      }

      if (i != lp->sources) {
         const fs_builder ibld(this, block, lp);
         fs_inst *lp2 =
            ibld.LOAD_PAYLOAD(lp->dst, &lp->src[i], lp->sources - i, 0);

         lp->resize_sources(i);
         lp->size_written -= lp2->size_written;

         lp->dst = fs_reg(VGRF, alloc.allocate(lp->size_written / REG_SIZE), lp->dst.type);
         lp2->dst = fs_reg(VGRF, alloc.allocate(lp2->size_written / REG_SIZE), lp2->dst.type);

         send->resize_sources(4);
         send->src[2] = lp->dst;
         send->src[3] = lp2->dst;
         send->ex_mlen = lp2->size_written / REG_SIZE;
         send->mlen -= send->ex_mlen;

         progress = true;
      }
   }

   if (progress)
      invalidate_analysis(DEPENDENCY_INSTRUCTIONS | DEPENDENCY_VARIABLES);

   return progress;
}

bool
fs_visitor::opt_register_renaming()
{
@@ -8583,6 +8675,8 @@ fs_visitor::optimize()
   OPT(lower_logical_sends);

   /* After logical SEND lowering. */
   OPT(opt_copy_propagation);
   OPT(opt_split_sends);
   OPT(fixup_nomask_control_flow);

   if (progress) {


@@ -172,6 +172,7 @@ public:
   bool opt_drop_redundant_mov_to_flags();
   bool opt_register_renaming();
   bool opt_bank_conflicts();
   bool opt_split_sends();
   bool register_coalesce();
   bool compute_to_mrf();
   bool eliminate_find_live_channel();