broadcom/compiler: don't try to hide TMU latency at QPU scheduling

Based on empirical testing with Sponza and a few UE4 samples, this is consistently slightly beneficial for performance.

The most likely reason is that thrsw is probably already quite effective at hiding latency, and we are already trying to hide latency at NIR scheduling and via TMU pipelining. Piling on when scheduling QPU typically ends up providing no latency benefit at all, and may instead prevent us from unblocking critical paths in the shader that depend on the TMU result, requiring us to execute more cycles to complete the program.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17451>
2022-07-08 13:33:11 +02:00 · 2022-07-08 13:33:11 +02:00 · f227aa7c98
parent 66d46a23fb
commit f227aa7c98
1 changed file with 13 additions and 0 deletions
--- a/src/broadcom/compiler/qpu_schedule.c
+++ b/src/broadcom/compiler/qpu_schedule.c
@@ -645,19 +645,32 @@ get_instruction_priority(const struct v3d_device_info *devinfo,
                return next_score;
        next_score++;

+        /* Empirical testing shows that using priorities to hide latency of
+         * TMU operations when scheduling QPU leads to slightly worse
+         * performance, even at 2 threads. We think this is because the thread
+         * switching is already quite effective at hiding latency and NIR
+         * scheduling (and possibly TMU pipelining too) are sufficient to hide
+         * TMU latency, so piling up on that here doesn't provide any benefits
+         * and instead may cause us to postpone critical paths that depend on
+         * the TMU results.
+         */
+#if 0
        /* Schedule texture read results collection late to hide latency. */
        if (v3d_qpu_waits_on_tmu(inst))
                return next_score;
        next_score++;
+#endif

        /* Default score for things that aren't otherwise special. */
        baseline_score = next_score;
        next_score++;

+#if 0
        /* Schedule texture read setup early to hide their latency better. */
        if (v3d_qpu_writes_tmu(devinfo, inst))
                return next_score;
        next_score++;
+#endif

        /* We should increase the maximum if we assert here */
        assert(next_score < MAX_SCHEDULE_PRIORITY);
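
The effect of the two `#if 0` blocks on the priority ladder can be illustrated with a small standalone sketch. This is not the Mesa code: `toy_kind`, `toy_priority`, `TOY_MAX_PRIORITY` and the `hide_tmu_latency` flag are hypothetical names used only for illustration, and the sketch assumes, as the comments in the hunk suggest, that lower scores are scheduled later and higher scores earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical instruction classes. The real code inspects
 * struct v3d_qpu_instr through helpers such as v3d_qpu_waits_on_tmu().
 */
enum toy_kind { TOY_TLB, TOY_TMU_WAIT, TOY_TMU_WRITE, TOY_OTHER };

/* Hypothetical stand-in for MAX_SCHEDULE_PRIORITY. */
#define TOY_MAX_PRIORITY 16

/* Same "ladder" pattern as the hunk above: each special case takes the
 * current rung and everything else moves one rung up. The
 * hide_tmu_latency flag plays the role of the #if 0 / #endif pairs.
 */
static int
toy_priority(enum toy_kind kind, bool hide_tmu_latency)
{
        int baseline_score;
        int next_score = 0;

        /* Lowest rung: TLB operations. */
        if (kind == TOY_TLB)
                return next_score;
        next_score++;

        /* Rung below baseline: collect TMU results late (now disabled). */
        if (hide_tmu_latency) {
                if (kind == TOY_TMU_WAIT)
                        return next_score;
                next_score++;
        }

        /* Default score for things that aren't otherwise special. */
        baseline_score = next_score;
        next_score++;

        /* Rung above baseline: issue TMU setup early (now disabled). */
        if (hide_tmu_latency) {
                if (kind == TOY_TMU_WRITE)
                        return next_score;
                next_score++;
        }

        assert(next_score < TOY_MAX_PRIORITY);
        return baseline_score;
}
```

With `hide_tmu_latency` set, the four classes get distinct scores (TLB 0, TMU wait 1, other 2, TMU write 3). With it cleared, which corresponds to this commit, TMU waits and TMU writes both fall through to the baseline score and are no longer pushed later or earlier than ordinary instructions by this function.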