iris: fix race condition during busy tracking

The Iris code that deals with implicit tracking is protected by
bufmgr->bo_deps_lock. Before this patch, we hold this lock during
update_batch_syncobjs() but don't keep it held until we actually
submit the batch in the execbuf ioctl. This can lead to the following
race condition:

  - Context C1 generates a batch B1 that signals syncobj S1.
  - Context C2 generates a batch B2 that depends on something that B1
    from C1 is using, so we mark B2 as having to wait syncobj S1.
  - C2 calls submit_batch() before C1 does it.
  - The Kernel detects it was told to wait on syncobj S1 that was
    never even submitted, so it returns EINVAL to the execbuf ioctl.
  - We run abort() at the end of _iris_batch_flush().
    - If DEBUG is defined, we also print:
      iris: Failed to submit batchbuffer: Invalid argument

I couldn't figure out a way to reproduce this issue with real
workloads, but I was able to write a small reproducer to trigger this.
Basically it's a little GL program that has lots of contexts running
in different threads submitting compute shaders that keep using the
same SSBOs. I'll submit this as a piglit test. Edit: Tapani found a
dEQP test case which fails intermintently without this fix, so I'm not
sure a new Piglit is worth it now.

The solution itself is quite simple: just keep bo_deps_lock held all
the way from update_batch_syncobjs() until ioctl(). In order to make
that easier we just call update_batch_syncobjs() a little later. We
have to drop the lock as soon as the ioctl returns because removing
the references on the buffers would trigger other functions to try to
grab the lock again, leading to deadlocks.

Thanks to Kenneth Graunke for pointing out this issue.

This has also been confirmed to fix a dEQP test that was giving
intermittent failures:
  dEQP-EGL.functional.sharing.gles2.multithread.random.images.copyteximage2d.12

v2: Move decode_batch() out, just to be safe (Jason).
v3: Do it all after assembling validation_list (Ken).

Cc: mesa-stable
Fixes: 89a34cb845 ("iris: switch to explicit busy tracking")
Tested-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand@collabora.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14964>
This commit is contained in:
Paulo Zanoni 2022-02-09 12:56:35 -08:00 committed by Marge Bot
parent 370a851ef0
commit 3532c374de
1 changed files with 16 additions and 12 deletions

View File

@ -882,11 +882,6 @@ update_bo_syncobjs(struct iris_batch *batch, struct iris_bo *bo, bool write)
static void
update_batch_syncobjs(struct iris_batch *batch)
{
struct iris_bufmgr *bufmgr = batch->screen->bufmgr;
simple_mtx_t *bo_deps_lock = iris_bufmgr_get_bo_deps_lock(bufmgr);
simple_mtx_lock(bo_deps_lock);
for (int i = 0; i < batch->exec_count; i++) {
struct iris_bo *bo = batch->exec_bos[i];
bool write = BITSET_TEST(batch->bos_written, i);
@ -896,7 +891,6 @@ update_batch_syncobjs(struct iris_batch *batch)
update_bo_syncobjs(batch, bo, write);
}
simple_mtx_unlock(bo_deps_lock);
}
/**
@ -905,6 +899,9 @@ update_batch_syncobjs(struct iris_batch *batch)
static int
submit_batch(struct iris_batch *batch)
{
struct iris_bufmgr *bufmgr = batch->screen->bufmgr;
simple_mtx_t *bo_deps_lock = iris_bufmgr_get_bo_deps_lock(bufmgr);
iris_bo_unmap(batch->bo);
struct drm_i915_gem_exec_object2 *validation_list =
@ -938,15 +935,22 @@ submit_batch(struct iris_batch *batch)
free(index_for_handle);
/* The decode operation may map and wait on the batch buffer, which could
* in theory try to grab bo_deps_lock. Let's keep it safe and decode
* outside the lock.
*/
if (INTEL_DEBUG(DEBUG_BATCH))
decode_batch(batch);
simple_mtx_lock(bo_deps_lock);
update_batch_syncobjs(batch);
if (INTEL_DEBUG(DEBUG_BATCH | DEBUG_SUBMIT)) {
dump_fence_list(batch);
dump_bo_list(batch);
}
if (INTEL_DEBUG(DEBUG_BATCH)) {
decode_batch(batch);
}
/* The requirement for using I915_EXEC_NO_RELOC are:
*
* The addresses written in the objects must match the corresponding
@ -984,6 +988,8 @@ submit_batch(struct iris_batch *batch)
intel_ioctl(batch->screen->fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf))
ret = -errno;
simple_mtx_unlock(bo_deps_lock);
for (int i = 0; i < batch->exec_count; i++) {
struct iris_bo *bo = batch->exec_bos[i];
@ -1029,8 +1035,6 @@ _iris_batch_flush(struct iris_batch *batch, const char *file, int line)
iris_finish_batch(batch);
update_batch_syncobjs(batch);
if (INTEL_DEBUG(DEBUG_BATCH | DEBUG_SUBMIT | DEBUG_PIPE_CONTROL)) {
const char *basefile = strstr(file, "iris/");
if (basefile)