Temporarily abandons the idea of fusing waiters with execution.
For whatever reason, this seemed to cause random flicker in Halo Infinite
with async compute on, and I have failed to figure out exactly why.
By playing around with how commands are fused, the results changed
dramatically, which means I doubt vkd3d-proton was actually at fault
here.
There is some questionable code around UpdateTileMappings in the game
where a COPY queue is used, and it does not seem to synchronize this with other
queues as far as I can tell. It is uncertain at this time if D3D12
requires a tile update to synchronize with *every* queue or just the
queue being submitted to. We assume the latter, as it's the only
behavior that makes sense.
It is possible that submitting waits as they are queued up
affects synchronization between queues in unexpected ways.
When separating out the wait operations, everything appears to work.
It is also simpler code.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Fixes a bug in the logic trying to combine the waits by simplifying the code.
Problem discovered by HK.
Signed-off-by: Derek Lesho <dlesho@codeweavers.com>
Even when misusing the API, S_OK is still returned on native runtimes.
Keep the error log, and add an error report to command allocator release
if there are still pending submissions.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
D3D12 has some unfortunate rules around CommandQueue::Wait().
It's legal to release the fence early, before the fence actually
completes its wait operation.
The behavior on D3D12 is just to release all waiters.
For out of order signal/wait, we hold off submissions,
so we can implement this implicitly through CPU signal to UINT64_MAX
on fence release. If we have submitted a wait which depends on the
fence, it will complete in finite time, so it still works fine.
We cannot release the semaphores early in Vulkan, so we must hold on
to a private reference of the ID3D12Fence object until we have observed
that the wait is complete.
To make this work, we refactor waits to use the vkd3d_queue wait list.
On other submits, we resolve the wait. This is a small optimization
since we don't have to perform dummy submits that only perform the wait.
At that time, we signal a timeline semaphore and queue up a d3d12_fence_dec_ref().
Since we're also adding this system where normal submissions signal
timelines, handle the submission counters more correctly by deferring
the decrements until we have waited for the submission itself.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
For internal debug shaders, it is helpful to ensure in-order logs when
sorted for later inspection.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Also be a bit more uniform with using break/return on fail conditions.
Otherwise, the indirect command will read data from the count buffer
instead, which may lead to bugs or GPU hangs.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
A transfer batch can clobber the graphics pipeline, e.g. for depth->color copies.
Hence, flushing the batches after applying the graphics pipeline set by the
app can cause correctness issues.
To prevent that, do the transfer batch flush first before we apply any
render-related states.
Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@gmail.com>
The D3D12 docs outline this as an implementation detail explicitly, so
we should do the same thing.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Implements the most basic iteration where we don't try to take advantage
of index LUT, hoisting CS patching or attempting to reuse application
indirect buffer directly.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Separate scratch pools by their intended usage.
Allows e.g. preprocess buffers to be
allocated differently from normal buffers, which is necessary on
implementations that use special memory types to implement preprocess
buffers.
Potentially can also allow for separate pools for
host visible scratch memory down the line.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
The runtime is specified to validate certain things.
Also, be more robust against unsupported command signatures, since we
might need to draw/dispatch at an offset. Avoids hard GPU crashes.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Transfer batching buffers up CopyTextureRegion calls so they can be batched.
Flushes need to happen in a few places:
1. ResourceBarrier: This is where the transition from COPY_DEST to other
states might happen, at which point the writes must be visible. This might
might happen, at which point the writes must be visible. This might
also transition away from COPY_SRC which invalidates the
precondition.
2. Copy operations. Copies to the same resource are implicitly ordered.
3. Draws and dispatches. These are not strictly necessary, but we don't
want too much command reordering so flushing here seems good.
4. Close. So that we don't throw commands into the void.
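The flush points listed above can be sketched roughly as follows. This is
only an illustration and not the actual vkd3d-proton code: all sketch_*
names and the counter-based "batch" are hypothetical stand-ins for the real
command list state.

```c
/* Hypothetical stand-in for a command list with a pending transfer batch. */
struct sketch_list
{
    unsigned int pending;     /* batched CopyTextureRegion calls, not yet recorded */
    unsigned int executed;    /* copies actually recorded */
    unsigned int flush_count; /* number of non-empty flushes */
};

/* Record all batched copies. Does nothing if the batch is empty. */
static void sketch_flush_transfer_batch(struct sketch_list *list)
{
    if (!list->pending)
        return;
    list->executed += list->pending;
    list->pending = 0;
    list->flush_count++;
}

/* CopyTextureRegion only appends to the batch. */
static void sketch_copy_texture_region(struct sketch_list *list)
{
    list->pending++;
}

/* 1. Barriers may transition away from COPY_DEST / COPY_SRC. */
static void sketch_resource_barrier(struct sketch_list *list)
{
    sketch_flush_transfer_batch(list);
}

/* 2. Other copy operations must stay ordered against batched copies. */
static void sketch_copy_buffer_region(struct sketch_list *list)
{
    sketch_flush_transfer_batch(list);
    list->executed++;
}

/* 3. Draws and dispatches: not strictly required, limits reordering. */
static void sketch_draw(struct sketch_list *list)
{
    sketch_flush_transfer_batch(list);
}

/* 4. Close: make sure nothing is left behind in the batch. */
static void sketch_close(struct sketch_list *list)
{
    sketch_flush_transfer_batch(list);
}
```

Flushing an empty batch is a no-op in this sketch, so calling the flush from
every entry point stays cheap when no copies are pending.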
Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@gmail.com>
A parameter preparation stage, a pre-execution barrier stage, then finally
the execution and post-execution barrier stage.
Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@gmail.com>
Some dynamic state is at risk of being spammed with same arguments many
times. For the dynamic state that is trivial to check, do so.
Ghostwire: Tokyo has been observed to spam the same OMSetStencilRef
value causing some context rolls, also RSSetShadingRate has been set
redundantly.
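A minimal sketch of such a redundancy check, assuming a simplified state
struct. The struct, the dirty flag and the function name are hypothetical
and do not mirror the real vkd3d-proton definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_DIRTY_STENCIL_REFERENCE (1u << 0)

/* Hypothetical, simplified stand-in for the command list dynamic state. */
struct sketch_dynamic_state
{
    uint32_t stencil_reference;
    uint32_t dirty_flags;
};

/* Returns true only if the value changed and the state was marked dirty. */
static bool sketch_set_stencil_ref(struct sketch_dynamic_state *state, uint32_t stencil_ref)
{
    /* Same value as last time: skip, so no redundant state is emitted. */
    if (state->stencil_reference == stencil_ref)
        return false;

    state->stencil_reference = stencil_ref;
    state->dirty_flags |= SKETCH_DIRTY_STENCIL_REFERENCE;
    return true;
}
```

The same compare-then-update pattern would apply to any other dynamic state
that is cheap to compare.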
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Primitive restart is only used for strip primitive types, and must be
ignored for lists. Use and require extended_dynamic_state2 for this
purpose.
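As a rough illustration of the rule, assuming a reduced topology enum (the
sketch_* names are hypothetical, and adjacency strips are left out for
brevity):

```c
#include <stdbool.h>

/* Hypothetical, reduced set of primitive topologies. */
enum sketch_topology
{
    SKETCH_TOPOLOGY_POINT_LIST,
    SKETCH_TOPOLOGY_LINE_LIST,
    SKETCH_TOPOLOGY_LINE_STRIP,
    SKETCH_TOPOLOGY_TRIANGLE_LIST,
    SKETCH_TOPOLOGY_TRIANGLE_STRIP
};

static bool sketch_topology_is_strip(enum sketch_topology topology)
{
    return topology == SKETCH_TOPOLOGY_LINE_STRIP ||
            topology == SKETCH_TOPOLOGY_TRIANGLE_STRIP;
}

/* Primitive restart is only enabled if the PSO requests a strip cut value
 * and the topology bound at draw time is a strip type. The result would
 * then be applied as dynamic state. */
static bool sketch_primitive_restart_enable(enum sketch_topology topology,
        bool pso_has_strip_cut_value)
{
    return pso_has_strip_cut_value && sketch_topology_is_strip(topology);
}
```

Because the topology can change between draws, the decision has to be made
at draw time, which is why a dynamic primitive restart enable is useful here.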
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
{depth,stencil}AttachmentFormat and p{Depth,Stencil}Attachment are only
allowed if the format contains that aspect. Check this explicitly.
Fixes some validation errors.
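The explicit check could look roughly like this. The types and constants
below are local, hypothetical stand-ins so that the sketch compiles without
Vulkan headers. They are not the real Vulkan or vkd3d-proton definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ASPECT_DEPTH     0x1u
#define SKETCH_ASPECT_STENCIL   0x2u
#define SKETCH_FORMAT_UNDEFINED 0

/* Hypothetical format description: a format value plus its aspects. */
struct sketch_format
{
    int vk_format;
    uint32_t aspects;
};

/* Hypothetical stand-in for the depth/stencil part of the rendering info. */
struct sketch_rendering_info
{
    int depth_attachment_format;
    int stencil_attachment_format;
};

/* Only report a depth or stencil format if the format has that aspect.
 * A NULL format means no depth-stencil attachment at all. */
static void sketch_fill_depth_stencil_formats(const struct sketch_format *format,
        struct sketch_rendering_info *info)
{
    info->depth_attachment_format = SKETCH_FORMAT_UNDEFINED;
    info->stencil_attachment_format = SKETCH_FORMAT_UNDEFINED;

    if (!format)
        return;

    if (format->aspects & SKETCH_ASPECT_DEPTH)
        info->depth_attachment_format = format->vk_format;
    if (format->aspects & SKETCH_ASPECT_STENCIL)
        info->stencil_attachment_format = format->vk_format;
}
```

The same aspect test would decide whether the depth and stencil attachment
pointers are passed at all when beginning rendering.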
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
For EXTENDED_USAGE, we still need to restrict image usage when creating
concrete views.
Use VkImageViewUsageCreateInfo to restrict usage flags to the kind of
view we're creating.
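A sketch of how the restricted usage could be derived, assuming four view
kinds. The usage bits and names below are hypothetical local stand-ins so
the example compiles without Vulkan headers. In real code the result would
be written to VkImageViewUsageCreateInfo::usage and chained into the pNext
of VkImageViewCreateInfo.

```c
#include <stdint.h>

/* Hypothetical stand-ins for image usage bits. */
#define SKETCH_USAGE_SAMPLED       0x1u
#define SKETCH_USAGE_STORAGE       0x2u
#define SKETCH_USAGE_COLOR_TARGET  0x4u
#define SKETCH_USAGE_DEPTH_STENCIL 0x8u

/* Hypothetical view kinds, one per D3D12 descriptor type. */
enum sketch_view_kind
{
    SKETCH_VIEW_SRV,
    SKETCH_VIEW_UAV,
    SKETCH_VIEW_RTV,
    SKETCH_VIEW_DSV
};

/* Returns the usage flags a view of the given kind should be limited to. */
static uint32_t sketch_view_usage(enum sketch_view_kind kind, uint32_t image_usage)
{
    uint32_t wanted = 0;

    switch (kind)
    {
        case SKETCH_VIEW_SRV: wanted = SKETCH_USAGE_SAMPLED; break;
        case SKETCH_VIEW_UAV: wanted = SKETCH_USAGE_STORAGE; break;
        case SKETCH_VIEW_RTV: wanted = SKETCH_USAGE_COLOR_TARGET; break;
        case SKETCH_VIEW_DSV: wanted = SKETCH_USAGE_DEPTH_STENCIL; break;
    }

    /* Never request usage the image itself was not created with. */
    return wanted & image_usage;
}
```

Restricting the usage per view matters because a view format may not
support every usage that the image was created with.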
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Found some validation errors where rt_count != rtv_active_mask,
and blending used rt_count instead of rtv_active_mask. If a shader renders
to a NULL attachment, we must make sure that it's part of the PSO
interface.
Also, use rt_count rather than active mask when beginning render pass.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
This is basically required to avoid horrible stutter and poor performance,
and it is widely supported.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
For this case, we want to block and tear down the debug ring thread.
It's okay to fish for dead messages in the ring, since we know there
won't be more GPU work submitted.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
If we expect device loss (breadcrumb debug), we need to use DEVICE uncached/coherent,
since we might not be able to flush GPU caches properly.
We also need to remove the idea of being able to copy out the control
block back to host. This is too brittle, and we should just place
the control block in PCI-e BAR instead. Rethink how we pass messages
from GPU to CPU to make it more robust.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
The spec says that on device loss, the driver must return DEVICE_LOST in finite
time, but this does not happen on NV drivers. Use a long timeout instead
in this scenario.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
AMD path for this commit.
The idea is to automatically instrument markers with command list
information that we can make some sense of in vkd3d-proton.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>