Allows local root signatures to work correctly and is also a good
optimization since we no longer need to dereference memory (potentially
cold cache lines) to figure out heap offset in command buffer.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
If we're signalling and waiting on the same physical queue (always true for
the current SINGLE_QUEUE define), we can rely on submission boundary
synchronization which doesn't require any extra submissions to resolve.
Avoids awkward GPU driver bubbles with back-to-back signal -> wait pairs
with timeline semaphores.
Observed 2% GPU uplift on RE2 on AMD.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Need it here since local root signatures need to know
the physical layout of the record buffer up front.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Fixes a validation error. With VK_QUERY_RESULT_64_BIT we need
to use 8-byte alignment, but ssbo_alignment may be less.
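A minimal sketch of the alignment rule, assuming a power-of-two
ssbo_alignment. The helper names align_up and query_result_offset are
hypothetical and only illustrate taking the larger of the two alignments:

```c
#include <assert.h>
#include <stdint.h>

/* Round value up to the next multiple of alignment (power of two only). */
static uint64_t align_up(uint64_t value, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: 64-bit query results need at least 8-byte alignment,
 * even when the device's SSBO alignment is smaller. */
static uint64_t query_result_offset(uint64_t offset, uint64_t ssbo_alignment)
{
    uint64_t alignment = ssbo_alignment > sizeof(uint64_t) ? ssbo_alignment : sizeof(uint64_t);
    return align_up(offset, alignment);
}
```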
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
No longer requires BDA support since it's easier now to work
around buffer alignment issues.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
The first range will store the byte offset, the second one will
be the typed buffer range. Typed descriptors should write both.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Co-authored-by: Hans-Kristian Arntzen <post@arntzen-software.no>
We currently never reset occlusion queries. For some reason,
validation layers do not report this.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
By resetting query pools in advance, we can reduce the number of
stalls between draw calls in passes with occlusion queries, which
is currently causing serious performance issues in some games.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Since we'll be inserting lots of single queries, we want to
avoid having to resize the range array since that is an O(n)
operation at worst.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Official AMD drivers do not support VK_EXT_conditional_rendering,
so we'll use indirect draws instead to emulate the feature.
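A CPU-side sketch of the idea, not the actual shader: the 64-bit predicate
is reduced to a boolean, and a false predicate turns the indirect draw into
a no-op by zeroing its instance count. The struct and function names are
hypothetical, and the predicate is assumed to mean "draw if non-zero":

```c
#include <stdint.h>

/* Same field order as a non-indexed indirect draw command. */
struct draw_args
{
    uint32_t vertex_count;
    uint32_t instance_count;
    uint32_t first_vertex;
    uint32_t first_instance;
};

/* Collapse a 64-bit predicate to a 32-bit boolean so that set bits in the
 * upper half are not lost by a 32-bit read. */
static uint32_t resolve_predicate(uint64_t value)
{
    return value != 0 ? 1u : 0u;
}

/* Return the arguments to execute: unchanged if the predicate passes,
 * otherwise a draw with zero instances. */
static struct draw_args predicate_draw(struct draw_args args, uint64_t predicate)
{
    if (!resolve_predicate(predicate))
        args.instance_count = 0;
    return args;
}
```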
This also handles 64-bit predicates in combination with the
Vulkan extension, which was not possible previously.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Potentially avoids some unnecessary host memory access. Use BDA for
the compute shader so that we can ignore alignment restrictions on
some GPU architectures.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Command lists may need to allocate temporary device memory for
certain operations. In order to avoid frequent alloc/free calls,
we'll recycle these scratch buffers until a certain threshold.
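A rough sketch of such a recycling scheme, using malloc/free as a stand-in
for device memory. The pool size SCRATCH_POOL_MAX and all names below are
assumptions made for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SCRATCH_POOL_MAX 4 /* hypothetical recycling threshold */

struct scratch_buffer
{
    void *memory;
    size_t size;
};

struct scratch_pool
{
    struct scratch_buffer buffers[SCRATCH_POOL_MAX];
    size_t count;
};

/* Reuse a pooled buffer that is large enough, otherwise allocate a new one. */
static bool scratch_pool_acquire(struct scratch_pool *pool, size_t size, struct scratch_buffer *out)
{
    size_t i;

    for (i = 0; i < pool->count; i++)
    {
        if (pool->buffers[i].size >= size)
        {
            *out = pool->buffers[i];
            /* Fill the hole with the last entry. */
            pool->buffers[i] = pool->buffers[--pool->count];
            return true;
        }
    }

    out->memory = malloc(size);
    out->size = size;
    return out->memory != NULL;
}

/* Keep the buffer for reuse until the threshold is hit, then free it. */
static void scratch_pool_release(struct scratch_pool *pool, struct scratch_buffer buffer)
{
    if (pool->count < SCRATCH_POOL_MAX)
        pool->buffers[pool->count++] = buffer;
    else
        free(buffer.memory);
}
```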
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Realign VBO strides and offsets if we have to, for the sake of
robustness. Violating these rules is against D3D12 spec, but it does not
cause crashes on native drivers. On RDNA we can hit hangs with unaligned
vertex attributes. It appears that native drivers apply some kind of
fixup here to avoid the crash, even if the result is not what we expect.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Ensures that queries are always available and initialized
in the correct order on the GPU timeline.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
We have observed a lot of large GPU bubbles when using back-to-back
timeline semaphores to synchronize GPU submissions. Use prebaked
pipeline barrier command buffers instead.
To handle queue sparse serialization, use two binary semaphore pairs.
There is no need to use timeline semaphores in this case.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
USE_PUSH_DESCRIPTORS may be misleading since it would be set even when
we're not using push descriptors at all due to root descriptors being
passed in via VAs. Instead, make the flag represent whether or not we
use a regular descriptor set for root parameters.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
The packed descriptor index is no longer needed, and it causes issues when
a game sets a root signature, binds a root descriptor, and then sets a
different root signature which maps the same root parameter index to a
different descriptor. In that case we may read undefined data when
updating push descriptors.
Fixes #366.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
The struct definitions were identical anyway, and unifying
these will prevent unnecessary code duplication.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
The only currently known use case for this requires us to actually
perform the dispatch operation. Executing more than one indirect
dispatch command is not meaningful, however there might be
differences in behaviour in case the indirect count is zero.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
This logic has to be the same as in d3d12_command_list_update_descriptor_table_offsets,
since not all active descriptor tables are necessarily used by the root signature.
Fixes an assert in the StarsX IrradianceMap demo (GitHub issue #347).
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
We will not have offset information for root descriptors, so
we can still only use them with four-byte aligned SSBOs.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
We cannot rely on alignment analysis since games are buggy and screw up
RAW vs structured on occasion.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Relevant for swapchain since a swapchain resource can be presented right
away without ever having been touched by an API call.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
It is broken by design and won't be needed by a swapchain
implementation which uses user buffers.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Buffer views do not necessarily cover the entire resource, so we
should not spawn more workgroups than necessary to clear the view.
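The dispatch size then follows from the view's element count rather than
the resource size. A small sketch, where CLEAR_WORKGROUP_SIZE and the
function name are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define CLEAR_WORKGROUP_SIZE 128u /* hypothetical threads per workgroup */

/* Number of workgroups needed to cover every element of the view,
 * rounding up so a partial final group is still dispatched. */
static uint32_t clear_workgroup_count(uint64_t view_element_count)
{
    uint64_t count = (view_element_count + CLEAR_WORKGROUP_SIZE - 1) / CLEAR_WORKGROUP_SIZE;
    assert(count <= UINT32_MAX);
    return (uint32_t)count;
}
```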
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
This is no longer performance-critical, so in order to simplify changing
the binding model, remove hard-coded descriptor set numbers and instead
look them up based on the requested descriptor properties.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Ignore any indexed draw calls which use a NULL index buffer.
This is not fully correct, but there is no easy way to emulate D3D12
behavior exactly.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
For correctness, we will need to defer any initial resource state
handling to the queue timeline. Here, we will build an UNDEFINED ->
common layout barrier if (and only if):
- The resource is marked to care about initial layout transition.
- We are the first queue thread to observe that initial_transition
member is 1 (atomic exchange).
- The first use of the resource was not marked to be a discard.
E.g., if the first use of the resource is an alias barrier, we must
not emit an early barrier. The only thing we should do here is to clear the
initial_transition member, and leave it like that.
A command list maintains a list of d3d12_resources which *might* need a
transition. For the first frame a resource is used (or so), it will not
have the flag cleared yet, so multiple command lists might add the
d3d12_resource to its own transition list. This is fine, as the queue
will resolve it.
If multiple queues see the same initial transition, there might be
shenanigans, but the application must ensure there is either a
submission boundary or fence boundary between the uses. Any initial
layout transition will only be submitted after a Wait() is observed, as
submission of the transition command buffer will be in-order with other
submissions.
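The "first queue thread to observe" rule can be sketched with a C11 atomic
exchange. This is an illustration only: the struct is a stand-in for
d3d12_resource, the function name is hypothetical, and a value of 1 in
initial_transition is assumed to also mean "cares about the transition":

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the relevant part of d3d12_resource. */
struct resource
{
    atomic_uint initial_transition;
};

/* Returns true if the caller must emit the UNDEFINED -> common barrier.
 * The exchange always clears the flag, so only one observer can see 1.
 * A discarding first use clears the flag but emits no barrier. */
static bool resource_needs_initial_barrier(struct resource *resource, bool first_use_is_discard)
{
    unsigned int was_pending = atomic_exchange(&resource->initial_transition, 0u);
    return was_pending == 1 && !first_use_is_discard;
}
```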
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
An optimization and a requirement in D3D12. Clearing out an image
through a copy is considered enough to satisfy the requirement to acquire an
alias in the advanced usage model.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
When building natively on Windows we use dllexport/dllimport for vkd3d/vkd3d_utils public exports.
When building natively on Linux we simply make those visibility default.
Nothing changes for standalone here.
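The usual shape of such an export macro is sketched below. EXAMPLE_BUILD,
EXAMPLE_EXPORT and example_get_version are placeholder names, not the
macros used by the project:

```c
/* Defined only while compiling the library itself (normally by the build system). */
#define EXAMPLE_BUILD 1

#if defined(_WIN32)
# ifdef EXAMPLE_BUILD
#  define EXAMPLE_EXPORT __declspec(dllexport) /* building the DLL */
# else
#  define EXAMPLE_EXPORT __declspec(dllimport) /* consuming the DLL */
# endif
#elif defined(__GNUC__)
# define EXAMPLE_EXPORT __attribute__((visibility("default")))
#else
# define EXAMPLE_EXPORT
#endif

/* Public entry point, visible even when built with -fvisibility=hidden. */
EXAMPLE_EXPORT int example_get_version(void);

int example_get_version(void)
{
    return 1;
}
```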
Closes #152
Signed-off-by: Joshua Ashton <joshua@froggi.es>
... if we have dirty vbo slots left.
Fixes textures when inspecting items in the inventory in RE2 and RE3.
Signed-off-by: Robin Kertels <robin.kertels@gmail.com>
There is no resource state associated with this, so emit the barrier at
the end of a command buffer based on trivial tracking.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Need to handle large (> 4G) jumps in timeline value, which is not
supported by all implementations.
There is no good way to handle that, so rewrite and clean up timeline
semaphore handling by separating the timeline into a virtual timeline
(which can rewind and jump around arbitrarily) and a physical timeline
which increments by one each time.
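One possible shape of that split, as a sketch rather than the actual
implementation: every signal gets the next physical value and remembers
the virtual value it stands for, and a wait is translated to the physical
value of the first recorded signal that satisfies it. All names and the
fixed-size table are assumptions:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING_SIGNALS 16 /* hypothetical table size */

struct timeline_entry
{
    uint64_t virtual_value;  /* value the application signalled */
    uint64_t physical_value; /* value signalled on the real semaphore */
};

struct virtual_timeline
{
    uint64_t physical_counter;
    struct timeline_entry entries[MAX_PENDING_SIGNALS];
    size_t entry_count;
};

/* Record a signal. The physical value grows by exactly one no matter how
 * far the virtual value jumps or rewinds. Returns 0 if the table is full. */
static uint64_t timeline_record_signal(struct virtual_timeline *timeline, uint64_t virtual_value)
{
    struct timeline_entry *entry;

    if (timeline->entry_count == MAX_PENDING_SIGNALS)
        return 0;

    entry = &timeline->entries[timeline->entry_count++];
    entry->virtual_value = virtual_value;
    entry->physical_value = ++timeline->physical_counter;
    return entry->physical_value;
}

/* Find the physical value to wait on: the earliest recorded signal whose
 * virtual value is at least the requested one. */
static bool timeline_find_wait(const struct virtual_timeline *timeline,
        uint64_t virtual_value, uint64_t *physical_value)
{
    size_t i;

    for (i = 0; i < timeline->entry_count; i++)
    {
        if (timeline->entries[i].virtual_value >= virtual_value)
        {
            *physical_value = timeline->entries[i].physical_value;
            return true;
        }
    }
    return false;
}
```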
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Useful to measure submission times, as well as time spent acquiring the
Vulkan queues. This correlates 1:1 with swapchain as well, so it's
useful when we want to get some "X / frame" metrics.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Manually uses QPC if the Vulkan implementation does not support
the QPC domain by itself.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>