Adds the "upload_hvv" config flag, which will make D3D12_HEAP_TYPE_UPLOAD attempt to use host-visible VRAM for allocations.
This takes advantage of large or resizable BAR if available.
I see a perf delta of 83-84 -> 92-94 (~12%) when using this in Horizon Zero Dawn.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
FloatControlProperties struct appears to be broken, and it does seem to
work just fine.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
WaveMatch and WaveMultiPrefix are implemented and pass test.
Other features are gated behind feature bits.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
From native testing, we can expose higher shader models if
cap bits features are not supported. E.g. Polaris exposes SM 6.5, even
when 16-bit and barycentrics are not supported.
With latest dxil-spirv updates we can support the required SM 6.4
features.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
This was passing through flags of the root signature not the shader interface flags of it.
Need to get the shader interface flags of the root signature instead.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Goal here is to avoid unnecessary image layout transitions when render
passes toggle depth-stencil PSO states. Since we cannot know which
states a resource is in, we have to be conservative, and assume that
shader reads *could* happen.
The best effort we can do is to detect when writes happen to a DSV
resource. In this scenario, we can deduce that the aspect cannot be
read, since DEPTH_WRITE | RESOURCE state is not allowed.
To make the tracking somewhat sane, we only promote to OPTIMAL if an
entire image's worth of subresources for a given aspect is transitioned.
The common case for depth-stencil images is 1 mip / 1 layer anyways.
Some other changes are required here:
- Instead of common_layout for the depth image, we need to consult the
command list, which might promote the layout to optimal.
- We make use of render pass compatibility rules which state that we can
change attachment reference layouts as well as initial/finalLayout.
To make this change, a pipeline will fill in a
vkd3d_render_pass_compat struct.
- A command list has a dsv_plane_optimal_mask which keeps track
of the plane aspects we have promoted to OPTIMAL, and we know cannot
be read by shaders.
The desired optimal mask is (existing optimal | PSO write).
The initial existing optimal is inherited from the command list's
tracker.
- RTV/DSV/views no longer keep track of VkImageLayout. This is
unnecessary since we always deduce image layout based on context.
Overall, this shows a massive gain in HZD benchmark (RADV, 1440p ultimate, ~16% FPS on RX 6800).
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Idea is to keep track of scenarios where we know a resource's aspect is
known to be in a OPTIMAL state. Based on this, we can override the image
layout from the common_layout in order to avoid unnecessary full
barriers.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
For copies, we can always use the intended aspects, since we have
separate DS layouts now.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
When clearing a DSV, we must get aliasing guarantees, so we must
transition away from UNDEFINED. This is only possible when using
separate_ds_layouts and for render pass clears we need to use
renderpass2 mechanisms to do this.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
The clamp is absolute, not relative to baseMip. Also avoids validation
error and potential crash when LODClamp > numLevels.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Not correct, will need spec additions to handle it properly.
Fixes ground rendering in DIRT 5.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
GravityBench ends up using ClearView with too large dimensions.
This is a validation error in Vulkan, so just clamp the extents.
To make full rect detection a bit more robust, do a range check instead
of memcmp().
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
If we're doing a layout transition of depth-stencil aspects, we need to ensure all potential
accesses are made visible.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
- Honor resource barriers for resource states which cannot automatically
decay or promote. This includes COLOR_ATTACHMENT, UNORDERED_ACCESS and
VRS image. If SIMULTANEOUS_ACCESS is used, we can still promote, and
we handle that by setting common layout to GENERAL for these resources.
- Avoid redundant barriers in render passes since normal resource
barriers will always make sure we are already in
COLOR_ATTACHMENT_OPTIMAL.
- Do not force GENERAL layout if resource has UNORDERED_ACCESS flag set.
As this is not a promotable state, we have to explicitly transition
into it. I tested this on validation layers, where even COMMON state
refuses to promote to UAV state. The exception here of course is
SIMULTANOUS_ACCESS, but we handle that properly now.
- Verify that UAV or SIMULTANEOUS access is not used together with DSV
state. This is explicitly banned in the API docs.
- Actually emit image barriers. Batch the image transitions as that's
what D3D12 docs encourage app developers to do, and it also expects
that drivers can optimize this. Ensure that we respect the in-order
resource barrier rules by splitting batches if there are overlaps in
the transitions.
- Ensure that correct image layout is used when clearing a suspended
render pass attachment.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Avoid using the separate layouts if we're only using formats with one
aspects. This makes it more likely to match layouts with common layout,
and we can avoid awkward transition barriers.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
The spec is pretty clear that it's invalid usage. Return E_INVALIDARG
like native drivers.
This is a workaround for the inventory GPU hang with Cyberpunk 2077
which is actually a game bug. Luckily the game handles this error
properly.
The problem is that the game always assume that an image with 2 mips
is smaller than the same image but with 6 mips. This is not always
true if the swizzle mode is different and a recent Mesa update changed
that. Then the game creates a D3D12 heap that is too small and this
triggered a memory violation and then a GPU hang with RADV.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Our internal copy shaders are fine, but we get benign errors about
sample count being wrong since we alias descriptors.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
We cannot use the memory requirement output, since we will zero-clear
memory with a size that might be larger than the VkBuffer size.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Some games end up writing the wrong descriptor type when using null
descriptors, and to be robust against that, we have to clear out
all descriptors when creating null descriptors.
If we copy a null descriptor, we will also have to copy from all sets.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
In cases where acquire image is blocking, we should call that after
presentation to avoid latency when the app calls present.
This avoids weird inverse frame cadences with Mesa WSI right now,
as acquiring an image is always a blocking call until it is complete.
In cases when we aren't blocking, this kicks off the acquisition so
it can be waited upon by the next present blit pass.
Use another set of semaphores to wait for the image acquisition on the
GPU.
In the non-blocking vkAcquireNextImageKHR case, this means that a
potential bubble of time between waiting on the fence and submitting
the blit + presentation is eliminated.
Runaway presentation in this setup is avoided by frame latency objects
and normal frame latency which is always 3 according to documentation.
Be careful about handling SUBOPTIMAL. Semaphores will be signaled, but
we might want to tear down the swapchain. In these cases, we need to
wait for the semaphore to be signaled first, which can only be done by
submitting a wait, since QueueWaitIdle or DeviceWaitIdle don't cover
WSI.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Co-authored-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Documentation says that this should always be 3 without WAITABLE_OBJECT
unlike in D3D11 where it will use the DXGI device's frame latency.
This stops runaway presentations in the non-blocking acquire image case
with the new semaphore setup.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Fixes test TODOs. Apparently Vulkan drivers can saturate here, which
caused the TODO to appear, at least on AMD Windows.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Roll this into vkd3d_device_create_info, no need for this to be a pNext thing.
Additionally, fix some memory leaks on device creation failure.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Roll this into vkd3d_instance_create_info, no need for this to be a pNext thing.
Additionally, fix some memory leaks on instance creation failure.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Needed so we can switch between having a VRS and non-VRS attachment on the fly.
Extensible enough for this to work for other things down the line also.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
C is fun, yo. Returned data from dead stack variable, also triggered
overflow in some cases.
Uncalled in release mode, but can crash debug builds.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Otherwise we can do an alligned_malloc with a non-aligned size as the descriptor size is 48 for a d3d12_rtv_desc otherwise.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
This can happen if the fence thread starts with a delay and
the queue gets destroyed shortly after being created.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
There isn't much of a reason why we should have to do this. The original
implementation was more of a hack if anything.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Increment physical value one by one, find the exact timeline value we're
supposed to signal and perform the update.
Select lowest physical timeline value correctly.
Array can be reordered now, so lowest value isn't necessarily first.
Fixes some super weird hangs in Control DXR.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Rather than one per device. This solves issues with D3D12 fences
being signalled too late because the fence worker is waiting on
a different set of semaphores while the fence is being enqueued.
Greatly increases performance in Horizon Zero Dawn and Death
Stranding with multi-queue mode enabled.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
This will be necessary once we introduce fence workers per
command queue, since we cannot reliably store pointers to
queues.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Replaces d3d12_device_get_vkd3d_queue when mapping D3D12
command queues to Vulkan device queues.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
The implemnentation is not complete enough to safely enable it, since
some games will try to create RTPSOs by default, leading to crashes.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
It's really supposed to load 4 components and ignore. RGB16 is not
mandatory, so just use the "expected" formats after all.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Compacts ranges and only issues one bind for buffer ranges and
full subresource updates, rather than one bind per tile.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Bindless CBV is *pretty* bad on NVIDIA, so add a code path which can
promote descriptor table CBVs into push descriptors.
We can safely do this with Root Signature 1.1 STATIC or
the somewhat obscure STATIC_KEEPING_BUFFER_BOUNDS_CHECKS.
With VOLATILE, which basically all titles are using,
we can still force this behavior through a config flag,
but this is an incorrect speed hack. It works in most
titles however, since bindless CBV is exceptionally rare.
We only hoist descriptors when the root signature range has 1 descriptor
anyway, so we should avoid any reasonable bindless scenario.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
There is no need to scan through the Vulkan format list,
especially since texel buffer creation happens in the hot path
in cases where we know we need to create R32UI texel buffer views.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
We might have to emit to different bind point than our binding entry
suggests due to DXR, so pass down information explicitly to leaf
functions.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Refactor push constant invalidation to SetPipelineState,
it is technically more correct to only invalidate when actually pushing
constants, but we need to do full state invalidation when transitioning
between RT pipelines and non-RT pipelines due to bind point aliasing
shenanigans in D3D12, so it makes more sense to invalidate state based
on active bind point there.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Enabling VK_EXT_debug_utils comes at some overhead in Wine due to the object tracking required. There is also likely a non-zero overhead in some native implementations also.
By enabling this conditionally, we can also avoid additional overhead from apps that set debug labels on both the Vulkan and front-end side.
The default condition is to enable it when building with Renderdoc integration or in debug builds.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
This thing has no right to exist.
We don't get this information in D3D12 and it's getting in the way of me refactoring config flags.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Simplifies this to make it easier to add new properties/features
so we don't have a bunch of pointers to things that are just a child
of the device info structure.
Fixes warnings when compiling without traces too.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Gives a massive boost on NVIDIA for some reason.
RADV defers push constant update, so ALL_STAGES doesn't have
that much of a perf hit.
~20% uplift in RE2, ~5% uplift in CP77 from some quick and dirty testing.
Seems to be heavily content dependent either way.
Also a bug fix, since we would clobber graphics push constants from
compute and vice versa if both graphics and compute used the same root
signature.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
As per MSDN, SetName is just a wrapper around SetPrivateData and a specific GUID.
Some apps and tools will use this to retrieve their name back.
So instead, just forward the name to Vulkan in the SetPrivateData call.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Instead, infer the required stages from the D3D12 shader visibility
field from all root parameters that we map to push constants.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>