We can mark a descriptor as being SINGLE_DESCRIPTOR, which means we
only need one descriptor copy. This way, we can avoid doing somewhat
expensive work (every nanosecond counts here):
- Bitscan loop
- Read deep into d3d12_device guts (often a cache miss). The memory
index depends on the bitscan result, which causes a pipeline bubble.
When we have a single descriptor, we can just store the binding
information inline and avoid this jank.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Tune the memory layout so that we can deduce various information without
a single pointer dereference:
- d3d12_descriptor_heap*
- heap offset
- Pointer to various side data structures we need to keep around.
Instead of having one big 64-byte data structure with tons of padding,
tune it down to 32 bytes, plus 8 bytes of extra dummy data, per descriptor.
To make all of this work, use a somewhat clever encoding scheme for CPU
VA where lower bits store number of active bits used to encode
descriptor offset. From there, we can mask away bits to recover
d3d12_descriptor_heap. Metadata is stored inline in one big allocation,
and we can just offset from there based on extracted log2i_ceil(descriptor count).
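The encoding scheme can be sketched as follows. This is illustrative C, not the actual vkd3d-proton code: the 6-bit log2 field width, the shift amount, and the alignment requirement on the heap base are assumptions for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the CPU VA encoding described above.
 * Assumption: the heap's backing allocation is aligned to at least
 * 1 << (log2_bits + 6), so the low bits of the base address are free. */

static inline unsigned int log2i_ceil(unsigned int value)
{
    unsigned int bits = 0;
    while ((1u << bits) < value)
        bits++;
    return bits;
}

#define VA_LOG2_MASK 0x3fu /* low 6 bits store the offset bit count */

/* Encode: heap base | (offset << 6) | number of offset bits. */
static inline uintptr_t encode_cpu_va(uintptr_t heap_base,
        unsigned int offset, unsigned int log2_bits)
{
    return heap_base | ((uintptr_t)offset << 6) | log2_bits;
}

/* Mask away the offset and log2 fields to recover the heap pointer. */
static inline uintptr_t decode_heap_base(uintptr_t va)
{
    unsigned int log2_bits = (unsigned int)(va & VA_LOG2_MASK);
    return va & ~(((uintptr_t)1 << (log2_bits + 6)) - 1);
}

/* Extract the descriptor offset from the middle bits. */
static inline unsigned int decode_offset(uintptr_t va)
{
    unsigned int log2_bits = (unsigned int)(va & VA_LOG2_MASK);
    return (unsigned int)((va >> 6) & (((uintptr_t)1 << log2_bits) - 1));
}
```

For a heap of 1000 descriptors, log2i_ceil gives 10 offset bits, so masking away the low 16 bits of the handle recovers the heap pointer without touching memory.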
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
This is a more principled limit since that's the huge page size.
Avoids some allocation spam.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Useful for Intel since Intel hardware cannot support more than 1M
descriptors in general, and opting in to correct behavior should improve
CPU overhead as well when copying descriptors.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
The common path that we really need to optimize for is CBV_SRV_UAV +
Simple + 1 descriptor.
Descriptor benchmark shows an almost 50% reduction in overhead now.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
This became basically a rewrite in the end, and it got too awkward to
split these commits in any meaningful way.
The goals here were primarily to:
- Support serializing SPIR-V and load SPIR-V.
Doing this robustly requires a lot more validation and checks to make
sure we end up compiling the same SPIR-V that we load from the cache.
This is critical for performance when games have primed their pipeline
libraries and expect that loading a PSO should be fast. Without this,
we will hit vkd3d-shader for every PSO, causing very long load times.
- Implement the required validation for mismatched PSO descriptions.
- Rewrite the binary layout of the pipeline library for flexibility
concerns and performance.
If the pipeline library is mmap-ed from disk - which appears to be
the intended use - we only need to scan through the TOC to fully parse
the library contents.
For flexibility, a blob needs to support inlined data,
but a library can use referential links. We introduce separate
hashmaps which store deduplicated SPIR-V and pipeline cache blobs,
which significantly drops memory and storage requirements.
For future improvements, it should be fairly easy to add information
which lets us avoid SPIR-V or pipeline cache data altogether if
relevant changes to Vulkan/drivers are made.
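To illustrate the TOC-driven layout, a minimal sketch follows. The struct names, field widths, and layout here are hypothetical, not the actual vkd3d-proton on-disk format; the point is that blob data is referenced by (hash, offset, size) entries, so parsing a mmap-ed file only requires one pass over the TOC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical on-disk layout: a mmap-ed pipeline library whose table of
 * contents references deduplicated blob data elsewhere in the file. */

struct toc_entry
{
    uint64_t hash;   /* key identifying the blob */
    uint64_t offset; /* byte offset of blob data within the file */
    uint64_t size;   /* blob size in bytes */
};

struct library_header
{
    uint32_t magic;
    uint32_t toc_count;
    /* struct toc_entry toc[toc_count] follows in the file. */
};

/* Look up a blob by hash with a linear scan of the TOC;
 * returns its index, or -1 if not found. */
static int find_blob(const struct toc_entry *toc, uint32_t count,
        uint64_t hash)
{
    for (uint32_t i = 0; i < count; i++)
        if (toc[i].hash == hash)
            return (int)i;
    return -1;
}
```

Since entries only reference data by offset, the blob payloads never need to be touched until a specific PSO is actually requested.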
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Useful when used together with pipeline library logging. Confirms that
we can load pipeline caches as expected.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
For pipeline libraries and DXR to some extent later, we'll need an easy
way to compare root signature objects.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
We did not test the scenario where we first render with depth enabled,
and then bind a NULL DSV with the same pipeline.
Also fix issues when we bind NULL RTVs with the same pipeline bound.
Fixes crash in Guardians of the Galaxy.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
This key represents the variations of SPIR-V which would be generated
from otherwise identical inputs like DXBC blobs and root signatures.
Typically, changing VKD3D_CONFIG flags or enabled extensions will affect
this key. This ensures that we will not attempt to use a cached SPIR-V
file unless we can trust that the SPIR-V interface will match.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Wraps the D3D12 struct with a pipeline library handle.
This is needed if the blob contains references to external data,
which then needs to be resolved.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
D3D12 expects drivers to implicitly synchronize transfer operations,
since there is no TRANSFER barrier analogous to UAV barriers.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
In DEATHLOOP, there is a render pass which renders out a simple image,
which is then directly followed by a compute dispatch reading that
image. The image is still in RENDER_TARGET state, and color buffers are
*not* flushed properly on at least RADV, manifesting as a very
distracting glitch pattern. In particular, when entering the options
menu, highly distracting glitches are observed in the background.
This is a game bug, but for the time being, we have to work around it,
*sigh*.
As a simple workaround, we can detect patterns where we see these
events in succession:
- A color RT render pass is started.
- No barrier with StateBefore == RENDER_TARGET is observed.
- Dispatch() is called.
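The detection heuristic can be sketched as a tiny state machine. The function and struct names here are illustrative, not the vkd3d-proton API; the assumption is that a flush is injected before a dispatch whenever a color RT was written without a subsequent RENDER_TARGET barrier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tracker for the workaround pattern described above. */
struct rt_glitch_tracker
{
    bool rt_pass_active;      /* a color RT render pass was started */
    bool rt_barrier_observed; /* StateBefore == RENDER_TARGET was seen */
};

static void on_color_rt_begin(struct rt_glitch_tracker *t)
{
    t->rt_pass_active = true;
    t->rt_barrier_observed = false;
}

static void on_barrier_from_render_target(struct rt_glitch_tracker *t)
{
    t->rt_barrier_observed = true;
}

/* Returns true if a color flush should be injected before the dispatch. */
static bool on_dispatch_needs_flush(struct rt_glitch_tracker *t)
{
    bool need_flush = t->rt_pass_active && !t->rt_barrier_observed;
    t->rt_pass_active = false;
    return need_flush;
}
```

If the game does transition the RT properly, on_barrier_from_render_target fires and no extra flush is emitted, so well-behaved passes pay nothing.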
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
If we need to fall back in both VRS and non-VRS scenarios, we need to
key on it. Fixes a segfault in DIRT 5 when toggling VRS.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
With RTPSOs we might have to create static sampler sets for local root
signatures. In this case we will have to create a compatible pipeline
layout which is equal to the global pipeline layout, except for an
extra set.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
When allocating dedicated memory, ignore heap_flag requirements we
deduce from memory info. Any memory type is allowed. This is important
on NV when allocating fallback render targets.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Don't attempt to enter memory allocation when we can invalidate a heap
allocation up front. Avoids some dumb edge cases later.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Need to use the fallback pipeline system here.
Keep track of active masks for the PSO and the current render target.
The intersection of those sets is the set of attachments which should
be active in the render pass.
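The mask rule can be stated in one line of C (illustrative sketch; the mask names are not the actual vkd3d-proton identifiers):

```c
#include <assert.h>
#include <stdint.h>

/* The attachments active in a render pass are the intersection of the
 * PSO's render-target mask and the currently bound render-target mask. */
static inline uint32_t active_attachment_mask(uint32_t pso_mask,
        uint32_t rtv_mask)
{
    return pso_mask & rtv_mask;
}
```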
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
For resizable BAR, we don't want to endlessly promote UPLOAD heaps to
BAR since VRAM is precious. The aim is to set a fixed budget where we
can keep allocating until full, at which point we fall back to plain HOST.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
With BAR budgets, what will happen is that:
- A small allocation is requested.
- A new chunk is requested.
- try_suballocate_memory ends up calling allocate_memory, which
allocates a fallback memory type.
- Subsequent small allocations will always end up allocating a new
fallback memory block, never reusing existing blocks.
- System memory is rapidly exhausted once apps start hitting the
budget.
The fix is to add flags which explicitly do not attempt a fallback
allocation. This makes it possible to handle fallbacks at the
appropriate level in try_suballocate_memory instead.
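A minimal sketch of the flag-based fix, under the assumption that a NO_FALLBACK flag makes the inner allocator fail rather than silently pick a fallback type (the flag name, struct, and signatures are illustrative, not the real vkd3d-proton code):

```c
#include <assert.h>
#include <stdbool.h>

#define ALLOC_FLAG_NO_FALLBACK 0x1u

struct chunk { bool is_fallback; };

/* Toy allocate_memory: succeeds in the preferred type unless it is
 * exhausted; with NO_FALLBACK set it fails instead of falling back. */
static bool allocate_memory(bool preferred_type_full, unsigned int flags,
        struct chunk *out)
{
    if (!preferred_type_full)
    {
        out->is_fallback = false;
        return true;
    }
    if (flags & ALLOC_FLAG_NO_FALLBACK)
        return false;
    out->is_fallback = true;
    return true;
}

/* The suballocator now owns the fallback decision, so fallback chunks
 * go through the same reuse path as preferred-type chunks. */
static bool try_suballocate_memory(bool preferred_type_full,
        struct chunk *out)
{
    /* First attempt: a new chunk in the preferred type only. */
    if (allocate_memory(preferred_type_full, ALLOC_FLAG_NO_FALLBACK, out))
        return true;
    /* Fallback handled here, at the level that caches chunks for reuse. */
    return allocate_memory(preferred_type_full, 0, out);
}
```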
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
We will need to consider some form of budgeting, so make sure that all
allocation and freeing is done in a central place.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Need to consider that, based on host visibility requirements, we need
to select either LINEAR or OPTIMAL image tiling, and those tiling
modes can have different memory requirements.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Need to initialize the set mask so that copies happen properly
on default-initialized descriptors. Also, move the current_null_type to
metadata so that it's properly copied on descriptor copy.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
There are titles clearing the same descriptors constantly.
This leads to unnecessary updates that can become costly.
This commit introduces a new flag to track when D3D12 descriptors are
not null, and skips clearing them if they are already null.
Descriptors are assumed to be null by default.
This fixes a performance regression introduced by
9983a1720f
Signed-off-by: Rodrigo Locatti <rlocatti@nvidia.com>
Emitting render pass clears while we're in the process of starting
a render pass overrides dsv layout tracking info.
Signed-off-by: Philip Rebohle <philip.rebohle@tu-dortmund.de>
Get information directly from vkd3d_format and allow for subsampled
formats in the future.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Goal here is to avoid unnecessary image layout transitions when render
passes toggle depth-stencil PSO states. Since we cannot know which
states a resource is in, we have to be conservative, and assume that
shader reads *could* happen.
The best effort we can do is to detect when writes happen to a DSV
resource. In this scenario, we can deduce that the aspect cannot be
read, since DEPTH_WRITE | RESOURCE state is not allowed.
To keep the tracking somewhat sane, we only promote to OPTIMAL if an
entire image's worth of subresources for a given aspect is transitioned.
The common case for depth-stencil images is 1 mip / 1 layer anyway.
Some other changes are required here:
- Instead of common_layout for the depth image, we need to consult the
command list, which might promote the layout to optimal.
- We make use of render pass compatibility rules which state that we can
change attachment reference layouts as well as initial/finalLayout.
To make this change, a pipeline will fill in a
vkd3d_render_pass_compat struct.
- A command list has a dsv_plane_optimal_mask which keeps track
of the plane aspects we have promoted to OPTIMAL, and we know cannot
be read by shaders.
The desired optimal mask is (existing optimal | PSO write).
The initial existing optimal is inherited from the command list's
tracker.
- RTV/DSV/views no longer keep track of VkImageLayout. This is
unnecessary since we always deduce image layout based on context.
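The optimal-mask rule above reduces to a bitwise union over plane aspects (illustrative sketch; the plane bit values are assumptions, not the actual vkd3d-proton constants):

```c
#include <assert.h>
#include <stdint.h>

#define DSV_PLANE_DEPTH   0x1u
#define DSV_PLANE_STENCIL 0x2u

/* Desired optimal plane mask = planes already tracked as OPTIMAL
 * unioned with the planes the bound PSO writes. */
static inline uint32_t desired_dsv_optimal_mask(uint32_t existing_optimal,
        uint32_t pso_write_mask)
{
    return existing_optimal | pso_write_mask;
}
```

Since a promoted plane can never be demoted within the command list, the mask only grows, which is what makes the per-list tracking cheap.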
Overall, this shows a massive gain in the HZD benchmark (RADV, 1440p
Ultimate preset: ~16% FPS on RX 6800).
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
The idea is to keep track of scenarios where a resource's aspect is
known to be in an OPTIMAL state. Based on this, we can override the
image layout from the common_layout in order to avoid unnecessary full
barriers.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Not correct, will need spec additions to handle it properly.
Fixes ground rendering in DIRT 5.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
- Honor resource barriers for resource states which cannot automatically
decay or promote. This includes COLOR_ATTACHMENT, UNORDERED_ACCESS and
VRS image. If SIMULTANEOUS_ACCESS is used, we can still promote, and
we handle that by setting common layout to GENERAL for these resources.
- Avoid redundant barriers in render passes since normal resource
barriers will always make sure we are already in
COLOR_ATTACHMENT_OPTIMAL.
- Do not force GENERAL layout if resource has UNORDERED_ACCESS flag set.
As this is not a promotable state, we have to explicitly transition
into it. I tested this on validation layers, where even COMMON state
refuses to promote to UAV state. The exception here, of course, is
SIMULTANEOUS_ACCESS, but we handle that properly now.
- Verify that UAV or SIMULTANEOUS access is not used together with DSV
state. This is explicitly banned in the API docs.
- Actually emit image barriers. Batch the image transitions, as that's
what the D3D12 docs encourage app developers to do, and they also
expect that drivers can optimize this. Ensure that we respect the
in-order resource barrier rules by splitting batches if there are
overlaps in the transitions.
- Ensure that correct image layout is used when clearing a suspended
render pass attachment.
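The batch-splitting rule can be sketched like this (illustrative only; the structs, the batch size, and the flush hook are assumptions, not the actual vkd3d-proton implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BATCH 16

struct transition { unsigned int resource_id, subresource; };

struct barrier_batch
{
    struct transition pending[MAX_BATCH];
    size_t count;
    unsigned int flushes; /* number of times the batch was submitted */
};

/* Does the new transition touch a subresource already in the batch? */
static bool batch_overlaps(const struct barrier_batch *b,
        const struct transition *t)
{
    for (size_t i = 0; i < b->count; i++)
        if (b->pending[i].resource_id == t->resource_id &&
                b->pending[i].subresource == t->subresource)
            return true;
    return false;
}

static void batch_flush(struct barrier_batch *b)
{
    if (b->count)
    {
        /* A real implementation would emit one vkCmdPipelineBarrier here. */
        b->flushes++;
        b->count = 0;
    }
}

/* Batch transitions, but flush first on overlap so barriers touching
 * the same subresource execute in submission order. */
static void batch_add(struct barrier_batch *b, struct transition t)
{
    if (b->count == MAX_BATCH || batch_overlaps(b, &t))
        batch_flush(b);
    b->pending[b->count++] = t;
}
```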
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Avoid using the separate layouts if we're only using formats with a
single aspect. This makes it more likely to match the common layout,
and we can avoid awkward transition barriers.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
Some games end up writing the wrong descriptor type when using null
descriptors, and to be robust against that, we have to clear out
all descriptors when creating null descriptors.
If we copy a null descriptor, we will also have to copy from all sets.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>