Shared memory is local to CTA, thus we should only wait for
prior memory writes which are visible to other threads in
the same CTA, and not at global level. This should speedup
compute shaders which use shared memory.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
Only one face of Cubetextures was locked when in DEFAULT Pool.
Fixes:
https://github.com/iXit/Mesa-3D/issues/129
CC: "12.0 13.0" <mesa-stable@lists.freedesktop.org>
Signed-off-by: Axel Davy <axel.davy@ens.fr>
In the format fallback path,
the height was used instead of the depth.
CC: "12.0 13.0" <mesa-stable@lists.freedesktop.org>
Signed-off-by: Axel Davy <axel.davy@ens.fr>
We are not sure exactly what needs to be 0 initialized,
but we are missing some cases. 0 initialize all our current
aligned allocation.
Fixes Tree of Savior visual issues.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Add implementation for align_calloc,
which is align_malloc + memset.
v2: add if (ptr) before memset.
Fix indentation.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Leak introduced by:
a83dce0128
The patch also moves the part to
release changed.vs_const_i and changed.vs_const_b
before the if (!cb.buffer_size) check,
to avoid reuploading every draw call if
integer or boolean constants are dirty, but the shaders
use no constants.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
CC: "13.0" <mesa-stable@lists.freedesktop.org>
nvdisasm does not print a .S even though the bit is set.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
radeonsi also does the same thing. I suspect that this is likely to be a
no-op in reality, but it brings nouveau code closer to what the blob
produces. Plus it makes sense to not try to do auto-derivatives on this.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
This allows the driver to signal that it can't handle random
interleaving of attributes across buffers. This is required for
ARB_transform_feedback3, and it's initialized to whatever the previous
value of PIPE_CAP_STREAM_OUTPUT_PAUSE_RESUME was except for nv50 where
it is disabled. Note that the proprietary drivers never expose
ARB_transform_feedback3 on any GT21x's (where nouveau previously did),
and after some effort I was unable to get it to work.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Getting stores to NIR regs to not generate new MOVs is tricky, since the
result we're trying to store into the NIR reg may have been from a
conditional update of a temp, or a series of packed writes. The easiest
solution seems to be to require that nir_store_dest()'s arg comes from an
SSA temp.
This causes us to put in a few more temporary MOVs in the NIR SSA dest
case, but copy propagation successfully cleans those up.
The shader-db change is modest:
total instructions in shared programs: 93774 -> 93598 (-0.19%)
instructions in affected programs: 14760 -> 14584 (-1.19%)
total estimated cycles in shared programs: 212135 -> 211946 (-0.09%)
estimated cycles in affected programs: 27005 -> 26816 (-0.70%)
but I was seeing patterns in some register-allocation failures in DEQP
tests that looked like the extra MOVs would increase maximum register
pressure in loops. Some debug code indicates that that's not the case,
though I'm still a bit confused by that result.
Now we aren't limited to 256MB total allocated across a driver instance,
just 256MB at one time. We're still copying in and out, which should get
fixed.
Rather than having simulator mode changes scattered around vc4_bufmgr.c
and vc4_screen.c, make vc4_bufmgr.c just call a vc4_simulator_ioctl, which
then dispatches to a corresponding implementation.
This will give the simulator support a centralized place to do tricks like
storing most BOs directly in simulator memory rather than copying in and
out.
This leaves special casing of mmaping BOs and execution, because of the
winsys mapping.
The loop is scanning until the original max_ip (size of the BO), but we
want to not examine any code after the PROG_END's delay slots. There was
a block trying to do that, except that we had some early continue
statements if the signal wasn't a PROG_END or a BRANCH.
The failure mode would be that a valid shader is rejected because some
undefined memory after the PROG_END slots is parsed as a branch and the
rest of its setup is illegal. I haven't seen this in the wild, but
valgrind was complaining and the new userland simulator code started
triggering it.
A constant value of float type is not necessarily a ConstantFP: it could also
be a constant expression that for some reason hasn't been folded.
This fixes a regression in GL45-CTS.arrays_of_arrays_gl.InteractionFunctionCalls2
that was introduced by commit 3ec9975555.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
This reverts commits 1af0641db3 and
a6ad49cbbd.
st/mesa adjusts the rasterizer state for us now.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Long story short, 3D and CP are aliased on Fermi and initializing
compute after pushing the MS sample coordinate offsets seems to
corrupt 3D state for weird reasons.
I still don't have the faintest clue what is going on, but
this seems to only affect Fermi generation. A possible fix
could be to use two different channels, one for 3D and one
for CP.
This fixes a bunch of regressions pinpointed by piglit.
Fixes: "nvc0: fix up image support for allowing multiple samples"
Cc: "13.0" <mesa-stable@lists.freedesktop.org>
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
This makes shader-db reports results for compute shaders.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
With ARB_gpu_shader5, texture offsets can be any source, including TEMPs
and IN's. Make sure to process them as regular sources so that we pick
up masks, etc.
This should fix some CTS tests that feed offsets directly to
textureGatherOffset, and we were not picking up the input use, thus not
advertising it in the shader header.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Tested-by: Dave Airlie <airlied@redhat.com>
Cc: 12.0 13.0 <mesa-stable@lists.freedesktop.org>
The state tracker tries to attach the info to the wrong shader. This is
easy enough to protect against.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Cc: 12.0 13.0 <mesa-stable@lists.freedesktop.org>
The predicate is always CC_NOT_P as defined in
processSurfaceCoordsNVE4(), so we only want to emit OR.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
These constant value VS PARAM exports:
- 0,0,0,0
- 0,0,0,1
- 1,1,1,0
- 1,1,1,1
can be loaded into PS inputs using the DEFAULT_VAL field, and the VS exports
can be removed from the IR to save export & parameter memory.
After LLVM optimizations, analyze the IR to see which exports are equal to
the ones listed above (or undef) and remove them if they are.
Targeted use cases:
- All DX9 eON ports always clear 10 VS outputs to 0.0 even if most of them
are unused by PS (such as Witcher 2 below).
- VS output arrays with unused elements that the GLSL compiler can't
eliminate (such as Batman below).
The shader-db deltas are quite interesting:
(not from upstream si-report.py, it won't be upstreamed)
PERCENTAGE DELTAS Shaders PARAM exports (affected only)
batman_arkham_origins 589 -67.17 %
bioshock-infinite 1769 -0.47 %
dirt-showdown 548 -2.68 %
dota2 1747 -3.36 %
f1-2015 776 -4.94 %
left_4_dead_2 1762 -0.07 %
metro_2033_redux 2670 -0.43 %
portal 474 -0.22 %
talos_principle 324 -3.63 %
warsow 176 -2.20 %
witcher2 1040 -73.78 %
----------------------------------------
All affected 991 -65.37 % ... 9681 -> 3353
----------------------------------------
Total 26725 -10.82 % ... 58490 -> 52162
v2: treat Undef as both 0 and 1
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com> (v1)
Tested-by: Edmondo Tommasina <edmondo.tommasina@gmail.com> (v1)
Found that information message while replaying a trace from
Metro 2033 Redux. Mark that property as useless for now.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
The per-element fetch has quite some calculations which are constant,
these can be moved outside both the per-element as well as the main
shader loop (llvm can figure out it's constant mostly on its own, however
this can have a significant compile time cost).
Similarly, it looks easier swapping the fetch loops (outer loop per attrib,
inner loop filling up the per vertex elements - this way the aos->soa
conversion also can be done per attrib and not just at the end though again
this doesn't really make much of a difference in the generated code). (This
would also make it possible to vectorize the calculations leading to the
fetches.)
There's also some minimal change simplifying the overflow math slightly.
All in all, the generated code seems to look slightly simpler (depending
on the actual vs), but more importantly I've seen a significant reduction
in compile times for some vs (albeit with old (3.3) llvm version, and the
time reduction is only really for the optimizations run on the IR).
v2: adapt to other draw change.
No changes with piglit.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Previous attempts to zero initialize all inputs were not really optimal
(though no performance impact was measurable). In fact this is not really
necessary, since we know the max number of inputs used.
Instead, just generate fetch for up to max inputs used by the shader,
directly replacing inputs for which there was no vertex element by zero.
This also cleans up key generation, which previously would have stored
some garbage for these elements.
And also drop the assertion which indicates such bogus usage by a
debug_printf (the whole point of initializing the undefined inputs was to
make this case safe to handle).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Compilation to actual machine code can easily take as much time as the
optimization passes on the IR if not more, so print this out too.
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
For the texturing packs, things looked pretty terrible. For every
lerp, we were repacking the values, and while those look sort of cheap
with 128bit, with 256bit we end up with 2 of them instead of just 1 but
worse, plus 2 extracts too (the unpack, however, works fine with a
single instruction, albeit only with llvm 3.8 - the vpmovzxbw).
Ideally we'd use more clever pack for llvmpipe backend conversion too
since we actually use the "wrong" shuffle (which is more work) when doing
the fs twiddle just so we end up with the wrong order for being able to
do native pack when converting from 2x8f -> 1x16b. But this requires some
refactoring, since the untwiddle is separate from conversion.
This is only used for avx2 256bit pack/unpack for now.
Improves openarena scores by 8% or so, though overall it's still pretty
disappointing how much faster 256bit vectors are even with avx2 (or
rather, aren't...). And, of course, eliminating the needless
packs/unpacks in the first place would eliminate most of that advantage
(not quite all) from this patch.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
During dual instance encoding submission, if the second encode task and first
encode task have no reference dependency, e.g. p following with idr-frame,
there is a chance the second task will use for its reconstructed picture
buffer the same buffer used by first task for its reference/reconstructed
picture. In this case, buffer corruption may occur depending on encoding
speed. Fix is to force flush these two tasks separately to avoid race condition
Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=98005
Signed-off-by: Boyuan Zhang <boyuan.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reminiscent from the pre-loader days, were we had multiple instances of
the loader logic in separate places and one could build a "GALLIUM_ONLY"
version.
Since that is no longer the case and the loaders (glx/egl/gbm) do not
(and should not) require to know any classic/gallium specific we can
drop the argument and the related code.
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The indirect handle has to come right after the coordinates, so if there
was a sample/bias/depth compare/offset, everything would end up being
shifted by one argument position.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
As specified in va.h, default value should be set on attributes
not present in the input list.
Signed-off-by: Julien Isorce <j.isorce@samsung.com>
Reviewed-by: Mark Thompson <sw@jkqxz.net>
Vulkan doesn't set these fields even though it doesn't use HiS.
HiS is disabled by programming DB_SRESULTS_COMPARE_STATEn to 0.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Split the source immediate value into new values and move them into the
original defs set by the split. Since we can only have up to 64-bit
immediates, this is largely beneficial for F64 (and, in the future, U64)
operations.
Signed-off-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
[imirkin: always use U32, set newi for foldCount tracking]
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Historically we use "device name" for the name of the kernel module and
"driver name" for the dri/other driver.
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Likely unused since day 1, although I've only checked back until the
st/dri unification with commit 29ca7d2c94 ("st/dri: merge dri/drm and
dri/sw backends")
Based on the comment, referencing drmOpenByName it's not something we
want to bring back.
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Constrained baseline profile is supported, so use that instead. This
matches what the encoder already does (constraint_set1_flag is always
set in the output bitstream).
Reviewed-by: Christian König <christian.koenig@amd.com>
This makes the supported format actually match the configuration, and
allows the user to observe that NV12 is supported for video processing
where previously they couldn't (though it did always work if they
blindly tried to use it anyway).
Reviewed-by: Christian König <christian.koenig@amd.com>
Both YUV420 and RGB32 configurations are supported, so we need to be
able to distinguish which is being used.
Reviewed-by: Christian König <christian.koenig@amd.com>
The encoder attributes are needed for a user of the encoder to be
able to configure it sensibly without internal knowledge.
Reviewed-by: Christian König <christian.koenig@amd.com>
Work in progress (disabled).
USE_8x2_TILE_BACKEND define in knobs.h enables AVX512 code paths
(emulated on non-AVX512 HW).
Signed-off-by: Tim Rowley <timothy.o.rowley@intel.com>
src2 was being given the wrong modifier, and we were not properly
managing the modifier on the SHL source either.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Previously, the plan was "if the width/height we have to load/store isn't
the size the user is planning on writing, then we need to load the old
contents out beforehand to prevent writing back undefined".
However, when we're doing glTexImage() we often end up aligning the
width/height into the padding of the texture, and we don't actually
need to read out that padding.
Improves x11perf -aatrapezoid100 performance from ~460/sec to
~700/sec.
Regression introduced by
ba0274c7d6
Check the resource exists before assigning it
a flag (and use This->base.resource instead
of pResource, since the former may have a newly
allocate resource, while the latter would be
NULL).
This should reintroduce the behaviour of previous
code.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Since 1604efa6fd,
lconsti and lconstb don't need to be initialized.
Remove some leftovers from the previous code (which
has now invalid use of ARRAY_SIZE on a pointer instead
of an array).
Reported by Coverity.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Use uint64_t instead of int64_t in the calculation,
as the result is uint64_t.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
All ARB_enhanced_layouts piglit tests pass without any changes
in our compiler.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
The table was copied from the Vulkan driver. The comment lines are as long
as the table for cosmetic reasons.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This is a serious performance fix. Discovered by luck.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94354
Cc: 12.0 <mesa-stable@lists.freedesktop.org>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
so that decompress blits aren't needed and depth texturing needs less
memory bandwidth.
Z16 and Z24 are promoted to Z32_FLOAT by the driver, because TC-compatible
HTILE only supports Z32_FLOAT. This doubles memory footprint for Z16.
The format promotion is not visible to state trackers.
This is part of TC-compatible renderbuffer compression, which has 3 parts:
DCC, HTILE, FMASK. Only TC-compatible FMASK compression is missing now.
I don't see a measurable increase in performance though.
(I tested Talos Principle and DiRT: Showdown, the latter is improved by
0.5%, which is almost noise, and it originally used layered Z16,
so at least we know that Z16 promoted to Z32F isn't slower now)
Tested-by: Edmondo Tommasina <edmondo.tommasina@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
For performance tuning in drivers. It filters out window system
framebuffers and OpenGL renderbuffers.
radeonsi will use this to guess whether a depth buffer will be read
by a shader. There is no guarantee about what will actually happen.
This is a departure from PIPE_BIND flags which are defined to be strict
but they are useless in practice.
Acked-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Fixes GL45-CTS.shader_image_load_store.basic-allTargets-atomic*
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Recent fix for non-const offsets broke the case of a single offset (vs 4
offsets). The later code relies on the offs array to contain null values
to tell whether they should be added onto the srcs list.
Fixes: 5239bd592 ("nvc0/ir: fix overwriting of value backing non-constant gather offset")
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
The offset needs to be properly copied over to the phi value, otherwise
it will get assigned to the base of the merge instead of the proper
location.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
v2: mark llvmpipe & softpipe properly as well (Jason Wood)
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Dave Airlie <airlied@redhat.com>
For specifying an exact location/component.
v2: change the order of parameters (Dave)
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net> (v1)
Reviewed-by: Dave Airlie <airlied@redhat.com> (v1)
v2: change the order of parameters (Dave)
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net> (v1)
Reviewed-by: Dave Airlie <airlied@redhat.com> (v1)
This is a screen cap because drivers are expected to support it either
for all shader types or for none of them.
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Dave Airlie <airlied@redhat.com>
radeonsi no longer supports pixel shaders without interpolation optimizations,
which led to assertion failures in si_shader_ps when running shader-db.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
total instructions in shared programs :2286901 -> 2284473 (-0.11%)
total gprs used in shared programs :335256 -> 335273 (0.01%)
total local used in shared programs :31968 -> 31968 (0.00%)
local gpr inst bytes
helped 0 41 852 852
hurt 0 44 23 23
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
This should make the code more robust if a shader tries to use inputs which
aren't defined by the vertex element layout (which usually shouldn't happen).
No piglit change.
Reviewed-by: Brian Paul <brianp@vmware.com>
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
SwrStoreTiles now takes a mask of surfaces to store. Reduces
overhead when storing multiple render targets.
Signed-off-by: Tim Rowley <timothy.o.rowley@intel.com>
Align and use streaming store instructions for BE fifo queues.
Provides slightly faster enqueue and doesn't pollute the caches.
Add appropriate memory fences to ensure streaming writes are
globally visible.
Signed-off-by: Tim Rowley <timothy.o.rowley@intel.com>
On Intel Pineview M hardware, the i915 gallium driver doesn't output
the correct gl_FragCoord. It seems to always have an X coord of 0.0
and a Y coord of the window's height in pixels, e.g. 600.0f or such.
I believe this is a regression caused in part by this commit:
afa035031f
The old behavior used the output at index zero, while the new behavior
uses actual zeroes. In the case of gl_FragCoord the output at index
zero happened to be the correct one, so the behavior appeared correct
although the code already had a bug.
Fixed by checking for I915_SEMANTIC_POS when setting up texCoords. If
the generic_mapping is I915_SEMANTIC_POS, look for the
TGSI_SEMANTIC_POSITION instead of a TGSI_SEMANTIC_GENERIC output.
https://bugs.freedesktop.org/show_bug.cgi?id=97477
Reviewed-by: Stéphane Marchesin <marcheu@chromium.org>
Tested-by: Stéphane Marchesin <marcheu@chromium.org>
Return error instead of crashing on source surfaces
with format D3DFMT_NULL.
Fix for issue #236.
Tested on Windows 7.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Wine tests show that cubetextures always use
PIPE_TEX_WRAP_CLAMP_TO_EDGE regardless of set
sampler states.
Fixes failing d3d9 wine test test_cube_wrap.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Add an assert to make sure buffer creation doesn't fail.
Add error handling in calling functions.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
The removed code was there for two reasons:
1) Allow DF16, DF24, INTZ to be used as depth buffer
for swapchain, if the driver doesn't support
PIPE_BIND_SAMPLER_VIEW for the underlying format
2) Set PIPE_BIND_SAMPLER_VIEW if possible, such that
if StretchRect is called on the depth texture, it is happy.
1) The reason these formats needed a workaround is because
the check flags for them in CheckDeviceFormat were incorrect,
which led applications to think the formats were valid for
swapchains, even if they weren't supported.
2) StretchRect limitations for depth buffers force
the resource_copy_region path, which should be fine without
PIPE_BIND_SAMPLER_VIEW.
Thus fix the check for 1), and remove the code.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Advertise quality levels:
Each supported multisample count matches to one quality level.
The application doesn't know how much samples each quality level has.
For that reason it's not possible to set the multisample mask.
Return errors on quality level missmatch.
Fixes several old games not having multisample support until now.
Fix for issue #73.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Compare resource's nr_samples instead of D3D multisample level.
Required for multisample quality levels to work correct.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Use strict aliasing in SetPrivateData and struct pheader.
Casting char[1] to IUnknown** isn't allowed in strict aliasing.
Compute pointer to body by adding size of header to header pointer.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Remove {Set/Get/Free}PrivateData in resource9.
Functionality has been implement in IUnknown interface.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Remove {Set/Get/Free}PrivateData in volume9.
Functionality has been implement in IUnknown interface.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Switch {Set/Get/Free}PrivateData function to introduced IUnknown functions.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Implement {Set/Get/Free}PrivateData in iunknown to get rid
of duplicated code in resource9 and volume9.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
According to MSDN the device is returned for surfaces that do
not have a regular container.
Such surfaces are:
OffscreenPlainSurface, DepthStencilSurface and RenderTarget
Tested and verified on Windows.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Allocate resources in surface ctor.
Allows to use statetracker internal memory accounting.
Fix for issue #231.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Signed-off-by: Axel Davy <axel.davy@ens.fr>
D3DFMT_NULL is mapped to PIPE_FORMAT_NONE.
Instead of relying on PIPE_FORMAT_NONE to
return a size, pick one.
The one picked is the same than Wine.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Add DBG calls to NineTexture9_GetLevelDesc and
NineTexture9_GetSurfaceLevel to ease debugging.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Tests showed that is allowed to call this method on
object that have a zero refcount.
Required for issue #230.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Same change than for vs ff.
This makes it easier to not introduce mistakes
reusing temporaries whose result shouldn't be
erased.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Patrick Rudolph <siro@das-labor.org>
Error found with wine tests.
nine_shader was expecting another order
than the one device9 was using.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Patrick Rudolph <siro@das-labor.org>
Thanks to wine tests.
Apparently 4x4 inverse is to be used, and
if the inverse can't be calculated, the
input matrix is to be used.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Wine tests for the passthrough feature are for positiont.
Nothing seems to indicate passthrough happens when positiont
it not used. However having passthrough with positiont makes
sense (to be used with ProcessVertices outputs).
Signed-off-by: Axel Davy <axel.davy@ens.fr>
We were (wrongly) adding specular to diffuse
in vertex shaders when SPECULARENABLE was set.
However the spec says specular has to be added
after texture processing (which is in ps).
Besides SPECULARENABLE is flagged as a pixel state.
There was unused support for SPECULARENABLE
in the ps ff code.
Remove the vs code, and use the ps code.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
There are several holes. This patch reduces
the holes a bit, which reduces the size of
the constant buffer uploaded.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
This has been a real mess up to now: the temporaries
were allocated once, and shared after that between
the different parts of the code.
To help maintaining the code, the temporaries are now
allocated and released on need.
As surprising as it could be, this patch, which was
supposed to introduce no behaviour change, actually
solved a visual bug observed on a sample program.
This was due to ureg_normalize3 polluting a temporary
variable.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
When this state is set, the normals computed
in the vs ff shader should be normalized.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Patrick Rudolph <siro@das-labor.org>
Software Vertex Processing allows:
. Less limitations for shaders (more loops, etc)
. Less limitations for ff (more enabled lights, 255
matrices for VertexBlend)
In particular shaders can get more constants.
This patch implements support for this (not using software
rendering, but hardware rendering, as llvmpipe and dx10+ hw
have the same limits...)
This is considered a second class path. Even apps asking for
"Mixed Vertex processing" (ie the ability to switch to swvp
on demand) do not use the feature much. Some just initialize
more constants than the normal limit at the start of the
application, but never use more than the normal limit.
When the apps do not need the software vertex processing
features, they do not seem to turn it on. This means it is
ok if that path is slow.
Thus no care has been made to make the path optimized.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
This path has been disabled for some time because
of some bugs with it. It hasn't been updated to the
new features, and is not faster.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Patrick Rudolph <siro@das-labor.org>
In mixed vertex processing, the user can enable or disable
software vertex processing. It is on hardware by default.
This feature is not a state, and thus the setting doesn't
need to be recorded by stateblocks.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Buffers with this flag must be usable with both software
and hardware vertex processing. Use Staging for fast cpu access.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Patrick Rudolph <siro@das-labor.org>
ATIx are "unknown" formats that do not follow block format conventions.
Tests showed that pitch*height bytes are allocated.
apitrace used to depend on this behaviour.
It used to copy more bytes than it has to for the ATI1 block format,
but it didn't crash on Windows.
Increase buffersize for ATI1 to fix this crash.
The same issue was present in WINE but a patch has been sent by me.
Signed-off-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
To implement the feature we copy the ps inputs to a temp array.
This is not optimal for performance, but it is the simplest solution.
This is a feature that is very very rarely used.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Gallium nine relies on aliasing to work with this function.
Without this patch, dirty region tracking was incorrect, which
could lead to incorrect textures or vertex buffers.
Fixes several game bugs with nine.
Fixes https://github.com/iXit/Mesa-3D/issues/234
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Patrick Rudolph <siro@das-labor.org>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Cc: "12.0" <mesa-stable@lists.freedesktop.org>
On 32 bits system, application memory is quite limited.
softpipe uses application memory. To help prevent memory
exhaustion, limit reported memory availability to 2GB.
Some gallium nine apps do check reported memory by allocating
resources until memory is full. Gallium nine refuses allocations
when 80% of the reported memory limit is used. This change
helps some apps to start.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
On 32 bits system, application memory is quite limited.
llvmpipe uses application memory. To help prevent memory
exhaustion, limit reported memory availability to 2GB.
Some gallium nine apps do check reported memory by allocating
resources until memory is full. Gallium nine refuses allocations
when 80% of the reported memory limit is used. This change
helps some apps to start.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
On systems with more than 4GB of ram,
os_get_total_physical_memory was triggering an integer
overflow for the linux and haiku path, when on
32 bits.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94561
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Fixes regression introduced by
ecd6fce261
and is more future proof than just clearing the next
field.
Other nine usages did already zero out the templates.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Acked-by: Edward O'Callaghan <funfunctor@folklore1984.net>
When offset != 0, the valid range was wrong because the second
argument of util_range_add() is end, not size.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
Normally the value is an immediate, which is moved to some temporary, so
there's no problem. In the case of a non-constant offset (as allowed by
ARB_gpu_shader5), we have to take care to copy it first before using it
to build up the bits.
This fixes a compilation error observed in F1 2015.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Cc: mesa-stable@lists.freedesktop.org
A function with multiple returns would have had multiple preret settings
at the top of the function. While this is unlikely to have caused issues
since we don't use functions in earnest, it could have in some cases
overflowed the call stack, in case a function had a lot of early
returns.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
This can be helpful with R600_DEBUG=preoptir.
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Not sure if it's possible to avoid programming the block size twice (once for
the userdata and once for the dispatch).
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Mostly test code, plus one spot I noticed in r600.
Signed-off-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
We will only hit this with multi-planar YUV external images, so we would
probably never hit this code path in the first place. But if we did, it
wouldn't do the right thing so just bail.
Signed-off-by: Rob Clark <robdclark@gmail.com>
We have to be careful to not smash the value they're clearing to, but
other than that we're fine. Avoids quad clears in Processing, which likes
to do glClear(Z|S); glClear(Z).
Improves performance of Processing's QuadRendering demo at 5000 quads by
5.46507% +/- 1.35576% (n=15 before, 32 after)
We were incrementing the count at the end of vc4_start_draw(), except that
that function returns immediately if we've already started drawing on this
batch. It also failed to count the statechanges from the GFXH-515
workaround.
This incidentally allows repeated glClear() to be coalesced, because the
fast clears aren't counted in draw_calls_queued any more. Fixes most of
the extra flushes in Processing, which emits glClear(Z|S); glClear(Z);
glClear(C) during its frame setup.
Improves performance of Processing's QuadRendering demo at 5000 quads by
3.33538% +/- 2.05846% (n=21 before, 15 after)
This slightly reduces instructions on shader-db, but I think it's just
perturbing register allocation -- the allocator should have always
trivially colored these nodes, before. This commit is just to make QIR
code failing more intelligible when register allocation fails.
If a conditional assignment is only conditioned on the exec mask, that's
still screening off the value in the executed channels (and, since we're
not storing to the unexcuted channels, we don't care what's in there).
Fixes a bunch of extra register pressure on Processing's Ribbons demo,
which is failing to allocate.
We would assertion fail in setting up the simulator the second time
around. This at least postpones the assertion failure until we've closed
all of the first set of screens and started opening a new set.
Checking if MAD is supported is definitely wrong, and it's
more likely a typo I introduced few days ago which breaks
NV50 because SHLADD is not supported there.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
When the chipset is forced with NV50_PROG_CHIPSET, we actually
only want to output the binary if NV50_PROG_DEBUG is also
enabled. Otherwise, this pollutes the shader-db output.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Only expose 512 threads/block on Fermi to not be limited by
32 GPRs/thread.
v4: - use 512 threads on Fermi, 1024 on Kepler+
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
When a variable local size is defined as specified by
ARB_compute_variable_group_size, the fixed local size is set to 0
and a SIGFPE occurs when we compute the maximum number of regs.
This allows to use 64 GPRs/thread.
v4: - use 512 threads on Fermi, 1024 on Kepler+
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
v3: - use a new case statement in r600_pipe_common.c
- fix compilation of softpipe...
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
helped some ue4 demos and divinity OS shaders
total instructions in shared programs : 2818674 -> 2818606 (-0.00%)
total gprs used in shared programs : 379273 -> 379273 (0.00%)
total local used in shared programs : 9505 -> 9505 (0.00%)
total bytes used in shared programs : 25837792 -> 25837192 (-0.00%)
local gpr inst bytes
helped 0 0 33 33
hurt 0 0 0 0
Signed-off-by: Karol Herbst <karolherbst@gmail.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Pierre Moreau <pierre.morrow@free.fr>
One of NIR's invariants is that control flow lists always start and end
with blocks. There's no good reason why we should return a cf_node from
these functions since we know that it's always a block. Making it a block
lets us remove a bunch of code.
Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Otherwise it won't be picked in the tarball and the build will fail.
Fixes: 0035f7f136 ("svga: add guest statistic gathering interface")
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Similar to the other 'tests', enable assertions in xvmc_bench.
This silences the GCC warnings about unused-variable(s), makes the
program actually useful, as the XvMC API called. Atm the function
calls are omitted, since they're called within the assert.
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Currently, program binaries are only dumped at upload time, but
when the chipset has been forced via NV50_PROG_CHIPSET we might
want to show the generated code, especially with shaderdb.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
There are VM faults without this.
Cc: 12.0 <mesa-stable@lists.freedesktop.org>
Acked-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
which is returned for GL_MAX_TEXTURE_BUFFER_SIZE.
It doesn't have any other use at the moment.
Bigger allocations are not rejected.
This fixes GL45-CTS.texture_buffer.texture_buffer_max_size on Bonaire.
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
(v1 pushed, then reverted)
This fixes 9 randomly failing tests on radeonsi:
GL45-CTS.shader_multisample_interpolation.render.interpolate_at_centroid.*
v2: use input_interpolate[input] (correct) instead of
input_interpolate[index] (incorrect)
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>