Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Fix a crash when invoking "gnome-control-center info" on QEMU, where a
zero height is passed at init.
(sroland: simplify logic by eliminating the div altogether, using 64bit mul.)
Fixes: https://bugzilla.novell.com/show_bug.cgi?id=879462
Cc: "10.2" <mesa-stable@lists.freedesktop.org>
Gallium (but not OpenGL) does allow nesting of queries, but there's no
limit specified (d3d10 has no limit either). Nevertheless, for practical
purposes we need some limit in llvmpipe, otherwise we'd need more complex
handling of queries as we need to keep track of all binned queries (this
only affects queries which gather data past setup). A limit of 16 is too
small though, while 64 would suffice.
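For illustration, the cap could look roughly like this sketch (the macro
and field names here are assumed, not taken from the actual patch):

#define LP_MAX_ACTIVE_BINNED_QUERIES 64   /* assumed name */

/* in begin-query binning: refuse further queries once the list is full */
if (scene->num_active_queries >= LP_MAX_ACTIVE_BINNED_QUERIES)
   return;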
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This made sense when swizzled storage layout was used for rendering to tiles.
But nowadays the name just adds confusion (and makes for long lines).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Framebuffers have been allowed to have NULL attachments for a while now. llvmpipe handled
that properly for lp_rast_shade_quads_mask but it seems the change didn't
make it to lp_rast_shade_tile.
This fixes piglit fbo-drawbuffers-none test (though I need to increase
the FB_SIZE from 32 to 256 so the tris cover some tiles fully).
https://bugs.freedesktop.org/show_bug.cgi?id=79421
Cc: "10.1 10.2" <mesa-stable@lists.freedesktop.org>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
2ea923cf57 had the side effect that IR counting is now done before IR
optimization instead of after it. Some quick analysis shows that there are
roughly 1.5 times more IR instructions before optimization than after,
hence the effective shader cache size got quite a bit smaller. We could
counter this by increasing the instruction limit, but it probably makes
more sense to count instructions after optimization, so move that code.
Reviewed-by: Brian Paul <brianp@vmware.com>
When we had just one module, "gallivm" was an appropriate name. But now we
have modules containing all functions for a particular variant, so give them
a corresponding name (this is really just to help debugging).
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
The setup shader names were composed of both a fs shader number and a
variant number. But since setup shaders aren't tied to a particular fragment
shader, the former was a fixed zero while the latter was also always zero
because it was never assigned. So, similar to what the fs code does, use an
ever-increasing number to give them a more catchy name (unlike fragment
shaders, though, where this number is per explicitly created shader, we just
use it for the implicitly created variants).
And while here, fix whitespace a bit.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This was unused except for being incremented whenever fs and setup shader
variants were created. Probably a leftover from ages ago.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Older versions haven't been tested and probably don't work anyway. But more
importantly, the code supporting them is hindering further work.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
When the limit was increased and redefined in terms of
LP_MAX_SHADER_VARIANTS (75f1fea14f), this inadvertently lowered the limit
in some branches (those with a lower LP_MAX_SHADER_VARIANTS number) when
merged. So, make sure the limit is always at least the number it once was.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
GL (3.0) allows you to clear individual color buffers in a fb. In fact, for
fbs containing both int and float/normalized color buffers this is required
(because the clear values are otherwise undefined if applied to all
buffers). The gallium interface was changed a while ago, but llvmpipe
ignored it (hence doing such individual clears always resulted in clearing
all buffers, plus some assorted asserts due to the mixed fbs).
So change the clear command to indicate the buffer to be cleared. Also,
because indicating the buffer to be cleared would have made lp_rast_arg_cmd
larger, which is unacceptable (we're trying to shrink it some day), allocate
the clear value in the scene and just pass a pointer.
There are several advantages and disadvantages here:
+ clearing individual buffers works (we could also actually bin such clears
now if they come through clear_render_target() and the surface is in the
current fb, though we didn't do this before for the single-rb case and
still don't try).
+ since there's one clear per rb, we do the format conversion in setup rather
than per bin. Aside from the (drop in the ocean...) performance advantage this
means that clearing to very small values (that is, denormal when converted to
the format) should work for small float (fp16 etc.) formats, as the util code
couldn't handle it correctly before (because cpu denorms are disabled when
executing the bin commands, screwing up the magic conversion and flushing
the values to 0, though this was not verified).
- there's some overhead for traditional old-style clear-all MRT cases, since
there's one rast clear command per rb instead of one for all rbs.
This fixes https://bugs.freedesktop.org/show_bug.cgi?id=76976.
v2: get rid of the ugly manual memcpy stuff and just use union util_color.
This is 32 bytes instead of 16 but as the allocation is per scene we can live
with those additional 16 bytes (and the additional 128 bytes in the setup
context), which makes the code much more obvious. Suggested by Brian.
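Roughly, the new rast argument looks like this sketch (hypothetical names;
union util_color is the type mentioned above):

/* allocated from scene memory, so lp_rast_arg_cmd only carries a pointer */
struct lp_rast_clear_rb {
   union util_color color_val;   /* converted to the rb format in setup */
   unsigned cbuf;                /* which color buffer to clear */
};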
Reviewed-by: Brian Paul <brianp@vmware.com>
According to Roland all TGSI support is there in theory.
In practice there are a few piglit failures and crashes, as this hadn't
been tested before.
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Defaults to providing the same offsets as MIN/MAX_TEXEL_OFFSET. For
nvc0, the offset can be -32/31.
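A sketch of a driver's get_param hook honoring the new caps (treat the
enum names as assumed):

case PIPE_CAP_MIN_TEXTURE_GATHER_OFFSET:
   return -32;   /* nvc0; the default would mirror MIN_TEXEL_OFFSET */
case PIPE_CAP_MAX_TEXTURE_GATHER_OFFSET:
   return 31;    /* nvc0; the default would mirror MAX_TEXEL_OFFSET */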
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
This opcode provides support for GL_ARB_texture_query_lod.
Signed-off-by: Dave Airlie <airlied@redhat.com>
[imirkin: rebase, docs update]
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
This adds a gallium cap that allows us to fake GL3.0 by
not exposing MSAA on sw rendering.
It also forces the extra extensions needed for GL3.2.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Eliminate lp_vertex_shader, as it added nothing over draw_vertex_shader.
Simplify lp_geometry_shader, as most of the incoming state is unneeded.
(We could also just use draw_geometry_shader if we were willing to peek
inside the structure.)
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Zack Rusin <zackr@vmware.com>
The conversion code for srgb was tuned for n x 4x8bit AoS -> 4 x nxfloat
SoA (and vice versa); fix this so it also handles 16bit 565-style srgb
formats. Still not really all that generic: things like r10g10b10a2_srgb or
r4g4b4a4_srgb wouldn't work (the latter would be trivial to fix; the former
wouldn't need much work just to avoid crashing, but would almost certainly
need some higher-precision calculation), but they aren't needed right now.
The code is not fully optimized for this (could use more direct calculation
instead of expanding to 8-bit range first) but should be good enough.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
The AoS version of lp_build_blend_factor was assuming that if the first
channel was alpha, there were no rgb components.
Fixes glean/blendFunc on System z. No piglit regressions on x86_64.
The shortcut is still used in tests like spec/ARB_framebuffer_object/
fbo-alpha.
Signed-off-by: Richard Sandiford <rsandifo@linux.vnet.ibm.com>
D3D10 allows setting the internal offset of a buffer, which in general is
only incremented via actual stream output writes. By allowing the internal
offset to be set, draw_auto is capable of rendering from buffers which have
not actually been streamed out to. Our interface didn't allow that. This
change shouldn't make any functional difference to OpenGL, where instead of
an append_bitmask you just get a real array in which -1 means append (like
in D3D) and 0 means do not append.
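An illustrative call under the new interface (signature approximate):

/* -1 keeps the internal offset (append), any other value overwrites it */
unsigned offsets[2] = { (unsigned)-1, 0 };
pipe->set_stream_output_targets(pipe, 2, targets, offsets);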
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
This adds support to gallium for a TG4 instruction,
and two CAPs. The first CAP is required for GL_ARB_texture_gather.
The second CAP is required to expose GL_ARB_gpu_shader5.
However, so far we haven't found any hardware that natively
exposes the textureGatherOffsets feature from GL, so just
lower it for now. If hardware appears for this we can add
another CAP to allow TG4 to take 4 offsets.
v2: add component selection src and a cap to say
hw can do it. (st can use it to help control
GL_ARB_gpu_shader5/GLSL 4.00). Add docs.
v3: rename to SM5, add docs.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
The code re-enabling denorms for small float formats did not recognize
this format due to format handling hacks (mainly, the lp_type doesn't have
the floating bit set).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Tungsten Graphics Inc. was acquired by VMware Inc. in 2008. Leaving the
old copyright name is creating unnecessary confusion, hence this change.
This was the sed script I used:
$ cat tg2vmw.sed
# Run as:
#
# git reset --hard HEAD && find include scons src -type f -not -name 'sed*' -print0 | xargs -0 sed -i -f tg2vmw.sed
#
# Rename copyrights
s/Tungsten Gra\(ph\|hp\)ics,\? [iI]nc\.\?\(, Cedar Park\)\?\(, Austin\)\?\(, \(Texas\|TX\)\)\?\.\?/VMware, Inc./g
/Copyright/s/Tungsten Graphics\(,\? [iI]nc\.\)\?\(, Cedar Park\)\?\(, Austin\)\?\(, \(Texas\|TX\)\)\?\.\?/VMware, Inc./
s/TUNGSTEN GRAPHICS/VMWARE/g
# Rename emails
s/alanh@tungstengraphics.com/alanh@vmware.com/
s/jens@tungstengraphics.com/jowen@vmware.com/g
s/jrfonseca-at-tungstengraphics-dot-com/jfonseca-at-vmware-dot-com/
s/jrfonseca\?@tungstengraphics.com/jfonseca@vmware.com/g
s/keithw\?@tungstengraphics.com/keithw@vmware.com/g
s/michel@tungstengraphics.com/daenzer@vmware.com/g
s/thomas-at-tungstengraphics-dot-com/thellstom-at-vmware-dot-com/
s/zack@tungstengraphics.com/zackr@vmware.com/
# Remove dead links
s@Tungsten Graphics (http://www.tungstengraphics.com)@Tungsten Graphics@g
# C string src/gallium/state_trackers/vega/api_misc.c
s/"Tungsten Graphics, Inc"/"VMware, Inc"/
Reviewed-by: Brian Paul <brianp@vmware.com>
Fixes regression from 9baa45f78b
v2: incorporate a few small changes suggested by Roland.
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
The whole round-pointsize-to-int stuff must only be done with GL legacy
rules (no point_quad_rasterization) or all the wrong edges are lit up.
This was previously in a private branch (d3d pointsprite test complains
loudly otherwise) and got lost in a merge. However, it should certainly
apply to GL point sprite rasterization as well.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
It's possible to bind a smaller buffer as a constant buffer than what the
shader actually uses/requires. This could cause nasty
crashes. This patch adds the architecture to pass the maximum
allowable constant buffer index to the jit to let it make
sure that the constant buffer indices are always within bounds.
The behavior follows the d3d10 spec, which says the overflow
should always return all zeros, and overflow is only defined
as access beyond the size of the currently bound buffer. Accesses
beyond the declared shader constant register size are not
considered an overflow and are expected to return garbage, but consistent
garbage (we follow the behavior which some WLK tests expect, which is to
return the actual values from the bound buffer).
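In scalar terms the jit code now behaves like this sketch (names assumed):

/* num_consts is derived from the size of the currently *bound* buffer */
static float
fetch_const(const float *buf, unsigned num_consts, unsigned index)
{
   /* d3d10: reads beyond the bound buffer size return zero */
   return index < num_consts ? buf[index] : 0.0f;
}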
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Commit eda21d2a30 fixed the rasterization
of points for Direct3D but ended up breaking the rasterization of OpenGL
non-sprite points, in particular conform's pntrast.c test.
The only way to get both working is to properly honour
pipe_rasterizer::point_quad_rasterization, and follow the weird OpenGL
rule when it is false.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
The adjustment needs to be applied to the y coordinates and not the x
coordinates, just like the equivalent code for lines and triangles in
lp_setup_line.c and lp_setup_tri.c.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Zack Rusin <zackr@vmware.com>
This was inadvertently forgotten when replacing gl_rasterization_rules
with lower_left_origin and half_pixel_center (commit
2737abb44e).
This makes a difference when lower_left_origin != half_pixel_center, e.g.,
D3D10.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Zack Rusin <zackr@vmware.com>
We don't support MSAA (i.e., the number of samples is always one),
therefore sample_mask boils down to a synonym of the rasterizer_discard
flag.
Also, this change makes setup actually use the value received in
lp_setup_set_rasterizer_discard instead of reaching out to llvmpipe
upper layers to re-fetch it.
Based on Si Chen's draft.
With this patch, "wgf11multisample Coverage" passes 100% on the UMD
D3D10 state tracker.
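In effect (sketch, variable names assumed):

/* with exactly one sample, an all-zero sample mask is a full discard */
bool discard = rast->rasterizer_discard || (sample_mask & 1) == 0;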
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Si Chen <sichen@vmware.com>
Implement Alpha to Coverage by discarding a fragment if its alpha component
is less than 0.5. This is joint work by Jose and Si.
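In scalar terms the generated code does something like this sketch (the
key field name is assumed):

/* single-sample alpha-to-coverage degenerates to a threshold discard */
if (key->alpha_to_coverage && alpha < 0.5f)
   mask = 0;   /* kill the fragment */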
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
This fixes another case of faulting when freeing a pipe_sampler_view
that belongs to a previously destroyed context.
Cc: "10.0" <mesa-stable@lists.freedesktop.org>
Signed-off-by: Jonathan Liu <net147@gmail.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Didn't really work as well as hoped (in particular it was not generally
more accurate), will solve this differently.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This code was always problematic, and with 64bit rasterization we no longer
need it at all.
Reviewed-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
This patch adds MESA_copy_sub_buffer support to the dri sw loader and then
to the gallium state tracker, llvmpipe, softpipe and other bits.
It reuses the dri1 driver extension interface, and it updates the swrast
loader interface with a new putimage which can take a stride.
I've tested this with gnome-shell with a cogl hacked to reenable sub copies
for llvmpipe and the one piglit test.
I could probably split this patch up as well.
v2: pass a pipe_box, to reduce the entrypoints, as per Jose's review,
add to p_screen doc comments.
v3: finish off winsys interfaces, add swrast classic support as well.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
swrast: add support for copy_sub_buffer
With this patch llvmpipe will adhere to the ARB_depth_clamp enabled state when
clamping the fragment's zw value. To support this, the variant key now includes
the depth_clamp state. key->depth_clamp is derived from pipe_rasterizer_state's
(depth_clip == 0), thus depth clamp is only enabled when depth clip is disabled.
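The clamp itself is simple; a scalar sketch of what the variant applies to
the fragment z, assuming the viewport min/max depth as the clamp range:

static float
clamp_frag_z(float z, float min_depth, float max_depth)
{
   /* only emitted when key->depth_clamp, i.e. when depth clip is off */
   return z < min_depth ? min_depth : (z > max_depth ? max_depth : z);
}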
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Disabled by default, but it's very useful when needed.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
The fact that we flush denorms to zero breaks our half-float conversion and
blending. This patch enables denorms for blending. It's a little tricky due
to the llvm bug that makes
it incorrectly reorder the mxcsr intrinsics:
http://llvm.org/bugs/show_bug.cgi?id=6393
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Signed-off-by: Zack Rusin <zackr@vmware.com>
With this patch, generate_fs_loop will clamp any fragment shader depth writes
to the viewport's min and max depth values. Viewport selection is determined
by the geometry shader output for the viewport array index. If no index is
specified, then the default viewport index is zero. Semantics for this path
can be found in draw_clamp_viewport_idx and lp_clamp_viewport_idx.
lp_jit_viewport was created to store viewport information visible to JIT code,
and is validated when the LP_NEW_VIEWPORT dirty flag is set.
lp_rast_shader_inputs is responsible for passing the viewport_index through
the rasterizer stage to the fragment stage (via lp_jit_thread_data).
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Ever since introducing separate sampler and sampler view limits this was
really missing.
For now, every driver but llvmpipe reports the same number for sampler
views as for samplers, so nothing should break.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
8 bit precision is required by d3d10 but unfortunately it requires a 64 bit
rasterizer. This commit implements 64 bit rasterization with full support
for 8 bit subpixel
precision. It's a combination of all individual commits
from the llvmpipe-rast-64 branch.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Some rounding errors could crop up when calculating a0. Use a more accurate
method (essentially barycentric interpolation) to fix this, though it does
absolutely nothing for the REAL problem (which is that our interpolation
will give very bad results with small triangles far away from the origin
when they have steep gradients) - it actually makes that slightly worse.
(To fix the real problem we would either need to use a vertex corner (or
some other point inside the tri) as the starting point value instead of the
fb origin and pass that down to interpolation, or mimic what hw does and use
barycentric interpolation (with the coordinates extracted from the
rasterizer edge functions) - maybe another time.)
Some (silly) tests though really want a high accuracy at fb origin and don't
care much about anything else (Just. Don't. Ask.).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
* minimise flags duplication
* distinguish between VISIBILITY C and CXX flags
* set only the required flags - C and/or CXX
v2: add LLVM_CFLAGS back to AM_CFLAGS (add missing backslash)
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
In particular get rid of home-grown vector helpers which didn't add much.
And while here fix formatting a bit. No functional change.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
d3d10 requires us to convert NaNs to zero for any float->int conversion.
We don't really do that, but it mostly seems to work. In particular I
suspect the very common float->unorm8 path only really passes because it
relies on sse2 pack intrinsics which just happen to work by luck for NaNs
(float->int conversion in hw gives an integer indeterminate value, which
just happens to be -0x80000000 and hence gets converted to zero in the end
after the pack intrinsics).
However, float->srgb didn't get so lucky, because we need to clamp before
blending, and clamping resulted in undefined NaN behavior (NaNs actually
got converted to 1.0 by the sse2 clamp). Fix this by using a zero/one clamp
with defined NaN behavior, as we can handle the NaN for free this way.
I suspect there are more bugs lurking in this area (e.g. converting floats
to snorm) as we don't really use defined NaN behavior everywhere, but this
seems to be good enough.
While here, respecify the NaN behavior modes a bit; in particular the
return_second mode didn't really do what we wanted. From the caller's
perspective, we really wanted to say we need the non-nan result, but we
already know the second arg isn't a NaN. So we use this now instead, which
means that cpu architectures which actually implement min/max by always
returning the non-nan operand (that is, adhering to ieee754-2008 rules)
don't need to bend over backwards for nothing.
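The trick relies on the sse operand order; a self-contained sketch:

#include <xmmintrin.h>

/* minps/maxps return the *second* operand when the first is a NaN, so
 * putting x first folds NaN to 0.0 as part of the zero/one clamp */
static __m128
clamp01_nan_to_zero(__m128 x)
{
   return _mm_min_ps(_mm_max_ps(x, _mm_setzero_ps()), _mm_set1_ps(1.0f));
}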
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Since we explicitly require an integer input we should avoid using exp2
math (even if we were using optimized versions), which turns the exp2 into
an int sub (plus some casts).
v2: fix bogus uint (needs to be int) math spotted by Matthew, fix comments
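The int-sub trick, sketched in plain C (this assumes the scaled value
stays a normal float):

#include <stdint.h>
#include <string.h>

/* scaling a positive normal float by 2^-n is an integer subtraction on
 * its exponent field - no exp2 evaluation needed */
static float
scale_down_pow2(float x, int32_t n)
{
   int32_t bits;
   memcpy(&bits, &x, sizeof bits);
   bits -= n * (1 << 23);
   memcpy(&x, &bits, sizeof bits);
   return x;
}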
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
With this patch, the llvmpipe and draw modules will calculate the depth bias
according to floating point depth buffer semantics described in the
arb_depth_buffer_float specification, when the driver has a z buffer bound
with a format type of UTIL_FORMAT_TYPE_FLOAT.
By default, the driver will use the existing UNORM calculation for depth bias.
A new function, draw_set_zs_format, was added to calculate the Minimum
Resolvable Depth value and floating point depth sense for the draw module.
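A sketch of the float-depth rule from arb_depth_buffer_float (23 being the
number of float mantissa bits; m is the maximum depth slope; this is the
spec's formula, not the exact draw code):

#include <math.h>

/* the minimum resolvable difference r scales with the exponent of the
 * largest z of the primitive instead of being a format constant */
static double
float_depth_bias(double max_z, double m, double factor, double units)
{
   int exp;
   frexp(max_z, &exp);
   return m * factor + ldexp(1.0, exp - 23) * units;
}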
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
The layer coming from the GS needs to be clamped (not sure if that's
actually the correct error behavior, but we need something) as the number
can be higher than the number of layers in the fb. However, this code was
using the layer calculation from the scene, which was actually calculated
in lp_scene_begin_rasterization() and hence too late (so setup was using
the value from the _previous_ scene, or just zero if it was the first
scene).
Since the value is used in both rasterization and setup, move calculation up
to lp_scene_begin_binning() though it's a bit more inconvenient to calculate
there. (Theoretically could move _all_ code which was in
lp_scene_begin_rasterization() to there, because ever since we got rid of
swizzled render/depth buffers our "map" functions preparing the fb data for
render don't actually change the data in there at all, but it feels like
it would be a hack.)
v2: improve comments
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This CAP will determine whether ARB_framebuffer_object can be enabled.
The nv30 driver does not allow mixing swizzled and linear zsbuf/cbuf
textures.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Marek Olšák <marek.olsak@amd.com>
The new function replaces four old functions: set_fragment/vertex/
geometry/compute_sampler_views().
Note: at this time, it's expected that the 'start' parameter will
always be zero.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Tested-by: Emil Velikov <emil.l.velikov@gmail.com>
This reverts commit 94d05bf87a as it has a
few problems:
- it breaks windows builds because env[LLVM_CXXFLAGS] is never set there
- it merges not only rtti but the whole cxxflags (defines etc.),
which has proven to be a source of trouble (breaks debugging etc.)
* The rtti fix actually dug up a bug in the scons build scripts.
* Autotools took the LLVM cpp and cxx flags, while scons only took
the cpp flags.
* This grabs the cxx flags and applies them where needed. We may
want to make the same change for the llvm cpp flags in scons.
* The only linux platform I can find with LLVM no-rtti is Ubuntu.
* Fixes bug #70471
Tested-by: Vinson Lee <vlee@freedesktop.org>
The previous limit of 128*1024 was reported on IRC to cause frequent
recompiles in some apps due to shader variant thrashing, leading to
noticeable lags.
Note that the LP_MAX_SHADER_VARIANTS limit (1024) was more or less
impossible to reach, since even simple fragment shaders without texturing
(glxgears) used more than 256 instructions, hence the instruction limit
would always have been reached first (excluding things like trivial shaders
not writing color). Even with the new limit it is VERY likely the
instruction limit is hit first.
Should help with such lags due to recompiles (though other shader types
have their own limits, LP_MAX_SETUP_VARIANTS and DRAW_MAX_SHADER_VARIANTS;
in particular the latter seems a bit small (128)).
Reviewed-by: Brian Paul <brianp@vmware.com>
Unless the polygon fill mode is different from PIPE_POLYGON_MODE_FILL,
so checking the polygon mode is sufficient.
Testing done: no regression in polygon-mode-offset
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
As we're moving towards expanding the number of subpixel
bits and the width of the variables used in the computations
we need to make this code a bit more centralized.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
shader has already been dereferenced earlier so cannot be null here.
Fixes "Dereference before null check" defect reported by Coverity.
Signed-off-by: Vinson Lee <vlee@freedesktop.org>
Reviewed-by: Brian Paul <brianp@vmware.com>
We need to count the clipper primitives before the rasterizer discards the
ones it considers to be null.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
We need to subdivide triangles if either of the dimensions is
larger than the max edge length, not when both of them are larger.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
This reverts commit 755c11dc5e.
We agreed that this is a band-aid that's not very useful and that
the proper solution is to rewrite the rasterization algo
so that it operates on 64 bit values.
Signed-off-by: Zack Rusin <zackr@vmware.com>
When subdividing a triangle we're using a temporary array to store
the new coordinates for the subdivided triangles. Unfortunately
the array used for that was not aligned properly, causing
random crashes in the llvm jit code, which was trying to load
vectors from it.
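The fix boils down to forcing the alignment the jit's vector loads expect;
a sketch using gallium's alignment macro (array dimensions are made up):

/* 16-byte aligned storage for the subdivided triangle coordinates */
PIPE_ALIGN_VAR(16) float verts[3][PIPE_MAX_ATTRIBS][4];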
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Unfortunately d3d10 requires a lot higher precision (e.g.
wgf11clipping tests for it). The smallest number of precision
bits with which it passes is 8. That means that we need to
decrease the maximum length of an edge that we can handle without
subdivision by 4 bits. Abstracted the code a bit to make it easier
to change once we switch to 64bit rasterization.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
r600g needs explicit flushing before DRI2 buffers are presented on the screen.
v2: add (stub) implementations for all drivers, fix frontbuffer flushing
v3: fix galahad
Signed-off-by: Marek Olšák <marek.olsak@amd.com>
We must take rounding into consideration when re-scaling to narrow
normalized channels, such as 2-bit normalized alpha.
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
In particular, no one is interested in the vertex count, so drop that, and
also drop the duplicated num_primitives_generated /
so.primitives_storage_needed variables in drivers. For now I am unable to
figure out whether primitives_storage_needed in SO stats (used for d3d10)
should increase if SO is disabled, though the equivalent
num_primitives_generated used for OpenGL definitely should. In any case we
were only counting when SO is active in both softpipe and llvmpipe anyway,
so don't pretend there's an independent num_primitives_generated counter
which would always count.
(This means the PIPE_QUERY_PRIMITIVES_GENERATED count will still be wrong
just as before; we should eventually fix this by either counting separately
for this query or adjusting the code so it always counts even if SO is
inactive, depending on what's correct for d3d10.)
Reviewed-by: Brian Paul <brianp@vmware.com>
There's a new debug value used to disable per-quad lod optimizations in the
fragment shader (ignored for vs/gs as the results are typically just too
wrong). Also try to detect if a supplied lod value is really a scalar (if
it's coming from the immediate or constant file), in which case the sampler
code can stay on the per-quad-lod path (in fact for explicit lod we could
simplify even further and use the same lod for both quads in the avx case,
but this is not implemented yet).
Still need to actually implement per-element lod bias (and derivatives),
and need to handle per-element lod in size queries.
v2: fix comments, prettify.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This is a very well hidden bug found by accident (only the fixed glean
tstencil2 test so far seems to hit it).
We must use a new mask combining the s_pass values and the orig_mask values
for the zpass/zfail stencil ops, otherwise both the sfail op and one of the
zpass/zfail ops are applied (probably not hit in most tests because some of
the ops tend to be KEEP usually).
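In scalar mask terms, the fix amounts to this sketch:

#include <stdint.h>

/* zpass/zfail must only see fragments that survived the stencil test */
static void
stencil_op_masks(uint32_t orig_mask, uint32_t s_pass, uint32_t z_pass,
                 uint32_t *sfail, uint32_t *zfail, uint32_t *zpass)
{
   *sfail = orig_mask & ~s_pass;
   *zfail = orig_mask & s_pass & ~z_pass;
   *zpass = orig_mask & s_pass & z_pass;
}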
Note: this is a candidate for the 9.2 branch.
Reviewed-by: Zack Rusin <zackr@vmware.com>
If the fragment shader is null then pixel shader invocations have
to be equal to zero. And if we're running a null ps then clipper
invocations and primitives should be equal to zero, but only
if both stencil and depth testing are disabled.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
My previous attempt at doing so double-failed miserably (minification of
zero still gives one, and even if it did not, the value was never written
anyway).
While here also rename the confusingly named int_vec bld as we have int vecs
of different sizes, and rename need_nr_mips (as this also changes out-of-bounds
behavior) to is_sviewinfo too.
Reviewed-by: Zack Rusin <zackr@vmware.com>
d3d10 has no notion of distinct array resources at either the resource or
the sampler view level. However, shader dcl of resources certainly has, and
d3d10 expects resinfo to return the values according to that - in particular
a resource might have been a 1d texture with some array layers, then the
sampler view might have only used 1 layer so it can be accessed both as a 1d
or a 1d array texture (I think - the former definitely works). resinfo of a
resource declared as array needs to return the number of array layers, but a
non-array resource needs to return 0 (and not 1). Hence fix this by passing
the target from the shader decl to emit_size_query and use that (in case of
OpenGL the target will come from the instruction itself).
Could probably do the same for actual sampling, though it may not matter there
(as the bogus components will essentially get clamped away), possibly could
wreak havoc though if it REALLY doesn't match (which is of course an error
but still).
Reviewed-by: Zack Rusin <zackr@vmware.com>
Clearly the returned values need to be per-element if the lod is per element.
Does not actually change behavior yet.
Reviewed-by: Zack Rusin <zackr@vmware.com>
Nowadays -1 for slots means that the semantic is not present, so we need to
store slots in signed variables, otherwise <0 comparisons are pointless.
Fixes
http://bugzilla.eng.vmware.com/show_bug.cgi?id=67811 (at least
with softpipe; edgeflags don't work with llvmpipe).
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
If gs is null, then freeing state->shader.tokens would result in a null
dereference.
Fixes "Dereference after null check" defect reported by Coverity.
Signed-off-by: Vinson Lee <vlee@freedesktop.org>
Reviewed-by: Brian Paul <brianp@vmware.com>
The loop was iterating over all the fs inputs and setting them
to perspective interpolation, then after the loop we were
creating extra output slots with the correct interpolation. Instead
of injecting bogus extra outputs, just set the interpolation
on front face and prim id correctly when doing the initial scan
of fs inputs.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Draw module can decompose primitives into wireframe models, which
is a fancy word for 'lines'; unfortunately that decomposition means
that we weren't able to preserve the original front-face info which
could be derived from the original primitives (lines don't have a
'face'). To fix this, allow the draw module to inject a fake face semantic
into the outputs, from which the backends can figure out the original
frontfacing info of the primitives.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
The spec says that front-face is true if the value is >0 and false
if it's <0. To make sure that we follow the spec, let's just
subtract 0.5 from our value (llvmpipe did 1 for frontface and 0
otherwise), which will get us a positive number for frontface and
a negative one for backface.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Test infs, zeros and nans with our arith functions to ensure
correct/defined behavior with those values.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Usually with fixed point renderbuffers clamping is done as part of conversion.
However, since we blend in float format, we essentially skip all conversion
steps pre-blend but since this is still a fixed point renderbuffer we must
still clamp the inputs in this case. Makes no difference for piglit though.
Obviously we could skip this if fragment color clamping is enabled, but a)
this is deprecated in OpenGL (d3d never had it) and b) we don't support it
natively so it gets baked into the shader.
Also add some comment about logic ops being broken for srgb, luckily no test
tries to do that as there's no easy fix...
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Zack Rusin <zackr@vmware.com>
We were fixing up the blend factor to ZERO, however this only works correctly
with fixed point render buffers where the input values are clamped to 0/1
(because src_alpha_saturate is min(As, 1-Ad) so can be negative with unclamped
inputs). Haven't seen any failure anywhere due to that with fixed point SNORM
buffers (which clamp inputs to -1/1) but it should apply there as well (snorm
blending is rare, even opengl 4.3 doesn't require snorm rendertargets at all,
d3d10 requires them but they are not blendable).
Doesn't look like piglit hits this though (some internal testing hits the
float case at least). (With legacy OpenGL we could theoretically still use the
fixup to zero if the fragment color clamp is enabled, but we can't detect that
easily since we don't support native clamping hence it gets baked into the
shader.)
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Zack Rusin <zackr@vmware.com>
Just use the new conversion functions to do the work. The way this is
plugged into the blend code is quite hacktastic, but it follows all the
same hacks already used for the packed float format.
Only support 4x8bit srgb formats (rgba/rgbx plus swizzle); 24bit formats
never worked anyway in the blend code and are thus disabled, and I don't
think anyone is interested in L8/L8A8, which would need even more hacks.
Unless I'm missing something, this is the last feature except MSAA needed for
OpenGL 3.0, and for OpenGL 3.1 as well I believe.
v2: prettify a bit, use separate function for packing.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
The D3D10 spec is very explicit about the treatment of denorm floats: the
behavior is exactly the same for them as it would be for -0 or +0. This
makes our shading code match that behavior, since OpenGL doesn't care and
on a few cpus it's faster (worst case the same).
Float16 conversions will likely break, but we'll fix them in a follow-up
commit.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
d3d10 requires per-pixel lod calculations for explicit lod, lod bias and
explicit derivatives, and we should probably do it for OpenGL too - at
least if they are used from vertex or geometry shaders (so this doesn't
apply to lod bias), this doesn't just affect neighboring pixels.
Some code was already there to handle this, so fix it up and enable it.
There will no doubt be a performance hit, unfortunately; we could do better
if we knew we had a real vector shift instruction (with variable shift
count), but this requires AVX2 on x86 (or an AMD Bulldozer family cpu).
Don't do anything for lod bias and explicit derivatives yet, though no
special magic should be needed for them either.
Likewise, the size query is still broken just the same.
v2: Use information if lod is a (broadcast) scalar or not. The idea would be
to base this on the actual value, for now just pretend it's a scalar in fs
and not a scalar otherwise (so, per-pixel lod is only used in gs/vs but same
code is generated for fs as before).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
b04a295a4a removed seemingly unnecessary
code in get_query. It turns out this code could in fact be reached: while
timestamps are always binned, if there are no bins (which happens if the fb
size is 0) then the rasterization query code filling this in is still
never executed.
So fix this up by filling in some timestamp, but do it at EndQuery time,
not GetQuery time, which should be more appropriate.
Makes piglit arb_timer_query-timestamp-get happy again.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This was just ignored (unless draw was handling it for some reason, like
for unfilled polys).
I'm not convinced of this code; putting the float for the clamp in the key
isn't really a good idea. Then again the other floats for depth bias are
already in there too anyway (we should probably have a jit_context for the
setup function), so this is just a quick fix.
Also, the "minimum resolvable depth difference" used isn't really right as it
should be calculated according to the z values of the current primitive
and not be a constant (of course, this only makes a difference for float
depth buffers), at least for d3d10, so depth biasing is still not quite right.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
If there are queries active, the opaque optimization resetting the bin
needs to be disabled.
(Not really tested, since the bug was discovered by code inspection, not
an actual test failure.)
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
OpenGL doesn't support this but d3d10 does.
It is a bit of a pain as it is necessary to keep track of queries still
active at the end of a scene, which is also why I cheat a bit and limit the
number of simultaneously active queries to an (arbitrary) 16 (this
simplifies things because we don't have to deal with a real list that way).
I can't think of a reason why you'd really want large numbers of
overlapping/nested queries, so it is hopefully fine.
(This only affects queries which need to be binned.)
v2: don't copy the remainder of the array when deleting an entry; simply
replace the deleted entry with the last one (order doesn't matter).
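The v2 deletion scheme in a nutshell (sketch):

/* replace the removed entry with the last one; order doesn't matter */
static void
active_query_remove(void **entries, unsigned *count, unsigned i)
{
   entries[i] = entries[--(*count)];
}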
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Previously lp_rast_begin_query commands were always inserted into each bin,
and re-issued if the scene was restarted, while lp_rast_end_query commands
were executed for each still active query at the end of tile rasterization.
Also, the ps_invocations and vis_counter were set to zero when the respective
command was encountered.
This however cannot work for multiple queries of the same type (note that
occlusion counter and occlusion predicate while different type were also
affected).
So, change the logic to always set the ps_invocations and vis_counter to zero
at the start of tile rasterization, and then use "start" and "end" per-thread
query values when encountering the begin/end query commands instead, which
should work for multiple queries of the same type. This also means queries do
not have to be reissued in a new scene, however they still need to be finished
at end of tile rasterization, so a list of queries still active at the end of
a scene needs to be maintained.
Also while here don't bin the queries which don't do anything in rasterization.
(This change does not actually handle multiple queries of the same type yet,
as the list of active queries is just a simple fixed array and setup can still
only have one query active per type.)
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Squashed commit of the following:
commit 0857a7e105bfcbc4d1431b2cc56612094c747ca3
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:07 2013 -0400
gallivm: Fix lp_build_rgba8_to_fi32_soa for big endian
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit 0d65131649a8aa140e2db228ba779d685c4333e3
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:07 2013 -0400
gallivm: Fix big-endian machines
This adds a bit-shift count to the format table, and adds the concept of
vector or bitwise alignment on gathers.
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit 9740bda9b7dc894b629ed38be9b51059ce90818f
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:07 2013 -0400
llvmpipe: Fix convert_to_blend_type on big-endian
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit ae037c2de0f029e4e99371c0de25560484f0d8df
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:06 2013 -0400
util: Convert color pack to packed formats
This fixes them on big-endian.
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit 5b05ac0c89ae092ea8ba5bba9f739708d7396b5c
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:06 2013 -0400
graw-xlib: Convert to packed formats
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit 51396e7d098cb6ff794391cf11afe4dbf86dbea0
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:06 2013 -0400
format: Convert to packed formats
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit 417b60bc66eb450e68a92ab0e47f76e292b385e6
Author: Adam Jackson <ajax@redhat.com>
Date: Tue Jun 18 12:25:06 2013 -0400
st/dri: Convert to packed formats
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit 0934b2e022a5e0847d312c40734e2b44cac52fd8
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:06 2013 -0400
st/xlib: Convert to packed formats
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit a307ea3c3716a706963acce7966b5e405ba11db9
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:06 2013 -0400
gbm: Convert to packed formats
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit 53eebdd253e1960a645ea278f31d7ef6a6cf4aeb
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:06 2013 -0400
tests: Convert to packed formats
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit 2f77fe3ee524945eacd546efcac34f7799fb3124
Author: Adam Jackson <ajax@redhat.com>
Date: Tue Jun 18 13:07:37 2013 -0400
gallium: Document packed formats
Signed-off-by: Adam Jackson <ajax@redhat.com>
commit 1f1017159ce951f922210a430de9229f91f62714
Author: Richard Sandiford <r.sandiford@uk.ibm.com>
Date: Tue Jun 18 12:25:06 2013 -0400
gallium: Introduce 32-bit packed format names
These are for interacting with buffers natively described in terms of
bit shifts, like X11 visuals:
uint32_t xyzw8888 = (x << 0) | (y << 8) | (z << 16) | (w << 24);
Define these in terms of (endian-dependent) aliases to the array-style
format names.
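For example, with hypothetical names mirroring the xyzw8888 sample above:

#ifdef PIPE_ARCH_LITTLE_ENDIAN
#define PIPE_FORMAT_XYZW8888_UNORM PIPE_FORMAT_X8Y8Z8W8_UNORM
#else
#define PIPE_FORMAT_XYZW8888_UNORM PIPE_FORMAT_W8Z8Y8X8_UNORM
#endif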
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Richard Sandiford <r.sandiford@uk.ibm.com>
commit 6cc7ab1ee66ed668da78c1d951dfd7782b4e786a
Author: Adam Jackson <ajax@redhat.com>
Date: Mon Jun 3 12:10:32 2013 -0400
gallium: Document format name conventions
v2:
- Fix a channel name thinko (Michel Dänzer)
- Elaborate on SCALED versus INT
- Add links to DirectX and FOURCC docs
Signed-off-by: Adam Jackson <ajax@redhat.com>
commit df4d269e7fb62051a3c029b84147465001e5776e
Author: Adam Jackson <ajax@redhat.com>
Date: Tue Jun 18 12:25:06 2013 -0400
gallivm: Remove all notion of byte-swapping
Signed-off-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Adam Jackson <ajax@redhat.com>
The result isn't always 0 in this case (it depends on the query type), so
instead of special-casing this just use the ordinary path (which should
produce correct values thanks to the initialization in query_begin/end),
just skipping the fence wait.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Handle PIPE_QUERY_GPU_FINISHED and PIPE_QUERY_TIMESTAMP_DISJOINT, and also
fill out the ps_invocations and c_primitives from
PIPE_QUERY_PIPELINE_STATISTICS (the others in there should already be
handled). Note that ps_invocations isn't pixel exact, just 16-pixel exact,
but I guess it's better than nothing.
Doesn't really seem to work correctly, but there are probably bugs
elsewhere.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This reverts commit 41966fdb3b.
While it's a lot cleaner, it causes regressions because the draw interface
is always called from the draw functions of the drivers (because the
buffers need to be mapped), which means that the stream output buffers end
up being cleared on every draw rather than when they are set.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Honor render_condition for clear_render_target and clear_depth_stencil.
Also add minimal support for the occlusion predicate, though it can't be
active at the same time as an occlusion query yet.
While here, also switchify some large if-else (actually just mutually
exclusive if-if-if...) constructs.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
For conditional rendering this makes it possible to skip rendering if the
predicate is either true or false, as supported by d3d10 (previously it was
sort of implied: skip rendering if the predicate is false for the occlusion
predicate, and true for the so_overflow predicate).
There's no cap bit for this as presumably all drivers could do it trivially
(but this patch does not implement it for the drivers using true hw
predicates, nvxx, r600 and radeonsi; no change is expected for OpenGL
functionality).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Move clearing of the draw so target buffers to the draw module. They had
to be cleared in the drivers before, which was quite messy.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Use new util_fill_box helper for util_clear_render_target.
(Also fix off-by-one map error.)
v2: handle non-zero z correctly in new helper
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Believe it or not but these two are actually the first two functions which
really belong in this file nowadays.
Reviewed-by: Brian Paul <brianp@vmware.com>
Mostly just make sure the layer parameter gets passed through to the right
places (and gets clamped, which can be done at setup time), fix up clears
to clear all layers, and disable the opaque optimization. Luckily we don't
need to touch the jitted code.
(Clears invoked via pipe's clear_render_target method will not work however
since the pipe_util_clear function used for it doesn't handle clearing
multiple layers yet.)
v2: per Brian's suggestion, prettify var initialization and add some comments,
add assertion for impossible layer specification for surface.
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This was always doing per-pixel alignment, which isn't necessary except
for the buffer case (due to the per-element offset). The disabled code for
calculating it was incorrect because it assumed that the full block would
always be fetched, which may not be the case, so fix this up.
The original code failed for instance for r10g10b10a2: the alignment would
have been calculated as 4 (block_width) * 4 (bytes), so 16, but the actual
fetch may have only fetched 2 values at a time, hence only alignment 8 -
it is unclear what exactly would happen in this case (alignment larger
than the size to fetch).
So just use the (already calculated) fetch size instead and derive the
alignment from that, which should always work, no matter if fetching 1, 2
or 4 pixels.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
For rendering to buffers, we cannot have any y alignment.
So make sure that tile clear commands only clear up to the fb width/height,
not more (do this for all resources actually, as clearing more seems
pointless for other resources too). For the jit fs function, skip execution
of the lower half of the fragment shader for the 4x4 stamp completely;
for depth/stencil only load/store the values from the first row
(replacing the other row with undef).
For the blend function, also only load half the values from the fs output
and replace the rest with undefs, so that everything still operates on the
full 4x4 block to keep the code the same between 4x1 and 4x4 (except for
load/store of course, which also need to skip (store) or replace (load)
these values with undefs), at the cost of slightly less optimal code being
produced in some cases.
Also reduce 1d and 1d array alignment too, because they can be handled the
same as buffers so don't need to waste memory.
v2: don't try to run special blend code for 4x1, (very) slightly less
complexity if we just use the same code as for 4x4 which may or may not
make it easier to optimize in the future (as we care a lot more about 4x4
performance than 1d).
v2: don't use undef values for unused fs src outputs with llvm 3.1 as it
apparently can trigger a bug in llvm.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Some parameters were used inconsistently, for instance not using
block_width/block_height/block_size for deriving the number of pixels but
rather relying on guesses from the number of fragment shaders etc., so fix
this up (no actual change in behavior since the block size stays fixed).
(Though most of the code would work with a different block_height, with
three exceptions: the hacked r11g11b10 conversions and the twiddle code,
which only work with block_height 2 not 1, and the blend vector type not
being 128 bits wide.)
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
There's no good reason why it can't handle the 2x4f->1x8ub, 1x4f->1x4ub and
1x8f->1x8ub cases: there might be legitimate reasons why we don't have
enough input vectors for a full destination vector, and using pack
intrinsics should still be much better than using generic conversion (it
looks like convert_alpha from the blend code might hit this, though I
suspect it could be avoided).
v2: add another test vector format to lp_test_conv so this gets tested.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
One of the assertions made no sense for buffer rendertargets (due to the
union), so drop it. (The same assertion is already present in the path for
texture surfaces later.)
v2: make assertion completely accurate (suggested by Jose).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
The overallocation was very bad especially for things like 1d array
textures, which got blown up by a factor of 64. (Even ordinary smallish 2d
textures benefit a lot from this; a mipmapped 64x64 rgba8 texture
previously used 7*16kB = 112kB instead of now ~22kB.)
4x4 is chosen because this is the size the jit functions run on, so making
it smaller is going to be a bit more complicated.
It is actually not strictly 4x4 pixels, since we want to avoid situations
where different threads are rendering to the same cacheline, so we keep
cacheline size alignment in the x direction (often 64 bytes).
To make this work introduce new task width/height parameters and make
sure clears don't clear the whole tile if it's a partial tile. Likewise,
the rasterizer may produce fragments outside the 4x4 blocks present in a
tile, so don't call the jit function for them.
This does not yet fix rendering to buffers (which cannot have any y
alignment at all), and 1d/1d array textures are still overallocated by a
factor of 4.
v2: replace magic number 4 with LP_RASTER_BLOCK_SIZE, fix size of buffers
allocated (needed in case we render to them).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
These were mostly just a waste of memory and cache pressure, and were
really only used for debugging.
This change reduces instruction count (as measured by callgrind's Ir
event) of gnome-shell-perf-tool on Ivybridge by 3.5% ± 0.015% (n=20).
Signed-off-by: Adam Jackson <ajax@redhat.com>
We need to clamp to make sure an invalid shader doesn't crash our driver.
The spec says to return the 0-th index for everything that's out of
bounds.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
draw_find_shader_output, like most of the code in draw, used to depend on
position always being at output slot 0, which meant that any other
attribute being at 0 could signify an error. Unfortunately position can be
at any of the output slots, thus other attributes can occupy slot 0 and we
need to mark the ones which were not found by something else. This commit
changes draw_find_shader_output so that it returns -1 if it can't find the
given attribute, and adjusts the code that depended on it returning >0
whenever it correctly found an attrib.
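Callers now look like this sketch:

/* the slot is signed now; -1 means the shader doesn't write the output */
int slot = draw_find_shader_output(draw, TGSI_SEMANTIC_EDGEFLAG, 0);
if (slot < 0) {
   /* semantic not present - don't treat slot 0 as an error anymore */
}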
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Largely related to making sure the rasterizer can correctly
pick out the correct scissor box for the current viewport.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Gallium supported only a single viewport/scissor combination. This
commit changes the interface to allow us to add support for multiple
viewports/scissors.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Marek Olšák <maraeo@gmail.com>
Reviewed-by: José Fonseca <jfonseca@vmware.com>
Reviewed-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Roland Scheidegger <sroland@vmware.com>
Eliminate the rest of the no longer needed layout logic.
(It is possible some code could be simplified a bit further still.)
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This optimization disabled mask checks if the shader was simple enough.
While this should work correctly, the problem is that it can hide real
issues because shaders in practice are usually complex enough (8
instructions or 1 texture is already enough) that this doesn't get used,
whereas dumbed-down tests which should hit all the same code paths suddenly
do something quite different. This was the reason that bug 41787 could not
easily be tracked down to the stencil test not working correctly (piglit
would in fact have failed some tests without that optimization).
So disable it for now; it's unclear if it's much of a win in any case.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
We actually did early depth/stencil test and late depth/stencil write even
when the shader could kill the fragment (alpha test or discard). Since the
new stencil value depends on whether the fragment is killed by the
depth/stencil test or by the shader (in which case it never reaches the
depth/stencil test), this simply cannot work (we would also possibly skip
writing the new stencil value due to mask checks, but this is a secondary
issue).
So use late depth test / late depth write instead in this case.
(No piglit changes, as it doesn't seem to hit such a bogus early depth
test / late depth write path.)
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
We did mask checks between depth/stencil testing and depth/stencil write.
This meant that if the depth/stencil test killed off all fragments we never
actually wrote the new stencil value. This issue affected all early/late
test/write combinations.
So move the mask check after the depth/stencil write (for early depth
test; we could do the same for late depth test, but it might not be worth
it at that point, so just skip it there).
This addresses https://bugs.freedesktop.org/show_bug.cgi?id=41787.
Piglit does not hit this issue because of the simple_shader optimization
in generate_fs_loop() which means we're skipping the mask checks.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
This was meant to disable some code which isn't needed when depth/stencil
isn't written. However, there's more code which wouldn't be needed in that
case, so having the condition there was just odd (llvm will drop all the
code anyway).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>