gallium/docs - add OpenSWR documentation
Acked-by: Jose Fonseca <jfonseca@vmware.com>
parent da4f95d168
commit d003be2a30

@@ -0,0 +1,21 @@
OpenSWR
=======

The Gallium OpenSWR driver is a high-performance, highly scalable
software renderer targeted towards visualization workloads. For such
geometry-heavy workloads there is a considerable speedup over llvmpipe,
which is to be expected, as the geometry frontend of llvmpipe is single
threaded.

This rasterizer is x86 specific and requires AVX or AVX2. The driver
fits into the gallium framework, and reuses gallivm for doing the TGSI
to vectorized LLVM IR conversion of the shader kernels.

.. toctree::
   :glob:

   openswr/usage
   openswr/faq
   openswr/profiling
   openswr/knobs

@@ -0,0 +1,141 @@
FAQ
===

Why another software rasterizer?
--------------------------------

Good question, given there are already three (swrast, softpipe,
llvmpipe) in the Mesa3D tree. Two important reasons for this:

* Architecture - given our focus on scientific visualization, our
  workloads are very different from those of a typical game; we have
  heavy vertex load and relatively simple shaders. In addition, the core
  counts of machines we run on are much higher. These parameters led
  to design decisions quite different from llvmpipe's.

* Historical - Intel had developed a high-performance software
  graphics stack for internal purposes. Later we adapted this
  graphics stack for use in visualization and decided to move forward
  with Mesa3D to provide a high-quality API layer while at the same
  time benefiting from the excellent performance the software
  rasterizer gives us.

What's the architecture?
------------------------

SWR is a tile-based immediate-mode renderer with a sort-free threading
model which is arranged as a ring of queues. Each entry in the ring
represents a draw context that contains all of the draw state and work
queues. An API thread sets up each draw context and worker threads
will execute both the frontend (vertex/geometry processing) and
backend (fragment) work as required. The ring allows for backend
threads to pull work in order. Large draws are split into chunks to
allow vertex processing to happen in parallel, with the backend work
pickup preserving draw ordering.
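
The ring described above can be pictured with a small single-threaded model. This is only an illustrative sketch under stated assumptions, not SWR code: the names ``DrawContext``, ``DrawRing``, ``submit``, ``finish_fe_chunk`` and ``retire`` are hypothetical, and real SWR runs the frontend and backend on worker threads.

```python
from dataclasses import dataclass, field


@dataclass
class DrawContext:
    """One ring entry: a draw plus its frontend work queue (hypothetical model)."""
    draw_id: int
    fe_chunks: list = field(default_factory=list)  # frontend chunks of this draw
    fe_done: int = 0                               # chunks whose frontend work finished


class DrawRing:
    """Fixed-size ring of draw contexts; entries are retired strictly in order."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # next draw id the API thread will submit
        self.tail = 0  # oldest draw id still in flight

    def submit(self, chunks):
        """API-thread side: return the new draw id, or None if the ring is full."""
        if self.head - self.tail == len(self.slots):
            return None  # a real API thread would block here
        dc = DrawContext(self.head, list(chunks))
        self.slots[self.head % len(self.slots)] = dc
        self.head += 1
        return dc.draw_id

    def finish_fe_chunk(self, draw_id):
        """Worker side: frontend chunks of in-flight draws may finish in any order."""
        self.slots[draw_id % len(self.slots)].fe_done += 1

    def retire(self):
        """Backend side: retire draws oldest-first, once their frontend is complete."""
        retired = []
        while self.tail < self.head:
            dc = self.slots[self.tail % len(self.slots)]
            if dc.fe_done < len(dc.fe_chunks):
                break  # the oldest draw is not ready, so later draws must wait
            retired.append(dc.draw_id)
            self.tail += 1
        return retired
```

In this model a later draw whose frontend chunks finish early still waits behind the oldest unfinished draw, which is the ordering property the paragraph above describes.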

Our pipeline uses just-in-time compiled code for the fetch shader that
does vertex attribute gathering and AOS to SOA conversions, the vertex
shader and fragment shaders, streamout, and fragment blending. SWR
core also supports geometry and compute shaders but we haven't exposed
them through our driver yet. The fetch shader, streamout, and blend are
built internally to swr core using LLVM directly, while for the vertex
and pixel shaders we reuse bits of llvmpipe from
``gallium/auxiliary/gallivm`` to build the kernels, which we wrap
differently than llvmpipe's ``auxiliary/draw`` code.

What's the performance?
-----------------------

For the types of high-geometry workloads we're interested in, we are
significantly faster than llvmpipe. This is to be expected, as
llvmpipe only threads the fragment processing and not the geometry
frontend. The performance advantage over llvmpipe scales roughly
linearly with the number of cores available.

While our current performance is quite good, we know there is more
potential in this architecture. When we switched from a prototype
OpenGL driver to Mesa we regressed performance severely, some due to
interface issues that need tuning, some due to differences in shader
code generation, and some due to conformance and feature additions to
the core swr. We are looking to recover most of this performance.

What's the conformance?
-----------------------

The major applications we are targeting are all based on the
Visualization Toolkit (VTK), and as such our development efforts have
been focused on making sure these work as well as possible. Our
current code passes VTK's rendering tests with their new "OpenGL2"
(really OpenGL 3.2) backend at 99%.

Piglit testing shows a much lower pass rate, roughly 80% at the time
of writing. Core SWR undergoes rigorous unit testing and we are quite
confident in the rasterizer, and understand the areas where it
currently has issues (example: line rendering is done with triangles,
so it doesn't match the strict line rendering rules). The majority of
the piglit failures are errors in our driver layer interfacing Mesa
and SWR. Fixing these issues is one of our major future development
goals.

Why are you open sourcing this?
-------------------------------

* Our customers prefer open source, and allowing them to simply
  download the Mesa source and enable our driver makes life much
  easier for them.

* The internal gallium APIs are not stable, so we'd like our driver
  to be visible for changes.

* It's easier to work with the Mesa community when the source we're
  working with can be used as reference.

What are your development plans?
--------------------------------

* Performance - see the performance section earlier for details.

* Conformance - see the conformance section earlier for details.

* Features - core SWR has a lot of functionality we have yet to
  expose through our driver, such as MSAA, geometry shaders, compute
  shaders, and tessellation.

* AVX512 support

What is the licensing of the code?
----------------------------------

* All code is under the normal Mesa MIT license.

Will this work on AMD?
----------------------

* If using an AMD processor with AVX or AVX2, it should work, though
  we don't have that hardware around to test. Patches, if needed,
  would be welcome.

Will this work on ARM, MIPS, POWER, <other non-x86 architecture>?
-------------------------------------------------------------------------

* Not without a lot of work. We make extensive use of AVX and AVX2
  intrinsics in our code and the in-tree JIT creation. It is not the
  intention for this codebase to support non-x86 architectures.

What hardware do I need?
------------------------

* Any x86 processor with at least AVX (introduced in the Intel
  Sandy Bridge and AMD Bulldozer microarchitectures in 2011) will
  work.

* You don't need a fire-breathing Xeon machine to work on SWR - we do
  day-to-day development with laptops and desktop CPUs.

Does one build work on both AVX and AVX2?
-----------------------------------------

Yes. The build system creates two shared libraries, ``libswrAVX.so`` and
``libswrAVX2.so``, and ``swr_create_screen()`` loads the appropriate one at
runtime.
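
As a rough illustration of that runtime choice, the selection could look like the sketch below. It is not how ``swr_create_screen()`` is implemented; the function names ``pick_swr_library`` and ``read_cpu_flags`` are hypothetical, and reading ``/proc/cpuinfo`` is a Linux-only assumption made for the example.

```python
def pick_swr_library(cpu_flags):
    """Return the library name matching the best supported instruction set."""
    flags = {flag.lower() for flag in cpu_flags}
    if "avx2" in flags:
        return "libswrAVX2.so"
    if "avx" in flags:
        return "libswrAVX.so"
    return None  # no AVX at all: the driver cannot run on this CPU


def read_cpu_flags(path="/proc/cpuinfo"):
    """Linux-only helper (assumption): feature flags of the first CPU listed."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return line.split(":", 1)[1].split()
    return []
```

On a Linux machine, ``pick_swr_library(read_cpu_flags())`` gives a quick hint as to which of the two libraries would be expected to load.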

@@ -0,0 +1,114 @@
Knobs
=====

OpenSWR has a number of environment variables which control its
operation, in addition to the normal Mesa and gallium controls.

.. envvar:: KNOB_ENABLE_ASSERT_DIALOGS <bool> (true)

   Use dialogs when asserts fire. Asserts are only enabled in debug builds.

.. envvar:: KNOB_SINGLE_THREADED <bool> (false)

   If enabled, all rendering is performed on the API thread. This is useful mainly for debugging purposes.

.. envvar:: KNOB_DUMP_SHADER_IR <bool> (false)

   Dumps shader LLVM IR at various stages of JIT compilation.

.. envvar:: KNOB_USE_GENERIC_STORETILE <bool> (false)

   Always use the generic function for performing StoreTile. This will be slightly slower than using the optimized (jitted) path.

.. envvar:: KNOB_FAST_CLEAR <bool> (true)

   Replace 3D primitive execute with a SWRClearRT operation and defer clear execution to the first backend op on the hottile, or hottile store.

.. envvar:: KNOB_MAX_NUMA_NODES <uint32_t> (0)

   Maximum number of NUMA nodes per system used for worker threads. 0 == all NUMA nodes in the system; N == use at most N NUMA nodes for rendering.

.. envvar:: KNOB_MAX_CORES_PER_NUMA_NODE <uint32_t> (0)

   Maximum number of cores per NUMA node used for worker threads. 0 == all non-API-thread cores per NUMA node; N == use at most N cores per NUMA node.

.. envvar:: KNOB_MAX_THREADS_PER_CORE <uint32_t> (1)

   Maximum number of (hyper)threads per physical core used for worker threads. 0 == all hyper-threads per core; N == use at most N hyper-threads per physical core.

.. envvar:: KNOB_MAX_WORKER_THREADS <uint32_t> (0)

   Maximum worker threads to spawn. IMPORTANT: If this is non-zero, no worker threads will be bound to specific HW threads; they will all be "floating" SW threads, and the three knobs above will be ignored.
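
To see how the four threading knobs interact, the sketch below computes an upper bound on the worker thread count from a machine topology. It only models the "0 == all, N == at most N" wording above; the function names are hypothetical, the core reserved for the API thread is ignored, and the real thread-creation logic in SWR may differ.

```python
def limit(available, knob):
    """Apply the '0 == all, N == at most N' rule shared by the topology knobs."""
    return available if knob == 0 else min(available, knob)


def worker_thread_upper_bound(numa_nodes, cores_per_node, threads_per_core,
                              max_numa_nodes=0, max_cores_per_numa_node=0,
                              max_threads_per_core=1, max_worker_threads=0):
    """Upper bound on worker threads; keyword defaults mirror the knob defaults."""
    if max_worker_threads != 0:
        # Floating threads: the three topology knobs are ignored in this case.
        return max_worker_threads
    return (limit(numa_nodes, max_numa_nodes)
            * limit(cores_per_node, max_cores_per_numa_node)
            * limit(threads_per_core, max_threads_per_core))
```

For a hypothetical machine with 2 NUMA nodes, 8 cores per node and 2 hyper-threads per core, the defaults give a bound of 16, because ``KNOB_MAX_THREADS_PER_CORE`` defaults to 1.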

.. envvar:: KNOB_BUCKETS_START_FRAME <uint32_t> (1200)

   Frame at which to start saving buckets data. NOTE: ``KNOB_ENABLE_RDTSC`` must be enabled in ``core/knobs.h`` for this to have an effect.

.. envvar:: KNOB_BUCKETS_END_FRAME <uint32_t> (1400)

   Frame at which to stop saving buckets data. NOTE: ``KNOB_ENABLE_RDTSC`` must be enabled in ``core/knobs.h`` for this to have an effect.

.. envvar:: KNOB_WORKER_SPIN_LOOP_COUNT <uint32_t> (5000)

   Number of spin-loop iterations worker threads will perform before going to sleep when waiting for work.

.. envvar:: KNOB_MAX_DRAWS_IN_FLIGHT <uint32_t> (160)

   Maximum number of draws outstanding before the API thread blocks.

.. envvar:: KNOB_MAX_PRIMS_PER_DRAW <uint32_t> (2040)

   Maximum primitives in a single Draw(). Draws with more primitives are split into smaller Draw calls. Should be a multiple of (3 * vectorWidth).
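
The splitting rule for ``KNOB_MAX_PRIMS_PER_DRAW`` can be sketched as follows. This is an illustration, not the SWR implementation: ``split_draw`` is a hypothetical name, and ``VECTOR_WIDTH = 8`` is an assumption (eight 32-bit lanes in a 256-bit AVX register), under which the default of 2040 is a multiple of 3 * 8 = 24.

```python
VECTOR_WIDTH = 8  # assumption: 8 x 32-bit lanes in a 256-bit AVX register


def split_draw(num_prims, max_prims_per_draw=2040):
    """Split a draw into (first_primitive, count) chunks of bounded size."""
    if max_prims_per_draw <= 0 or max_prims_per_draw % (3 * VECTOR_WIDTH) != 0:
        raise ValueError("limit should be a positive multiple of 3 * vectorWidth")
    chunks = []
    start = 0
    while start < num_prims:
        count = min(max_prims_per_draw, num_prims - start)
        chunks.append((start, count))
        start += count
    return chunks
```

With the default limit, a draw of 5000 primitives becomes three chunks: two of 2040 primitives and a final one of 920.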

.. envvar:: KNOB_MAX_TESS_PRIMS_PER_DRAW <uint32_t> (16)

   Maximum primitives in a single Draw() with tessellation enabled. Draws with more primitives are split into smaller Draw calls. Should be a multiple of (vectorWidth).

.. envvar:: KNOB_MAX_FRAC_ODD_TESS_FACTOR <float> (63.0f)

   (DEBUG) Maximum tessellation factor for fractional-odd partitioning.

.. envvar:: KNOB_MAX_FRAC_EVEN_TESS_FACTOR <float> (64.0f)

   (DEBUG) Maximum tessellation factor for fractional-even partitioning.

.. envvar:: KNOB_MAX_INTEGER_TESS_FACTOR <uint32_t> (64)

   (DEBUG) Maximum tessellation factor for integer partitioning.

.. envvar:: KNOB_BUCKETS_ENABLE_THREADVIZ <bool> (false)

   Enable threadviz output.

.. envvar:: KNOB_TOSS_DRAW <bool> (false)

   Disable per-draw/dispatch execution.

.. envvar:: KNOB_TOSS_QUEUE_FE <bool> (false)

   Stop per-draw execution at worker FE. NOTE: Requires ``KNOB_ENABLE_TOSS_POINTS`` to be enabled in ``core/knobs.h``.

.. envvar:: KNOB_TOSS_FETCH <bool> (false)

   Stop per-draw execution at vertex fetch. NOTE: Requires ``KNOB_ENABLE_TOSS_POINTS`` to be enabled in ``core/knobs.h``.

.. envvar:: KNOB_TOSS_IA <bool> (false)

   Stop per-draw execution at input assembler. NOTE: Requires ``KNOB_ENABLE_TOSS_POINTS`` to be enabled in ``core/knobs.h``.

.. envvar:: KNOB_TOSS_VS <bool> (false)

   Stop per-draw execution at vertex shader. NOTE: Requires ``KNOB_ENABLE_TOSS_POINTS`` to be enabled in ``core/knobs.h``.

.. envvar:: KNOB_TOSS_SETUP_TRIS <bool> (false)

   Stop per-draw execution at primitive setup. NOTE: Requires ``KNOB_ENABLE_TOSS_POINTS`` to be enabled in ``core/knobs.h``.

.. envvar:: KNOB_TOSS_BIN_TRIS <bool> (false)

   Stop per-draw execution at primitive binning. NOTE: Requires ``KNOB_ENABLE_TOSS_POINTS`` to be enabled in ``core/knobs.h``.

.. envvar:: KNOB_TOSS_RS <bool> (false)

   Stop per-draw execution at rasterizer. NOTE: Requires ``KNOB_ENABLE_TOSS_POINTS`` to be enabled in ``core/knobs.h``.

@@ -0,0 +1,67 @@
Profiling
=========

OpenSWR contains built-in profiling which can be enabled
at build time to provide insight into performance tuning.

To enable this, uncomment the following line in ``rasterizer/core/knobs.h`` and rebuild: ::

    //#define KNOB_ENABLE_RDTSC

Running an application will result in a ``rdtsc.txt`` file being
created in the current working directory. This file contains profile
information captured between the ``KNOB_BUCKETS_START_FRAME`` and
``KNOB_BUCKETS_END_FRAME`` frames (see the knobs section).

The resulting file will contain sections for each thread with a
hierarchical breakdown of the time spent in the various operations.
For example: ::

    Thread 0 (API)
      %Tot   %Par       Cycles      CPE  NumEvent  CPE2  NumEvent2  Bucket
      0.00   0.00        28370     2837        10     0          0  APIClearRenderTarget
      0.00  41.23        11698     1169        10     0          0  |-> APIDrawWakeAllThreads
      0.00  18.34         5202      520        10     0          0  |-> APIGetDrawContext
     98.72  98.72  12413773688    29957    414380     0          0  APIDraw
      0.36   0.36     44689364      107    414380     0          0  |-> APIDrawWakeAllThreads
     96.36  97.62  12117951562     9747   1243140     0          0  |-> APIGetDrawContext
      0.00   0.00        19904      995        20     0          0  APIStoreTiles
      0.00   7.88         1568       78        20     0          0  |-> APIDrawWakeAllThreads
      0.00  25.28         5032      251        20     0          0  |-> APIGetDrawContext
      1.28   1.28    161344902       64   2486370     0          0  APIGetDrawContext
      0.00   0.00        50368     2518        20     0          0  APISync
      0.00   2.70         1360       68        20     0          0  |-> APIDrawWakeAllThreads
      0.00  65.27        32876     1643        20     0          0  |-> APIGetDrawContext


    Thread 1 (WORKER)
      %Tot   %Par       Cycles      CPE  NumEvent  CPE2  NumEvent2  Bucket
     83.92  83.92  13198987522    96411    136902     0          0  FEProcessDraw
     24.91  29.69   3918184840      167  23410158     0          0  |-> FEFetchShader
     11.17  13.31   1756972646       75  23410158     0          0  |-> FEVertexShader
      8.89  10.59   1397902996       59  23410161     0          0  |-> FEPAAssemble
     19.06  22.71   2997794710      384   7803387     0          0  |-> FEClipTriangles
     11.67  61.21   1834958176      235   7803387     0          0  |-> FEBinTriangles
      0.00   0.00            0        0    187258     0          0  |-> FECullZeroAreaAndBackface
      0.00   0.00            0        0  60051033     0          0  |-> FECullBetweenCenters
      0.11   0.11     17217556  2869592         6     0          0  FEProcessStoreTiles
     15.97  15.97   2511392576    73665     34092     0          0  WorkerWorkOnFifoBE
     14.04  87.95   2208687340     9187    240408     0          0  |-> WorkerFoundWork
      0.06   0.43      9390536    13263       708     0          0  |-> BELoadTiles
      0.00   0.01       293020      182      1609     0          0  |-> BEClear
     12.63  89.94   1986508990      949   2093014     0          0  |-> BERasterizeTriangle
      2.37  18.75    372374596      177   2093014     0          0  |-> BETriangleSetup
      0.42   3.35     66539016       31   2093014     0          0  |-> BEStepSetup
      0.00   0.00            0        0     21766     0          0  |-> BETrivialReject
      1.05   8.33    165410662       79   2071248     0          0  |-> BERasterizePartial
      6.06  48.02    953847796     1260    756783     0          0  |-> BEPixelBackend
      0.20   3.30     31521202       41    756783     0          0  |-> BESetup
      0.16   2.69     25624304       33    756783     0          0  |-> BEBarycentric
      0.18   2.92     27884986       36    756783     0          0  |-> BEEarlyDepthTest
      0.19   3.20     30564174       41    744058     0          0  |-> BEPixelShader
      0.26   4.30     41058646       55    744058     0          0  |-> BEOutputMerger
      1.27  20.94    199750822       32   6054264     0          0  |-> BEEndTile
      0.33   2.34     51758160    23687      2185     0          0  |-> BEStoreTiles
      0.20  60.22     31169500    28807      1082     0          0  |-> B8G8R8A8_UNORM
      0.00   0.00       302752   302752         1     0          0  WorkerWaitForThreadEvent
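
A file in this layout is easy to post-process. The sketch below is one possible way to load it, assuming the whitespace-separated column layout of the example above; ``parse_bucket_line`` and ``parse_profile`` are hypothetical helper names, not part of OpenSWR.

```python
def parse_bucket_line(line):
    """Parse one data row; return None for thread titles, headers and blank lines."""
    fields = line.split()
    if len(fields) < 8:
        return None
    try:
        tot, par = float(fields[0]), float(fields[1])
        cycles, cpe, events = int(fields[2]), int(fields[3]), int(fields[4])
    except ValueError:
        return None  # e.g. the "%Tot %Par Cycles ..." header row
    return {
        "bucket": fields[-1],
        "child": "|->" in fields[7:-1],  # rows marked with "|->" are nested buckets
        "tot": tot,
        "par": par,
        "cycles": cycles,
        "cpe": cpe,
        "events": events,
    }


def parse_profile(text):
    """Group the data rows of an rdtsc.txt-style dump by their 'Thread N' section."""
    threads = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Thread "):
            current = stripped
            threads[current] = []
            continue
        row = parse_bucket_line(stripped)
        if row is not None and current is not None:
            threads[current].append(row)
    return threads
```

For instance, ``max(rows, key=lambda r: r["tot"])`` over the rows of one thread picks out the bucket with the largest share of total time.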

@@ -0,0 +1,38 @@
Usage
=====

Requirements
^^^^^^^^^^^^

* An x86 processor with AVX or AVX2
* LLVM version 3.6 or later

Building
^^^^^^^^

To build with GNU automake, select building the swr driver at
configure time, for example: ::

    configure --with-gallium-drivers=swrast,swr

Using
^^^^^

On Linux, building will create a drop-in alternative for libGL.so in::

    lib/gallium/libGL.so

or::

    build/foo/gallium/targets/libgl-xlib/libGL.so

To use it, set the ``LD_LIBRARY_PATH`` environment variable accordingly.

**IMPORTANT:** Mesa defaults to llvmpipe or softpipe as the software renderer. To select the OpenSWR driver, set the ``GALLIUM_DRIVER`` environment variable appropriately: ::

    GALLIUM_DRIVER=swr

To verify OpenSWR is being used, check to see if a message like the following is printed when the application is started: ::

    SWR detected AVX2
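
The steps above can also be scripted. The following is a minimal sketch, assuming a short-lived test program and assuming the detection message may appear on either stdout or stderr; ``swr_environment`` and ``run_and_check`` are hypothetical helper names.

```python
import os
import subprocess


def swr_environment(lib_dir, base_env=None):
    """Copy the environment, prepend lib_dir to LD_LIBRARY_PATH and select swr."""
    env = dict(os.environ if base_env is None else base_env)
    old_path = env.get("LD_LIBRARY_PATH")
    env["LD_LIBRARY_PATH"] = lib_dir if not old_path else lib_dir + os.pathsep + old_path
    env["GALLIUM_DRIVER"] = "swr"
    return env


def run_and_check(argv, lib_dir):
    """Run argv with OpenSWR selected; report whether an 'SWR detected' line was seen."""
    result = subprocess.run(argv, env=swr_environment(lib_dir),
                            capture_output=True, text=True)
    # Assumption: the message could be on either stream, so both are searched.
    return "SWR detected" in result.stdout + result.stderr
```

``run_and_check`` waits for the program to exit, so it only suits programs that terminate on their own; for an interactive application, launching it with ``swr_environment(...)`` and watching its console output is the simpler route.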