gallium/docs - add OpenSWR documentation

Acked-by: Jose Fonseca <jfonseca@vmware.com>
Tim Rowley 2016-02-24 18:28:13 -06:00
parent da4f95d168
commit d003be2a30
5 changed files with 381 additions and 0 deletions


@ -0,0 +1,21 @@
OpenSWR
=======
The Gallium OpenSWR driver is a high-performance, highly scalable
software renderer targeted at visualization workloads. For such
geometry-heavy workloads there is a considerable speedup over llvmpipe,
which is to be expected, as the geometry frontend of llvmpipe is
single-threaded.
This rasterizer is x86-specific and requires AVX or AVX2. The driver
fits into the Gallium framework and reuses gallivm for the TGSI to
vectorized LLVM IR conversion of the shader kernels.
.. toctree::
:glob:
openswr/usage
openswr/faq
openswr/profiling
openswr/knobs


@ -0,0 +1,141 @@
FAQ
===
Why another software rasterizer?
--------------------------------
Good question, given there are already three (swrast, softpipe,
llvmpipe) in the Mesa3D tree. Two important reasons for this:
* Architecture - given our focus on scientific visualization, our
workloads are very different from those of a typical game; we have
heavy vertex load and relatively simple shaders. In addition, the core
counts of machines we run on are much higher. These parameters led
to design decisions very different from llvmpipe's.
* Historical - Intel had developed a high performance software
graphics stack for internal purposes. Later we adapted this
graphics stack for use in visualization and decided to move forward
with Mesa3D to provide a high quality API layer while at the same
time benefiting from the excellent performance the software
rasterizer gives us.
What's the architecture?
------------------------
SWR is a tile-based immediate-mode renderer with a sort-free threading
model which is arranged as a ring of queues. Each entry in the ring
represents a draw context that contains all of the draw state and work
queues. An API thread sets up each draw context and worker threads
will execute both the frontend (vertex/geometry processing) and
backend (fragment) work as required. The ring allows for backend
threads to pull work in order. Large draws are split into chunks to
allow vertex processing to happen in parallel, with the backend work
pickup preserving draw ordering.
Our pipeline uses just-in-time compiled code for the fetch shader that
does vertex attribute gathering and AOS to SOA conversions, the vertex
shader and fragment shaders, streamout, and fragment blending. SWR
core also supports geometry and compute shaders but we haven't exposed
them through our driver yet. The fetch shader, streamout, and blend are
built internally to swr core using LLVM directly, while for the vertex
and pixel shaders we reuse bits of llvmpipe from
``gallium/auxiliary/gallivm`` to build the kernels, which we wrap
differently than llvmpipe's ``auxiliary/draw`` code.
What's the performance?
-----------------------
For the types of high-geometry workloads we're interested in, we are
significantly faster than llvmpipe. This is to be expected, as
llvmpipe only threads the fragment processing and not the geometry
frontend. The performance advantage over llvmpipe roughly scales
linearly with the number of cores available.
While our current performance is quite good, we know there is more
potential in this architecture. When we switched from a prototype
OpenGL driver to Mesa we regressed performance severely, partly due to
interface issues that need tuning, partly due to differences in shader
code generation, and partly due to conformance and feature additions to
the core swr. We are looking to recover most of this performance.
What's the conformance?
-----------------------
The major applications we are targeting are all based on the
Visualization Toolkit (VTK), and as such our development efforts have
been focused on making sure these work as well as possible. Our
current code passes VTK's rendering tests with their new "OpenGL2"
(really OpenGL 3.2) backend at 99%.
piglit testing shows a much lower pass rate, roughly 80% at the time
of writing. Core SWR undergoes rigorous unit testing and we are quite
confident in the rasterizer, and understand the areas where it
currently has issues (example: line rendering is done with triangles,
so doesn't match the strict line rendering rules). The majority of
the piglit failures are errors in our driver layer interfacing Mesa
and SWR. Fixing these issues is one of our major future development
goals.
Why are you open sourcing this?
-------------------------------
* Our customers prefer open source, and allowing them to simply
download the Mesa source and enable our driver makes life much
easier for them.
* The internal gallium APIs are not stable, so we'd like our driver
to be visible for changes.
* It's easier to work with the Mesa community when the source we're
working with can be used as reference.
What are your development plans?
--------------------------------
* Performance - see the performance section earlier for details.
* Conformance - see the conformance section earlier for details.
* Features - core SWR has a lot of functionality we have yet to
expose through our driver, such as MSAA, geometry shaders, compute
shaders, and tessellation.
* AVX512 support
What is the licensing of the code?
----------------------------------
* All code is under the normal Mesa MIT license.
Will this work on AMD?
----------------------
* If using an AMD processor with AVX or AVX2, it should work, though
we don't have that hardware around to test. Patches, if needed,
would be welcome.
Will this work on ARM, MIPS, POWER, <other non-x86 architecture>?
-------------------------------------------------------------------------
* Not without a lot of work. We make extensive use of AVX and AVX2
intrinsics in our code and the in-tree JIT creation. It is not the
intention for this codebase to support non-x86 architectures.
What hardware do I need?
------------------------
* Any x86 processor with at least AVX (introduced in the Intel
Sandy Bridge and AMD Bulldozer microarchitectures in 2011) will
work.
* You don't need a fire-breathing Xeon machine to work on SWR - we do
day-to-day development with laptops and desktop CPUs.
Does one build work on both AVX and AVX2?
-----------------------------------------
Yes. The build system creates two shared libraries, ``libswrAVX.so`` and
``libswrAVX2.so``, and ``swr_create_screen()`` loads the appropriate one at
runtime.
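As a rough way to predict which of the two libraries would be picked on a given Linux machine, the CPU feature flags can be inspected from the shell. The sketch below is only an approximation: the helper name ``swr_expected_lib`` is hypothetical, it reads ``/proc/cpuinfo`` (Linux only), and it does not reproduce how ``swr_create_screen()`` actually detects the CPU.

```shell
# Hypothetical helper: print the library name that matches the AVX level
# advertised in a cpuinfo-style file (defaults to /proc/cpuinfo).
swr_expected_lib() {
    cpuinfo=${1:-/proc/cpuinfo}
    # -w matches whole words only, so "avx" does not match "avx2".
    if grep -qw avx2 "$cpuinfo"; then
        echo libswrAVX2.so
    elif grep -qw avx "$cpuinfo"; then
        echo libswrAVX.so
    else
        echo "no AVX support found" >&2
        return 1
    fi
}
```

Calling ``swr_expected_lib`` without arguments checks the local machine; passing a file name allows checking a saved copy of another machine's ``/proc/cpuinfo``.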


@ -0,0 +1,114 @@
Knobs
=====
OpenSWR has a number of environment variables which control its
operation, in addition to the normal Mesa and gallium controls.
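Because knobs are ordinary environment variables, they can be set for a single invocation from the shell. The sketch below illustrates this with the thread-related knobs documented in this section, and mirrors the documented rule that a non-zero ``KNOB_MAX_WORKER_THREADS`` causes the NUMA, core and hyper-thread limits to be ignored. The helper name ``swr_thread_knob_summary`` is hypothetical and is not part of OpenSWR; it only reports what the documentation says should happen.

```shell
# Hypothetical helper: summarize which thread knobs would take effect,
# using the defaults listed in this section when a knob is unset.
swr_thread_knob_summary() {
    if [ "${KNOB_MAX_WORKER_THREADS:-0}" -ne 0 ]; then
        # Non-zero: workers float, the three binding knobs are ignored.
        echo "floating: ${KNOB_MAX_WORKER_THREADS} unbound worker threads"
    else
        echo "bound: numa=${KNOB_MAX_NUMA_NODES:-0} cores=${KNOB_MAX_CORES_PER_NUMA_NODE:-0} threads_per_core=${KNOB_MAX_THREADS_PER_CORE:-1}"
    fi
}

# Example: check the effect of a setting before launching an application.
#   KNOB_MAX_WORKER_THREADS=4 swr_thread_knob_summary
```

The same ``NAME=value command`` form can be used to pass a knob to the application itself.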
.. envvar:: KNOB_ENABLE_ASSERT_DIALOGS <bool> (true)
Use dialogs when asserts fire. Asserts are only enabled in debug builds.
.. envvar:: KNOB_SINGLE_THREADED <bool> (false)
If enabled, all rendering is performed on the API thread. This is useful mainly for debugging purposes.
.. envvar:: KNOB_DUMP_SHADER_IR <bool> (false)
Dumps shader LLVM IR at various stages of jit compilation.
.. envvar:: KNOB_USE_GENERIC_STORETILE <bool> (false)
Always use generic function for performing StoreTile. Will be slightly slower than using optimized (jitted) path.
.. envvar:: KNOB_FAST_CLEAR <bool> (true)
Replace 3D primitive execute with a SWRClearRT operation and defer clear execution to first backend op on hottile, or hottile store.
.. envvar:: KNOB_MAX_NUMA_NODES <uint32_t> (0)
Maximum # of NUMA-nodes per system used for worker threads. 0 == ALL NUMA-nodes in the system; N == use at most N NUMA-nodes for rendering.
.. envvar:: KNOB_MAX_CORES_PER_NUMA_NODE <uint32_t> (0)
Maximum # of cores per NUMA-node used for worker threads. 0 == ALL non-API thread cores per NUMA-node; N == use at most N cores per NUMA-node.
.. envvar:: KNOB_MAX_THREADS_PER_CORE <uint32_t> (1)
Maximum # of (hyper)threads per physical core used for worker threads. 0 == ALL hyper-threads per core; N == use at most N hyper-threads per physical core.
.. envvar:: KNOB_MAX_WORKER_THREADS <uint32_t> (0)
Maximum worker threads to spawn. IMPORTANT: If this is non-zero, no worker threads will be bound to specific HW threads. They will all be "floating" SW threads. In this case, the three knobs above will be ignored.
.. envvar:: KNOB_BUCKETS_START_FRAME <uint32_t> (1200)
Frame at which to start saving buckets data. NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have an effect.
.. envvar:: KNOB_BUCKETS_END_FRAME <uint32_t> (1400)
Frame at which to stop saving buckets data. NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have an effect.
.. envvar:: KNOB_WORKER_SPIN_LOOP_COUNT <uint32_t> (5000)
Number of spin-loop iterations worker threads will perform before going to sleep when waiting for work.
.. envvar:: KNOB_MAX_DRAWS_IN_FLIGHT <uint32_t> (160)
Maximum number of draws outstanding before API thread blocks.
.. envvar:: KNOB_MAX_PRIMS_PER_DRAW <uint32_t> (2040)
Maximum primitives in a single Draw(). Larger primitives are split into smaller Draw calls. Should be a multiple of (3 * vectorWidth).
.. envvar:: KNOB_MAX_TESS_PRIMS_PER_DRAW <uint32_t> (16)
Maximum primitives in a single Draw() with tessellation enabled. Larger primitives are split into smaller Draw calls. Should be a multiple of (vectorWidth).
.. envvar:: KNOB_MAX_FRAC_ODD_TESS_FACTOR <float> (63.0f)
(DEBUG) Maximum tessellation factor for fractional-odd partitioning.
.. envvar:: KNOB_MAX_FRAC_EVEN_TESS_FACTOR <float> (64.0f)
(DEBUG) Maximum tessellation factor for fractional-even partitioning.
.. envvar:: KNOB_MAX_INTEGER_TESS_FACTOR <uint32_t> (64)
(DEBUG) Maximum tessellation factor for integer partitioning.
.. envvar:: KNOB_BUCKETS_ENABLE_THREADVIZ <bool> (false)
Enable threadviz output.
.. envvar:: KNOB_TOSS_DRAW <bool> (false)
Disable per-draw/dispatch execution.
.. envvar:: KNOB_TOSS_QUEUE_FE <bool> (false)
Stop per-draw execution at worker FE. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.
.. envvar:: KNOB_TOSS_FETCH <bool> (false)
Stop per-draw execution at vertex fetch. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.
.. envvar:: KNOB_TOSS_IA <bool> (false)
Stop per-draw execution at input assembler. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.
.. envvar:: KNOB_TOSS_VS <bool> (false)
Stop per-draw execution at vertex shader. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.
.. envvar:: KNOB_TOSS_SETUP_TRIS <bool> (false)
Stop per-draw execution at primitive setup. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.
.. envvar:: KNOB_TOSS_BIN_TRIS <bool> (false)
Stop per-draw execution at primitive binning. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.
.. envvar:: KNOB_TOSS_RS <bool> (false)
Stop per-draw execution at rasterizer. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h.
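``KNOB_MAX_PRIMS_PER_DRAW`` and ``KNOB_MAX_TESS_PRIMS_PER_DRAW`` should be multiples of ``3 * vectorWidth`` and ``vectorWidth`` respectively. The sketch below checks a candidate value before it is exported. It assumes ``vectorWidth`` is 8 (eight 32-bit lanes in a 256-bit AVX register), which this document does not state explicitly, and the helper name ``check_prims_per_draw`` is hypothetical.

```shell
# Hypothetical helper: check that VALUE is a multiple of FACTOR * vectorWidth.
# Usage: check_prims_per_draw VALUE FACTOR [VECTOR_WIDTH]
check_prims_per_draw() {
    value=$1
    factor=$2
    vector_width=${3:-8}   # assumption: 8 lanes per AVX/AVX2 register
    step=$((factor * vector_width))
    if [ $((value % step)) -eq 0 ]; then
        echo "ok: $value is a multiple of $step"
    else
        echo "adjust: nearest lower multiple of $step is $((value / step * step))"
    fi
}

# check_prims_per_draw 2040 3   # default KNOB_MAX_PRIMS_PER_DRAW
# check_prims_per_draw 16 1     # default KNOB_MAX_TESS_PRIMS_PER_DRAW
```

Under that assumption both documented defaults pass the check, since 2040 is 85 times 24 and 16 is 2 times 8.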


@ -0,0 +1,67 @@
Profiling
=========
OpenSWR contains built-in profiling which can be enabled
at build time to provide insight into performance tuning.
To enable this, uncomment the following line in ``rasterizer/core/knobs.h`` and rebuild: ::
//#define KNOB_ENABLE_RDTSC
Running an application will result in a ``rdtsc.txt`` file being
created in the current working directory. This file contains profile
information captured between ``KNOB_BUCKETS_START_FRAME`` and
``KNOB_BUCKETS_END_FRAME`` (see the knobs section).
The resulting file will contain sections for each thread with a
hierarchical breakdown of the time spent in the various operations.
For example: ::
Thread 0 (API)
%Tot %Par Cycles CPE NumEvent CPE2 NumEvent2 Bucket
0.00 0.00 28370 2837 10 0 0 APIClearRenderTarget
0.00 41.23 11698 1169 10 0 0 |-> APIDrawWakeAllThreads
0.00 18.34 5202 520 10 0 0 |-> APIGetDrawContext
98.72 98.72 12413773688 29957 414380 0 0 APIDraw
0.36 0.36 44689364 107 414380 0 0 |-> APIDrawWakeAllThreads
96.36 97.62 12117951562 9747 1243140 0 0 |-> APIGetDrawContext
0.00 0.00 19904 995 20 0 0 APIStoreTiles
0.00 7.88 1568 78 20 0 0 |-> APIDrawWakeAllThreads
0.00 25.28 5032 251 20 0 0 |-> APIGetDrawContext
1.28 1.28 161344902 64 2486370 0 0 APIGetDrawContext
0.00 0.00 50368 2518 20 0 0 APISync
0.00 2.70 1360 68 20 0 0 |-> APIDrawWakeAllThreads
0.00 65.27 32876 1643 20 0 0 |-> APIGetDrawContext
Thread 1 (WORKER)
%Tot %Par Cycles CPE NumEvent CPE2 NumEvent2 Bucket
83.92 83.92 13198987522 96411 136902 0 0 FEProcessDraw
24.91 29.69 3918184840 167 23410158 0 0 |-> FEFetchShader
11.17 13.31 1756972646 75 23410158 0 0 |-> FEVertexShader
8.89 10.59 1397902996 59 23410161 0 0 |-> FEPAAssemble
19.06 22.71 2997794710 384 7803387 0 0 |-> FEClipTriangles
11.67 61.21 1834958176 235 7803387 0 0 |-> FEBinTriangles
0.00 0.00 0 0 187258 0 0 |-> FECullZeroAreaAndBackface
0.00 0.00 0 0 60051033 0 0 |-> FECullBetweenCenters
0.11 0.11 17217556 2869592 6 0 0 FEProcessStoreTiles
15.97 15.97 2511392576 73665 34092 0 0 WorkerWorkOnFifoBE
14.04 87.95 2208687340 9187 240408 0 0 |-> WorkerFoundWork
0.06 0.43 9390536 13263 708 0 0 |-> BELoadTiles
0.00 0.01 293020 182 1609 0 0 |-> BEClear
12.63 89.94 1986508990 949 2093014 0 0 |-> BERasterizeTriangle
2.37 18.75 372374596 177 2093014 0 0 |-> BETriangleSetup
0.42 3.35 66539016 31 2093014 0 0 |-> BEStepSetup
0.00 0.00 0 0 21766 0 0 |-> BETrivialReject
1.05 8.33 165410662 79 2071248 0 0 |-> BERasterizePartial
6.06 48.02 953847796 1260 756783 0 0 |-> BEPixelBackend
0.20 3.30 31521202 41 756783 0 0 |-> BESetup
0.16 2.69 25624304 33 756783 0 0 |-> BEBarycentric
0.18 2.92 27884986 36 756783 0 0 |-> BEEarlyDepthTest
0.19 3.20 30564174 41 744058 0 0 |-> BEPixelShader
0.26 4.30 41058646 55 744058 0 0 |-> BEOutputMerger
1.27 20.94 199750822 32 6054264 0 0 |-> BEEndTile
0.33 2.34 51758160 23687 2185 0 0 |-> BEStoreTiles
0.20 60.22 31169500 28807 1082 0 0 |-> B8G8R8A8_UNORM
0.00 0.00 302752 302752 1 0 0 WorkerWaitForThreadEvent
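A file in this layout can be summarized with standard shell tools. The sketch below lists the buckets with the highest ``%Tot`` and recomputes cycles per event as ``Cycles / NumEvent``, which matches the ``CPE`` column in the sample above (for example 28370 / 10 = 2837). The helper name ``swr_top_buckets`` is hypothetical, and the column positions are inferred from the sample output rather than from a format specification.

```shell
# Hypothetical helper: print "%Tot bucket cycles-per-event" for the top N
# buckets of an rdtsc.txt-style file, across all threads.
# Usage: swr_top_buckets [FILE] [N]
swr_top_buckets() {
    # Data rows start with a number; header and "Thread" lines are skipped.
    # $3 is Cycles, $5 is NumEvent, the last field is the bucket name.
    awk '$1 ~ /^[0-9.]+$/ && NF >= 8 && $5 > 0 {
             printf "%s %s %d\n", $1, $NF, int($3 / $5)
         }' "${1:-rdtsc.txt}" | sort -rn | head -n "${2:-10}"
}
```

Rows from every thread are mixed in the result, so a bucket name can appear more than once when several workers report it.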


@ -0,0 +1,38 @@
Usage
=====
Requirements
^^^^^^^^^^^^
* An x86 processor with AVX or AVX2
* LLVM version 3.6 or later
Building
^^^^^^^^
To build with GNU automake, select building the swr driver at
configure time, for example: ::
configure --with-gallium-drivers=swrast,swr
Using
^^^^^
On Linux, building creates a drop-in replacement for libGL.so in::
lib/gallium/libGL.so
or::
build/foo/gallium/targets/libgl-xlib/libGL.so
To use it, set the LD_LIBRARY_PATH environment variable accordingly.
**IMPORTANT:** Mesa will use llvmpipe or softpipe as the default software renderer. To select the OpenSWR driver, set the GALLIUM_DRIVER environment variable appropriately: ::
GALLIUM_DRIVER=swr
To verify OpenSWR is being used, check to see if a message like the following is printed when the application is started: ::
SWR detected AVX2
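The two environment variables can be combined in a small wrapper so that other applications are not affected. This is a minimal sketch: ``run_with_swr`` and ``MESA_LIB_DIR`` are hypothetical names, and the default directory is simply the first libGL.so location mentioned above, relative to the current directory.

```shell
# Hypothetical wrapper: run a command with the OpenSWR-enabled libGL.so
# first on the library search path and the swr driver selected.
# MESA_LIB_DIR should point at the directory containing the built libGL.so.
run_with_swr() {
    env GALLIUM_DRIVER=swr \
        LD_LIBRARY_PATH="${MESA_LIB_DIR:-lib/gallium}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
        "$@"
}

# Example (glxinfo is only an illustration, any OpenGL application works):
#   run_with_swr glxinfo 2>&1 | grep "SWR detected"
```

Redirecting standard error in the example is a precaution, since this document does not say which stream the detection message is written to.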