gallium/docs - add OpenSWR documentation

Acked-by: Jose Fonseca <jfonseca@vmware.com>
2016-02-24 18:28:13 -06:00 · 2016-02-24 18:28:13 -06:00 · d003be2a30
parent da4f95d168
commit d003be2a30
5 changed files with 381 additions and 0 deletions
--- a/src/gallium/docs/source/drivers/openswr.rst
+++ b/src/gallium/docs/source/drivers/openswr.rst
@ -0,0 +1,21 @@
+OpenSWR
+=======
+
+The Gallium OpenSWR driver is a high performance, highly scalable
+software renderer targeted towards visualization workloads.  For such
+geometry heavy workloads there is a considerable speedup over llvmpipe,
+which is to be expected as the geometry frontend of llvmpipe is single
+threaded.
+
+This rasterizer is x86 specific and requires AVX or AVX2.  The driver
+fits into the gallium framework, and reuses gallivm for doing the TGSI
+to vectorized llvm-IR conversion of the shader kernels.
+
+.. toctree::
+   :glob:
+
+   openswr/usage
+   openswr/faq
+   openswr/profiling
+   openswr/knobs
+
--- a/src/gallium/docs/source/drivers/openswr/faq.rst
+++ b/src/gallium/docs/source/drivers/openswr/faq.rst
@ -0,0 +1,141 @@
+FAQ
+===
+
+Why another software rasterizer?
+--------------------------------
+
+Good question, given there are already three (swrast, softpipe,
+llvmpipe) in the Mesa3D tree. Two important reasons for this:
+
+ * Architecture - given our focus on scientific visualization, our
+   workloads are much different than the typical game; we have heavy
+   vertex load and relatively simple shaders.  In addition, the core
+   counts of machines we run on are much higher.  These parameters led
+   to design decisions much different than llvmpipe.
+
+ * Historical - Intel had developed a high performance software
+   graphics stack for internal purposes.  Later we adapted this
+   graphics stack for use in visualization and decided to move forward
+   with Mesa3D to provide a high quality API layer while at the same
+   time benefiting from the excellent performance the software
+   rasterizerizer gives us.
+
+What's the architecture?
+------------------------
+
+SWR is a tile based immediate mode renderer with a sort-free threading
+model which is arranged as a ring of queues.  Each entry in the ring
+represents a draw context that contains all of the draw state and work
+queues.  An API thread sets up each draw context and worker threads
+will execute both the frontend (vertex/geometry processing) and
+backend (fragment) work as required.  The ring allows for backend
+threads to pull work in order.  Large draws are split into chunks to
+allow vertex processing to happen in parallel, with the backend work
+pickup preserving draw ordering.
+
+Our pipeline uses just-in-time compiled code for the fetch shader that
+does vertex attribute gathering and AOS to SOA conversions, the vertex
+shader and fragment shaders, streamout, and fragment blending. SWR
+core also supports geometry and compute shaders but we haven't exposed
+them through our driver yet. The fetch shader, streamout, and blend is
+built internally to swr core using LLVM directly, while for the vertex
+and pixel shaders we reuse bits of llvmpipe from
+``gallium/auxiliary/gallivm`` to build the kernels, which we wrap
+differently than llvmpipe's ``auxiliary/draw`` code.
+
+What's the performance?
+-----------------------
+
+For the types of high-geometry workloads we're interested in, we are
+significantly faster than llvmpipe.  This is to be expected, as
+llvmpipe only threads the fragment processing and not the geometry
+frontend.  The performance advantage over llvmpipe roughly scales
+linearly with the number of cores available.
+
+While our current performance is quite good, we know there is more
+potential in this architecture.  When we switched from a prototype
+OpenGL driver to Mesa we regressed performance severely, some due to
+interface issues that need tuning, some differences in shader code
+generation, and some due to conformance and feature additions to the
+core swr.  We are looking to recovering most of this performance back.
+
+What's the conformance?
+-----------------------
+
+The major applications we are targeting are all based on the
+Visualization Toolkit (VTK), and as such our development efforts have
+been focused on making sure these work as best as possible.  Our
+current code passes vtk's rendering tests with their new "OpenGL2"
+(really OpenGL 3.2) backend at 99%.
+
+piglit testing shows a much lower pass rate, roughly 80% at the time
+of writing.  Core SWR undergoes rigorous unit testing and we are quite
+confident in the rasterizer, and understand the areas where it
+currently has issues (example: line rendering is done with triangles,
+so doesn't match the strict line rendering rules).  The majority of
+the piglit failures are errors in our driver layer interfacing Mesa
+and SWR.  Fixing these issues is one of our major future development
+goals.
+
+Why are you open sourcing this?
+-------------------------------
+
+ * Our customers prefer open source, and allowing them to simply
+   download the Mesa source and enable our driver makes life much
+   easier for them.
+
+ * The internal gallium APIs are not stable, so we'd like our driver
+   to be visible for changes.
+
+ * It's easier to work with the Mesa community when the source we're
+   working with can be used as reference.
+
+What are your development plans?
+--------------------------------
+
+ * Performance - see the performance section earlier for details.
+
+ * Conformance - see the conformance section earlier for details.
+
+ * Features - core SWR has a lot of functionality we have yet to
+   expose through our driver, such as MSAA, geometry shaders, compute
+   shaders, and tesselation.
+
+ * AVX512 support
+
+What is the licensing of the code?
+----------------------------------
+
+ * All code is under the normal Mesa MIT license.
+
+Will this work on AMD?
+----------------------
+
+ * If using an AMD processor with AVX or AVX2, it should work though
+   we don't have that hardware around to test.  Patches if needed
+   would be welcome.
+
+Will this work on ARM, MIPS, POWER, <other non-x86 architecture>?
+-------------------------------------------------------------------------
+
+ * Not without a lot of work.  We make extensive use of AVX and AVX2
+   intrinsics in our code and the in-tree JIT creation.  It is not the
+   intention for this codebase to support non-x86 architectures.
+
+What hardware do I need?
+------------------------
+
+ * Any x86 processor with at least AVX (introduced in the Intel
+   SandyBridge and AMD Bulldozer microarchitectures in 2011) will
+   work.
+
+ * You don't need a fire-breathing Xeon machine to work on SWR - we do
+   day-to-day development with laptops and desktop CPUs.
+
+Does one build work on both AVX and AVX2?
+-----------------------------------------
+
+Yes. The build system creates two shared libraries, ``libswrAVX.so`` and
+``libswrAVX2.so``, and ``swr_create_screen()`` loads the appropriate one at
+runtime.
+
--- a/src/gallium/docs/source/drivers/openswr/knobs.rst
+++ b/src/gallium/docs/source/drivers/openswr/knobs.rst
@ -0,0 +1,114 @@
+Knobs
+=====
+
+OpenSWR has a number of environment variables which control its
+operation, in addition to the normal Mesa and gallium controls.
+
+.. envvar:: KNOB_ENABLE_ASSERT_DIALOGS <bool> (true)
+
+Use dialogs when asserts fire. Asserts are only enabled in debug builds
+
+.. envvar:: KNOB_SINGLE_THREADED <bool> (false)
+
+If enabled will perform all rendering on the API thread. This is useful mainly for debugging purposes.
+
+.. envvar:: KNOB_DUMP_SHADER_IR <bool> (false)
+
+Dumps shader LLVM IR at various stages of jit compilation.
+
+.. envvar:: KNOB_USE_GENERIC_STORETILE <bool> (false)
+
+Always use generic function for performing StoreTile. Will be slightly slower than using optimized (jitted) path
+
+.. envvar:: KNOB_FAST_CLEAR <bool> (true)
+
+Replace 3D primitive execute with a SWRClearRT operation and defer clear execution to first backend op on hottile, or hottile store
+
+.. envvar:: KNOB_MAX_NUMA_NODES <uint32_t> (0)
+
+Maximum # of NUMA-nodes per system used for worker threads   0 == ALL NUMA-nodes in the system   N == Use at most N NUMA-nodes for rendering
+
+.. envvar:: KNOB_MAX_CORES_PER_NUMA_NODE <uint32_t> (0)
+
+Maximum # of cores per NUMA-node used for worker threads.   0 == ALL non-API thread cores per NUMA-node   N == Use at most N cores per NUMA-node
+
+.. envvar:: KNOB_MAX_THREADS_PER_CORE <uint32_t> (1)
+
+Maximum # of (hyper)threads per physical core used for worker threads.   0 == ALL hyper-threads per core   N == Use at most N hyper-threads per physical core
+
+.. envvar:: KNOB_MAX_WORKER_THREADS <uint32_t> (0)
+
+Maximum worker threads to spawn.  IMPORTANT: If this is non-zero, no worker threads will be bound to specific HW threads.  They will all be "floating" SW threads. In this case, the above 3 KNOBS will be ignored.
+
+.. envvar:: KNOB_BUCKETS_START_FRAME <uint32_t> (1200)
+
+Frame from when to start saving buckets data.  NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have an effect.
+
+.. envvar:: KNOB_BUCKETS_END_FRAME <uint32_t> (1400)
+
+Frame at which to stop saving buckets data.  NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have an effect.
+
+.. envvar:: KNOB_WORKER_SPIN_LOOP_COUNT <uint32_t> (5000)
+
+Number of spin-loop iterations worker threads will perform before going to sleep when waiting for work
+
+.. envvar:: KNOB_MAX_DRAWS_IN_FLIGHT <uint32_t> (160)
+
+Maximum number of draws outstanding before API thread blocks.
+
+.. envvar:: KNOB_MAX_PRIMS_PER_DRAW <uint32_t> (2040)
+
+Maximum primitives in a single Draw(). Larger primitives are split into smaller Draw calls. Should be a multiple of (3 * vectorWidth).
+
+.. envvar:: KNOB_MAX_TESS_PRIMS_PER_DRAW <uint32_t> (16)
+
+Maximum primitives in a single Draw() with tessellation enabled. Larger primitives are split into smaller Draw calls. Should be a multiple of (vectorWidth).
+
+.. envvar:: KNOB_MAX_FRAC_ODD_TESS_FACTOR <float> (63.0f)
+
+(DEBUG) Maximum tessellation factor for fractional-odd partitioning.
+
+.. envvar:: KNOB_MAX_FRAC_EVEN_TESS_FACTOR <float> (64.0f)
+
+(DEBUG) Maximum tessellation factor for fractional-even partitioning.
+
+.. envvar:: KNOB_MAX_INTEGER_TESS_FACTOR <uint32_t> (64)
+
+(DEBUG) Maximum tessellation factor for integer partitioning.
+
+.. envvar:: KNOB_BUCKETS_ENABLE_THREADVIZ <bool> (false)
+
+Enable threadviz output.
+
+.. envvar:: KNOB_TOSS_DRAW <bool> (false)
+
+Disable per-draw/dispatch execution
+
+.. envvar:: KNOB_TOSS_QUEUE_FE <bool> (false)
+
+Stop per-draw execution at worker FE  NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_FETCH <bool> (false)
+
+Stop per-draw execution at vertex fetch  NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_IA <bool> (false)
+
+Stop per-draw execution at input assembler  NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_VS <bool> (false)
+
+Stop per-draw execution at vertex shader  NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_SETUP_TRIS <bool> (false)
+
+Stop per-draw execution at primitive setup  NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_BIN_TRIS <bool> (false)
+
+Stop per-draw execution at primitive binning  NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_RS <bool> (false)
+
+Stop per-draw execution at rasterizer  NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
--- a/src/gallium/docs/source/drivers/openswr/profiling.rst
+++ b/src/gallium/docs/source/drivers/openswr/profiling.rst
@ -0,0 +1,67 @@
+Profiling
+=========
+
+OpenSWR contains built-in profiling  which can be enabled
+at build time to provide insight into performance tuning.
+
+To enable this, uncomment the following line in ``rasterizer/core/knobs.h`` and rebuild: ::
+
+  //#define KNOB_ENABLE_RDTSC
+
+Running an application will result in a ``rdtsc.txt`` file being
+created in current working directory.  This file contains profile
+information captured between the ``KNOB_BUCKETS_START_FRAME`` and
+``KNOB_BUCKETS_END_FRAME`` (see knobs section).
+
+The resulting file will contain sections for each thread with a
+hierarchical breakdown of the time spent in the various operations.
+For example: ::
+
+ Thread 0 (API)
+  %Tot   %Par  Cycles     CPE        NumEvent   CPE2       NumEvent2  Bucket
+   0.00   0.00 28370      2837       10         0          0          APIClearRenderTarget
+   0.00  41.23 11698      1169       10         0          0          |-> APIDrawWakeAllThreads
+   0.00  18.34 5202       520        10         0          0          |-> APIGetDrawContext
+  98.72  98.72 12413773688 29957      414380     0          0          APIDraw
+   0.36   0.36 44689364   107        414380     0          0          |-> APIDrawWakeAllThreads
+  96.36  97.62 12117951562 9747       1243140    0          0          |-> APIGetDrawContext
+   0.00   0.00 19904      995        20         0          0          APIStoreTiles
+   0.00   7.88 1568       78         20         0          0          |-> APIDrawWakeAllThreads
+   0.00  25.28 5032       251        20         0          0          |-> APIGetDrawContext
+   1.28   1.28 161344902  64         2486370    0          0          APIGetDrawContext
+   0.00   0.00 50368      2518       20         0          0          APISync
+   0.00   2.70 1360       68         20         0          0          |-> APIDrawWakeAllThreads
+   0.00  65.27 32876      1643       20         0          0          |-> APIGetDrawContext
+
+
+ Thread 1 (WORKER)
+  %Tot   %Par  Cycles     CPE        NumEvent   CPE2       NumEvent2  Bucket
+  83.92  83.92 13198987522 96411      136902     0          0          FEProcessDraw
+  24.91  29.69 3918184840 167        23410158   0          0          |-> FEFetchShader
+  11.17  13.31 1756972646 75         23410158   0          0          |-> FEVertexShader
+   8.89  10.59 1397902996 59         23410161   0          0          |-> FEPAAssemble
+  19.06  22.71 2997794710 384        7803387    0          0          |-> FEClipTriangles
+  11.67  61.21 1834958176 235        7803387    0          0              |-> FEBinTriangles
+   0.00   0.00 0          0          187258     0          0                  |-> FECullZeroAreaAndBackface
+   0.00   0.00 0          0          60051033   0          0                  |-> FECullBetweenCenters
+   0.11   0.11 17217556   2869592    6          0          0          FEProcessStoreTiles
+  15.97  15.97 2511392576 73665      34092      0          0          WorkerWorkOnFifoBE
+  14.04  87.95 2208687340 9187       240408     0          0          |-> WorkerFoundWork
+   0.06   0.43 9390536    13263      708        0          0              |-> BELoadTiles
+   0.00   0.01 293020     182        1609       0          0              |-> BEClear
+  12.63  89.94 1986508990 949        2093014    0          0              |-> BERasterizeTriangle
+   2.37  18.75 372374596  177        2093014    0          0                  |-> BETriangleSetup
+   0.42   3.35 66539016   31         2093014    0          0                  |-> BEStepSetup
+   0.00   0.00 0          0          21766      0          0                  |-> BETrivialReject
+   1.05   8.33 165410662  79         2071248    0          0                  |-> BERasterizePartial
+   6.06  48.02 953847796  1260       756783     0          0                  |-> BEPixelBackend
+   0.20   3.30 31521202   41         756783     0          0                      |-> BESetup
+   0.16   2.69 25624304   33         756783     0          0                      |-> BEBarycentric
+   0.18   2.92 27884986   36         756783     0          0                      |-> BEEarlyDepthTest
+   0.19   3.20 30564174   41         744058     0          0                      |-> BEPixelShader
+   0.26   4.30 41058646   55         744058     0          0                      |-> BEOutputMerger
+   1.27  20.94 199750822  32         6054264    0          0                      |-> BEEndTile
+   0.33   2.34 51758160   23687      2185       0          0              |-> BEStoreTiles
+   0.20  60.22 31169500   28807      1082       0          0                  |-> B8G8R8A8_UNORM
+   0.00   0.00 302752     302752     1          0          0          WorkerWaitForThreadEvent
+
--- a/src/gallium/docs/source/drivers/openswr/usage.rst
+++ b/src/gallium/docs/source/drivers/openswr/usage.rst
@ -0,0 +1,38 @@
+Usage
+=====
+
+Requirements
+^^^^^^^^^^^^
+
+* An x86 processor with AVX or AVX2
+* LLVM version 3.6 or later
+
+Building
+^^^^^^^^
+
+To build with GNU automake, select building the swr driver at
+configure time, for example: ::
+
+  configure --with-gallium-drivers=swrast,swr
+
+Using
+^^^^^
+
+On Linux, building will create a drop-in alternative for libGL.so into::
+
+  lib/gallium/libGL.so
+
+or::
+
+  build/foo/gallium/targets/libgl-xlib/libGL.so
+
+To use it set the LD_LIBRARY_PATH environment variable accordingly.
+
+**IMPORTANT:** Mesa will default to using llvmpipe or softpipe as the default software renderer.  To select the OpenSWR driver, set the GALLIUM_DRIVER environment variable appropriately: ::
+
+  GALLIUM_DRIVER=swr
+
+To verify OpenSWR is being used, check to see if a message like the following is printed when the application is started: ::
+
+  SWR detected AVX2
+