From 8248cc0bf45d0d7558cc3d77a63dcd078a96aa66 Mon Sep 17 00:00:00 2001
From: Erik Faye-Lund
Date: Fri, 26 Apr 2024 16:08:22 +0200
Subject: [PATCH] docs/panfrost: move details to separate articles

The front-page of the docs is currently fairly intimidating, by diving
into details rather abruptly. Let's try to make it a bit easier to
navigate by moving the details to their own articles, but linking them
from the front-page.

Acked-by: Daniel Stone
Part-of:
---
 docs/drivers/panfrost.rst            | 257 +--------------------------
 docs/drivers/panfrost/drm-shim.rst   |  84 +++++++++
 docs/drivers/panfrost/instancing.rst | 112 ++++++++++++
 docs/drivers/panfrost/texcomp.rst    |  17 ++
 docs/drivers/panfrost/tiling.rst     |  38 ++++
 5 files changed, 258 insertions(+), 250 deletions(-)
 create mode 100644 docs/drivers/panfrost/drm-shim.rst
 create mode 100644 docs/drivers/panfrost/instancing.rst
 create mode 100644 docs/drivers/panfrost/texcomp.rst
 create mode 100644 docs/drivers/panfrost/tiling.rst

diff --git a/docs/drivers/panfrost.rst b/docs/drivers/panfrost.rst
index 1be447a89619c..05a1a3d858bcf 100644
--- a/docs/drivers/panfrost.rst
+++ b/docs/drivers/panfrost.rst
@@ -25,7 +25,7 @@ hardware is currently supported:
 
 Other Midgard and Bifrost chips (e.g. G71) are not yet supported.
 
-Older Mali chips based on the Utgard architecture (Mali 400, Mali 450) are
+Older Mali chips based on the Utgard architecture (Mali-400, Mali-450) are
 supported in the :doc:`Lima ` driver, not Panfrost. Lima is also
 available in Mesa.
 
@@ -57,255 +57,12 @@ Panfrost developers and users hang out on IRC at ``#panfrost`` on OFTC. Note
 that registering and authenticating with ``NickServ`` is required to prevent
 spam. `Join the chat. `_
 
-Compressed texture support
---------------------------
+Technical details
+-----------------
 
-In the driver, Panfrost supports ASTC, ETC, and all BCn formats (e.g. RGTC,
-S3TC, etc.) However, Panfrost depends on the hardware to support these formats
-efficiently. All supported Mali architectures support these formats, but not
-every system-on-chip with a Mali GPU support all these formats. Many lower-end
-systems lack support for some BCn formats, which can cause problems when playing
-desktop games with Panfrost. To check whether this issue applies to your
-system-on-chip, Panfrost includes a ``panfrost_texfeatures`` tool to query
-supported formats.
+You can read more technical details about Panfrost here:
 
-To use this tool, include the option ``-Dtools=panfrost`` when configuring Mesa.
-Then inside your Mesa build directory, the tool is located at
-``src/panfrost/tools/panfrost_texfeatures``. Copy it to your target device,
-set as executable as necessary, and run on the target device. A table of
-supported formats will be printed to standard output.
+.. toctree::
+   :glob:
 
-drm-shim
---------
-
-Panfrost implements ``drm-shim``, stubbing out the Panfrost kernel interface.
-Use cases for this functionality include:
-
-- Future hardware bring up
-- Running shader-db on non-Mali workstations
-- Reproducing compiler (and some driver) bugs without Mali hardware
-
-Although Mali hardware is usually paired with an Arm CPU, Panfrost is portable C
-code and should work on any Linux machine. In particular, you can test the
-compiler on shader-db on an Intel desktop.
-
-To build Mesa with Panfrost drm-shim, configure Meson with
-``-Dgallium-drivers=panfrost`` and ``-Dtools=drm-shim``. See the above
-building section for a full invocation. The drm-shim binary will be built to
-``build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so``.
-
-To use, set the ``LD_PRELOAD`` environment variable to the drm-shim binary. It
-may also be necessary to set ``LIBGL_DRIVERS_PATH`` to the location where Mesa
-was installed.
-
-By default, drm-shim mocks a Mali-G52 system. To select a specific Mali GPU,
-set the ``PAN_GPU_ID`` environment variable to the desired GPU ID:
-
-========= ============= =======
-Product   Architecture  GPU ID
-========= ============= =======
-Mali-T720 Midgard (v4)  720
-Mali-T860 Midgard (v5)  860
-Mali-G72  Bifrost (v6)  6221
-Mali-G52  Bifrost (v7)  7212
-Mali-G57  Valhall (v9)  9093
-Mali-G610 Valhall (v10) a867
-========= ============= =======
-
-Additional GPU IDs are enumerated in the ``panfrost_model_list`` list in
-``src/panfrost/lib/pan_props.c``.
-
-As an example: assuming Mesa is installed to a local path ``~/lib`` and Mesa's
-build directory is ``~/mesa/build``, a shader can be compiled for Mali-G52 as:
-
-.. code-block:: sh
-
-   ~/shader-db$ BIFROST_MESA_DEBUG=shaders \
-   LIBGL_DRIVERS_PATH=~/lib/dri/ \
-   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
-   PAN_GPU_ID=7212 \
-   ./run shaders/glmark/1-1.shader_test
-
-The same shader can be compiled for Mali-T720 as:
-
-.. code-block:: sh
-
-   ~/shader-db$ MIDGARD_MESA_DEBUG=shaders \
-   LIBGL_DRIVERS_PATH=~/lib/dri/ \
-   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
-   PAN_GPU_ID=720 \
-   ./run shaders/glmark/1-1.shader_test
-
-These examples set the compilers' ``shaders`` debug flags to dump the optimized
-NIR, backend IR after instruction selection, backend IR after register
-allocation and scheduling, and a disassembly of the final compiled binary.
-
-As another example, this invocation runs a single dEQP test "on" Mali-G52,
-pretty-printing GPU data structures and disassembling all shaders
-(``PAN_MESA_DEBUG=trace``) as well as dumping raw GPU memory
-(``PAN_MESA_DEBUG=dump``). The ``EGL_PLATFORM=surfaceless`` environment variable
-and various flags to dEQP mimic the surfaceless environment that our
-continuous integration (CI) uses. This eliminates window system dependencies,
-although it requires a specially built CTS:
-
-.. code-block:: sh
-
-   ~/VK-GL-CTS/build/external/openglcts/modules$ PAN_MESA_DEBUG=trace,dump \
-   LIBGL_DRIVERS_PATH=~/lib/dri/ \
-   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
-   PAN_GPU_ID=7212 EGL_PLATFORM=surfaceless \
-   ./glcts --deqp-surface-type=pbuffer \
-   --deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 \
-   --deqp-surface-height=256 -n \
-   dEQP-GLES31.functional.shaders.builtin_functions.common.abs.float_highp_compute
-
-U-interleaved tiling
----------------------
-
-Panfrost supports u-interleaved tiling. U-interleaved tiling is
-indicated by the ``DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED`` modifier.
-
-The tiling reorders whole pixels (blocks). It does not compress or modify the
-pixels themselves, so it can be used for any image format. Internally, images
-are divided into tiles. Tiles occur in source order, but pixels (blocks) within
-each tile are reordered according to a space-filling curve.
-
-For regular formats, 16x16 tiles are used. This harmonizes with the default tile
-size for binning and CRCs (transaction elimination). It also means a single line
-(16 pixels) at 4 bytes per pixel equals a single 64-byte cache line.
-
-For formats that are already block compressed (S3TC, RGTC, etc), 4x4 tiles are
-used, where entire blocks are reorder. Most of these formats compress 4x4
-blocks, so this gives an effective 16x16 tiling. This justifies the tile size
-intuitively, though it's not a rule: ASTC may uses larger blocks.
-
-Within a tile, the X and Y bits are interleaved (like Morton order), but with a
-twist: adjacent bit pairs are XORed. The reason to add XORs is not obvious.
-Visually, addresses take the form::
-
-   | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
-
-Reference routines to encode/decode u-interleaved images are available in
-``src/panfrost/shared/test/test-tiling.cpp``, which documents the space-filling
-curve. This reference implementation is used to unit test the optimized
-implementation used in production. The optimized implementation is available in
-``src/panfrost/shared/pan_tiling.c``.
-
-Although these routines are part of Panfrost, they are also used by Lima, as Arm
-introduced the format with Utgard. It is the only tiling supported on Utgard. On
-Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
-should be used instead where possible. However, not all formats are
-compressible, so u-interleaved tiling remains an important fallback on Panfrost.
-
-Instancing
-----------
-
-The attribute descriptor lets the attribute unit compute the address of an
-attribute given the vertex and instance ID. Unfortunately, the way this works is
-rather complicated when instancing is enabled.
-
-To explain this, first we need to explain how compute and vertex threads are
-dispatched. When a quad is dispatched, it receives a single, linear index.
-However, we need to translate that index into a (vertex id, instance id) pair.
-One option would be to do:
-
-.. math::
-   \text{vertex id} = \text{linear id} \% \text{num vertices}
-
-   \text{instance id} = \text{linear id} / \text{num vertices}
-
-but this involves a costly division and modulus by an arbitrary number.
-Instead, we could pad num_vertices. We dispatch padded_num_vertices *
-num_instances threads instead of num_vertices * num_instances, which results
-in some "extra" threads with vertex_id >= num_vertices, which we have to
-discard. The more we pad num_vertices, the more "wasted" threads we
-dispatch, but the division is potentially easier.
-
-One straightforward choice is to pad num_vertices to the next power of two,
-which means that the division and modulus are just simple bit shifts and
-masking. But the actual algorithm is a bit more complicated. The thread
-dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
-to dividing by a power of two. As a result, padded_num_vertices can be
-1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
-since we need less padding.
-
-padded_num_vertices is picked by the hardware. The driver just specifies the
-actual number of vertices. Note that padded_num_vertices is a multiple of four
-(presumably because threads are dispatched in groups of 4). Also,
-padded_num_vertices is always at least one more than num_vertices, which seems
-like a quirk of the hardware. For larger num_vertices, the hardware uses the
-following algorithm: using the binary representation of num_vertices, we look at
-the most significant set bit as well as the following 3 bits. Let n be the
-number of bits after those 4 bits. Then we set padded_num_vertices according to
-the following table:
-
-========== =======================
-high bits  padded_num_vertices
-========== =======================
-1000       :math:`9 \cdot 2^n`
-1001       :math:`5 \cdot 2^{n+1}`
-101x       :math:`3 \cdot 2^{n+2}`
-110x       :math:`7 \cdot 2^{n+1}`
-111x       :math:`2^{n+4}`
-========== =======================
-
-For example, if num_vertices = 70 is passed to glDraw(), its binary
-representation is 1000110, so n = 3 and the high bits are 1000, and
-therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.
-
-The attribute unit works in terms of the original linear_id. if
-num_instances = 1, then they are the same, and everything is simple.
-However, with instancing things get more complicated. There are four
-possible modes, two of them we can group together:
-
-1. Use the linear_id directly. Only used when there is no instancing.
-
-2. Use the linear_id modulo a constant. This is used for per-vertex
-attributes with instancing enabled by making the constant equal
-padded_num_vertices. Because the modulus is always padded_num_vertices, this
-mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
-The shift field specifies the power of two, while the extra_flags field
-specifies the odd number. If shift = n and extra_flags = m, then the modulus
-is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
-computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
-extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
-algorithm used to get padded_num_vertices in order to correctly implement
-per-vertex attributes.
-
-3. Divide the linear_id by a constant. In order to correctly implement
-instance divisors, we have to divide linear_id by padded_num_vertices times
-to user-specified divisor. So first we compute padded_num_vertices, again
-following the exact same algorithm that the hardware uses, then multiply it
-by the GL-level divisor to get the hardware-level divisor. This case is
-further divided into two more cases. If the hardware-level divisor is a
-power of two, then we just need to shift. The shift amount is specified by
-the shift field, so that the hardware-level divisor is just
-:math:`2^\text{shift}`.
-
-If it isn't a power of two, then we have to divide by an arbitrary integer.
-For that, we use the well-known technique of multiplying by an approximation
-of the inverse. The driver must compute the magic multiplier and shift
-amount, and then the hardware does the multiplication and shift. The
-hardware and driver also use the "round-down" optimization as described in
-https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
-The hardware further assumes the multiplier is between :math:`2^{31}` and
-:math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
-to 0 by the driver -- presumably this simplifies the hardware multiplier a
-little. The hardware first multiplies linear_id by the multiplier and
-takes the high 32 bits, then applies the round-down correction if
-extra_flags = 1, then finally shifts right by the shift field.
-
-There are some differences between ridiculousfish's algorithm and the Mali
-hardware algorithm, which means that the reference code from ridiculousfish
-doesn't always produce the right constants. Mali does not use the pre-shift
-optimization, since that would make a hardware implementation slower (it
-would have to always do the pre-shift, multiply, and post-shift operations).
-It also forces the multiplier to be at least :math:`2^{31}`, which means
-that the exponent is entirely fixed, so there is no trial-and-error.
-Altogether, given the divisor d, the algorithm the driver must follow is:
-
-1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
-2. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} % d`.
-3. If :math:`e <= 2^{shift}`, then we need to use the round-down algorithm. Set
-   magic_divisor = m - 1 and extra_flags = 1. 4. Otherwise, set magic_divisor =
-   m and extra_flags = 0.
+   panfrost/*
diff --git a/docs/drivers/panfrost/drm-shim.rst b/docs/drivers/panfrost/drm-shim.rst
new file mode 100644
index 0000000000000..874ac37c2f94a
--- /dev/null
+++ b/docs/drivers/panfrost/drm-shim.rst
@@ -0,0 +1,84 @@
+
+drm-shim
+========
+
+Panfrost implements ``drm-shim``, stubbing out the Panfrost kernel interface.
+Use cases for this functionality include:
+
+- Future hardware bring-up
+- Running shader-db on non-Mali workstations
+- Reproducing compiler (and some driver) bugs without Mali hardware
+
+Although Mali hardware is usually paired with an Arm CPU, Panfrost is portable C
+code and should work on any Linux machine. In particular, you can test the
+compiler on shader-db on an Intel desktop.
+
+To build Mesa with Panfrost drm-shim, configure Meson with
+``-Dgallium-drivers=panfrost`` and ``-Dtools=drm-shim``. See the building
+section of the main Panfrost page for a full invocation. The drm-shim binary
+will be built to ``build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so``.
+
+To use, set the ``LD_PRELOAD`` environment variable to the drm-shim binary. It
+may also be necessary to set ``LIBGL_DRIVERS_PATH`` to the location where Mesa
+was installed.
+
+By default, drm-shim mocks a Mali-G52 system. To select a specific Mali GPU,
+set the ``PAN_GPU_ID`` environment variable to the desired GPU ID:
+
+========= ============= =======
+Product   Architecture  GPU ID
+========= ============= =======
+Mali-T720 Midgard (v4)  720
+Mali-T860 Midgard (v5)  860
+Mali-G72  Bifrost (v6)  6221
+Mali-G52  Bifrost (v7)  7212
+Mali-G57  Valhall (v9)  9093
+Mali-G610 Valhall (v10) a867
+========= ============= =======
+
+Additional GPU IDs are enumerated in the ``panfrost_model_list`` list in
+``src/panfrost/lib/pan_props.c``.
+
+As an example: assuming Mesa is installed to a local path ``~/lib`` and Mesa's
+build directory is ``~/mesa/build``, a shader can be compiled for Mali-G52 as:
+
+.. code-block:: sh
+
+   ~/shader-db$ BIFROST_MESA_DEBUG=shaders \
+   LIBGL_DRIVERS_PATH=~/lib/dri/ \
+   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
+   PAN_GPU_ID=7212 \
+   ./run shaders/glmark/1-1.shader_test
+
+The same shader can be compiled for Mali-T720 as:
+
+.. code-block:: sh
+
+   ~/shader-db$ MIDGARD_MESA_DEBUG=shaders \
+   LIBGL_DRIVERS_PATH=~/lib/dri/ \
+   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
+   PAN_GPU_ID=720 \
+   ./run shaders/glmark/1-1.shader_test
+
+These examples set the compilers' ``shaders`` debug flags to dump the optimized
+NIR, backend IR after instruction selection, backend IR after register
+allocation and scheduling, and a disassembly of the final compiled binary.
+
+As another example, this invocation runs a single dEQP test "on" Mali-G52,
+pretty-printing GPU data structures and disassembling all shaders
+(``PAN_MESA_DEBUG=trace``) as well as dumping raw GPU memory
+(``PAN_MESA_DEBUG=dump``). The ``EGL_PLATFORM=surfaceless`` environment variable
+and various flags to dEQP mimic the surfaceless environment that our
+continuous integration (CI) uses. This eliminates window system dependencies,
+although it requires a specially built CTS:
+
+.. code-block:: sh
+
+   ~/VK-GL-CTS/build/external/openglcts/modules$ PAN_MESA_DEBUG=trace,dump \
+   LIBGL_DRIVERS_PATH=~/lib/dri/ \
+   LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
+   PAN_GPU_ID=7212 EGL_PLATFORM=surfaceless \
+   ./glcts --deqp-surface-type=pbuffer \
+   --deqp-gl-config-name=rgba8888d24s8ms0 --deqp-surface-width=256 \
+   --deqp-surface-height=256 -n \
+   dEQP-GLES31.functional.shaders.builtin_functions.common.abs.float_highp_compute
diff --git a/docs/drivers/panfrost/instancing.rst b/docs/drivers/panfrost/instancing.rst
new file mode 100644
index 0000000000000..d4565af315563
--- /dev/null
+++ b/docs/drivers/panfrost/instancing.rst
@@ -0,0 +1,112 @@
+Instancing
+==========
+
+The attribute descriptor lets the attribute unit compute the address of an
+attribute given the vertex and instance ID. Unfortunately, the way this works is
+rather complicated when instancing is enabled.
+
+To explain this, first we need to explain how compute and vertex threads are
+dispatched. When a quad is dispatched, it receives a single, linear index.
+However, we need to translate that index into a (vertex id, instance id) pair.
+One option would be to do:
+
+.. math::
+   \text{vertex id} = \text{linear id} \% \text{num vertices}
+
+   \text{instance id} = \text{linear id} / \text{num vertices}
+
+but this involves a costly division and modulus by an arbitrary number.
+Instead, we could pad num_vertices. We dispatch padded_num_vertices *
+num_instances threads instead of num_vertices * num_instances, which results
+in some "extra" threads with vertex_id >= num_vertices, which we have to
+discard. The more we pad num_vertices, the more "wasted" threads we
+dispatch, but the division is potentially easier.
+
+One straightforward choice is to pad num_vertices to the next power of two,
+which means that the division and modulus are just simple bit shifts and
+masking. But the actual algorithm is a bit more complicated. The thread
+dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
+to dividing by a power of two. As a result, padded_num_vertices can be
+1, 3, 5, 7, or 9 times a power of two. This results in fewer wasted threads,
+since we need less padding.
+
+padded_num_vertices is picked by the hardware. The driver just specifies the
+actual number of vertices. Note that padded_num_vertices is a multiple of four
+(presumably because threads are dispatched in groups of 4). Also,
+padded_num_vertices is always at least one more than num_vertices, which seems
+like a quirk of the hardware. For larger num_vertices, the hardware uses the
+following algorithm: using the binary representation of num_vertices, we look at
+the most significant set bit as well as the following 3 bits. Let n be the
+number of bits after those 4 bits. Then we set padded_num_vertices according to
+the following table:
+
+========== =======================
+high bits  padded_num_vertices
+========== =======================
+1000       :math:`9 \cdot 2^n`
+1001       :math:`5 \cdot 2^{n+1}`
+101x       :math:`3 \cdot 2^{n+2}`
+110x       :math:`7 \cdot 2^{n+1}`
+111x       :math:`2^{n+4}`
+========== =======================
+
+For example, if num_vertices = 70 is passed to glDraw(), its binary
+representation is 1000110, so n = 3 and the high bits are 1000, and
+therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.
+
+The attribute unit works in terms of the original linear_id. If
+num_instances = 1, then they are the same, and everything is simple.
+However, with instancing things get more complicated. There are four
+possible modes, two of which we can group together:
+
+1. Use the linear_id directly. Only used when there is no instancing.
+
+2. Use the linear_id modulo a constant. This is used for per-vertex
+attributes with instancing enabled by making the constant equal
+padded_num_vertices. Because the modulus is always padded_num_vertices, this
+mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
+The shift field specifies the power of two, while the extra_flags field
+specifies the odd number. If shift = n and extra_flags = m, then the modulus
+is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
+computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
+extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
+algorithm used to get padded_num_vertices in order to correctly implement
+per-vertex attributes.
+
+3. Divide the linear_id by a constant. In order to correctly implement
+instance divisors, we have to divide linear_id by padded_num_vertices times
+the user-specified divisor. So first we compute padded_num_vertices, again
+following the exact same algorithm that the hardware uses, then multiply it
+by the GL-level divisor to get the hardware-level divisor. This case is
+further divided into two more cases. If the hardware-level divisor is a
+power of two, then we just need to shift. The shift amount is specified by
+the shift field, so that the hardware-level divisor is just
+:math:`2^\text{shift}`.
+
+If it isn't a power of two, then we have to divide by an arbitrary integer.
+For that, we use the well-known technique of multiplying by an approximation
+of the inverse. The driver must compute the magic multiplier and shift
+amount, and then the hardware does the multiplication and shift. The
+hardware and driver also use the "round-down" optimization as described in
+https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
+The hardware further assumes the multiplier is between :math:`2^{31}` and
+:math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
+to 0 by the driver -- presumably this simplifies the hardware multiplier a
+little. The hardware first multiplies linear_id by the multiplier and
+takes the high 32 bits, then applies the round-down correction if
+extra_flags = 1, then finally shifts right by the shift field.
+
+There are some differences between ridiculousfish's algorithm and the Mali
+hardware algorithm, which means that the reference code from ridiculousfish
+doesn't always produce the right constants. Mali does not use the pre-shift
+optimization, since that would make a hardware implementation slower (it
+would have to always do the pre-shift, multiply, and post-shift operations).
+It also forces the multiplier to be at least :math:`2^{31}`, which means
+that the exponent is entirely fixed, so there is no trial-and-error.
+Altogether, given the divisor d, the algorithm the driver must follow is:
+
+1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
+2. Compute :math:`m = \lceil 2^{\text{shift} + 32} / d \rceil` and :math:`e = 2^{\text{shift} + 32} \% d`.
+3. If :math:`e \le 2^{\text{shift}}`, then we need to use the round-down
+   algorithm. Set magic_divisor = m - 1 and extra_flags = 1.
+4. Otherwise, set magic_divisor = m and extra_flags = 0.
diff --git a/docs/drivers/panfrost/texcomp.rst b/docs/drivers/panfrost/texcomp.rst
new file mode 100644
index 0000000000000..2cb6c9d59a080
--- /dev/null
+++ b/docs/drivers/panfrost/texcomp.rst
@@ -0,0 +1,17 @@
+Compressed texture support
+==========================
+
+In the driver, Panfrost supports ASTC, ETC, and all BCn formats (e.g. RGTC,
+S3TC, etc.). However, Panfrost depends on the hardware to support these formats
+efficiently. All supported Mali architectures support these formats, but not
+every system-on-chip with a Mali GPU supports all of them. Many lower-end
+systems lack support for some BCn formats, which can cause problems when playing
+desktop games with Panfrost. To check whether this issue applies to your
+system-on-chip, Panfrost includes a ``panfrost_texfeatures`` tool to query
+supported formats.
+
+To use this tool, include the option ``-Dtools=panfrost`` when configuring Mesa.
+Then inside your Mesa build directory, the tool is located at
+``src/panfrost/tools/panfrost_texfeatures``. Copy it to your target device,
+mark it executable if necessary, and run it on the target device. A table of
+supported formats will be printed to standard output.
diff --git a/docs/drivers/panfrost/tiling.rst b/docs/drivers/panfrost/tiling.rst
new file mode 100644
index 0000000000000..08c311bd55abd
--- /dev/null
+++ b/docs/drivers/panfrost/tiling.rst
@@ -0,0 +1,38 @@
+
+U-interleaved tiling
+====================
+
+Panfrost supports u-interleaved tiling. U-interleaved tiling is
+indicated by the ``DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED`` modifier.
+
+The tiling reorders whole pixels (blocks). It does not compress or modify the
+pixels themselves, so it can be used for any image format. Internally, images
+are divided into tiles. Tiles occur in source order, but pixels (blocks) within
+each tile are reordered according to a space-filling curve.
+
+For regular formats, 16x16 tiles are used. This harmonizes with the default tile
+size for binning and CRCs (transaction elimination). It also means a single line
+(16 pixels) at 4 bytes per pixel equals a single 64-byte cache line.
+
+For formats that are already block compressed (S3TC, RGTC, etc.), 4x4 tiles are
+used, where entire blocks are reordered. Most of these formats compress 4x4
+blocks, so this gives an effective 16x16 tiling. This justifies the tile size
+intuitively, though it's not a rule: ASTC may use larger blocks.
+
+Within a tile, the X and Y bits are interleaved (like Morton order), but with a
+twist: adjacent bit pairs are XORed. The reason to add XORs is not obvious.
+Visually, addresses take the form::
+
+   | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
+
+Reference routines to encode/decode u-interleaved images are available in
+``src/panfrost/shared/test/test-tiling.cpp``, which documents the space-filling
+curve. This reference implementation is used to unit test the optimized
+implementation used in production. The optimized implementation is available in
+``src/panfrost/shared/pan_tiling.c``.
+
+Although these routines are part of Panfrost, they are also used by Lima, as Arm
+introduced the format with Utgard. It is the only tiling supported on Utgard. On
+Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
+should be used instead where possible. However, not all formats are
+compressible, so u-interleaved tiling remains an important fallback on Panfrost.
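
As a reading aid for the address diagram in the tiling article, here is a minimal Python sketch of the in-tile index for a 16x16 tile. It is derived only from the bit layout shown above, assuming the leftmost field (``y3``) is the most significant bit. The function name ``u_interleaved_index`` is hypothetical and is not taken from ``pan_tiling.c`` or the reference test. The result is an index in pixels (or blocks), not a byte offset.

```python
def u_interleaved_index(x: int, y: int) -> int:
    """Return the slot of pixel/block (x, y) inside one 16x16 tile.

    Hypothetical helper, assuming the bit layout
    | y3 | x3^y3 | y2 | x2^y2 | y1 | x1^y1 | y0 | x0^y0 |
    with y3 as the most significant bit.
    """
    if not (0 <= x < 16 and 0 <= y < 16):
        raise ValueError("x and y must be in the range 0..15")

    xor = x ^ y
    index = 0
    for bit in range(4):
        # Even output bits carry (x ^ y); odd output bits carry y.
        index |= ((xor >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index


# Sanity check: the mapping is a permutation of the 256 slots in a tile.
slots = {u_interleaved_index(x, y) for y in range(16) for x in range(16)}
assert slots == set(range(256))
```

Multiplying the returned index by the bytes per pixel (or per compressed block) would give a byte offset within the tile. Whole tiles are still laid out in source order, as described in the article.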