docs/panfrost: Document u-interleaved tiling

The optimized routine documented the tiling format together with the software
algorithm. The reference implementation wants the tiling format alone
documented. Let's break out the high level documentation into somewhere
centrally accessible, and refocus the comments in the optimized file on the
optimization.

This documentation is linked bidirectionally with both implementations, so it
should be easy to find.

Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15803>
2022-04-14 10:53:08 -04:00
parent bb6c14a697
commit fc1397d1d7
2 changed files with 50 additions and 39 deletions
--- a/docs/drivers/panfrost.rst
+++ b/docs/drivers/panfrost.rst
@@ -54,3 +54,41 @@ Chat
 Panfrost developers and users hang out on IRC at ``#panfrost`` on OFTC. Note
 that registering and authenticating with `NickServ` is required to prevent
 spam. `Join the chat. <https://webchat.oftc.net/?channels=#panfrost>`_
+
+U-interleaved tiling
+---------------------
+
+Panfrost supports u-interleaved tiling. U-interleaved tiling is
+indicated by the ``DRM_FORMAT_MOD_ARM_16X16_BLOCK_U_INTERLEAVED`` modifier.
+
+The tiling reorders whole pixels (blocks). It does not compress or modify the
+pixels themselves, so it can be used for any image format. Internally, images
+are divided into tiles. Tiles occur in source order, but pixels (blocks) within
+each tile are reordered according to a space-filling curve.
+
+For regular formats, 16x16 tiles are used. This harmonizes with the default tile
+size for binning and CRCs (transaction elimination). It also means a single line
+(16 pixels) at 4 bytes per pixel equals a single 64-byte cache line.
+
+For formats that are already block compressed (S3TC, RGTC, etc), 4x4 tiles are
+used, where entire blocks are reordered. Most of these formats compress 4x4
+blocks, so this gives an effective 16x16 tiling. This justifies the tile size
+intuitively, though it's not a rule: ASTC may use larger blocks.
+
+Within a tile, the X and Y bits are interleaved (like Morton order), but with a
+twist: adjacent bit pairs are XORed. The reason to add XORs is not obvious.
+Visually, addresses take the form::
+
+   | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
+
+Reference routines to encode/decode u-interleaved images are available in
+``src/panfrost/shared/test/test-tiling.cpp``, which documents the space-filling
+curve. This reference implementation is used to unit test the optimized
+implementation used in production. The optimized implementation is available in
+``src/panfrost/shared/pan_tiling.c``.
+
+Although these routines are part of Panfrost, they are also used by Lima, as Arm
+introduced the format with Utgard. It is the only tiling supported on Utgard. On
+Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
+should be used instead where possible. However, not all formats are
+compressible, so u-interleaved tiling remains an important fallback on Panfrost.
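As an illustration of the address layout documented above, the following is a minimal sketch of how the index of a pixel within one 16x16 tile could be computed directly from the bit pattern. It is not the reference code from ``src/panfrost/shared/test/test-tiling.cpp``; the names ``u_interleaved_index`` and ``check_permutation`` are hypothetical, and the sketch assumes the pattern applies to the low 4 bits of X and Y with bit 0 of the index on the right.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not a Mesa function: index of pixel (x, y) inside a
 * single 16x16 tile, built bit by bit from the documented layout
 *
 *    | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
 *
 * Only the low 4 bits of x and y are used. */
static unsigned
u_interleaved_index(unsigned x, unsigned y)
{
   unsigned index = 0;

   for (unsigned i = 0; i < 4; ++i) {
      unsigned xb = (x >> i) & 1;
      unsigned yb = (y >> i) & 1;

      index |= (xb ^ yb) << (2 * i);   /* even bit: x_i ^ y_i */
      index |= yb << (2 * i + 1);      /* odd bit: y_i */
   }

   return index;
}

/* Sanity check: the mapping visits each of the 256 positions in a tile
 * exactly once, i.e. it only reorders pixels. */
static void
check_permutation(void)
{
   bool seen[256] = { false };

   for (unsigned y = 0; y < 16; ++y) {
      for (unsigned x = 0; x < 16; ++x) {
         unsigned index = u_interleaved_index(x, y);

         assert(index < 256);
         assert(!seen[index]);
         seen[index] = true;
      }
   }
}
```

Under these assumptions, the top-left 2x2 corner of a tile is stored in the order (0,0), (1,0), (1,1), (0,1), since the XOR flips the X direction on odd rows.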
--- a/src/panfrost/shared/pan_tiling.c
+++ b/src/panfrost/shared/pan_tiling.c
@@ -30,53 +30,26 @@
 #include "util/macros.h"
 #include "util/bitscan.h"

-/* This file implements software encode/decode of the tiling format used for
- * textures and framebuffers primarily on Utgard GPUs. Names for this format
- * include "Utgard-style tiling", "(Mali) swizzled textures", and
- * "U-interleaved" (the former two names being used in the community
- * Lima/Panfrost drivers; the latter name used internally at Arm).
- * Conceptually, like any tiling scheme, the pixel reordering attempts to 2D
- * spatial locality, to improve cache locality in both horizontal and vertical
- * directions.
+/*
+ * This file implements software encode/decode of u-interleaved textures.
+ * See docs/drivers/panfrost.rst for details on the format.
 *
- * This format is tiled: first, the image dimensions must be aligned to 16
- * pixels in each axis. Once aligned, the image is divided into 16x16 tiles.
- * This size harmonizes with other properties of the GPU; on Midgard,
- * framebuffer tiles are logically 16x16 (this is the tile size used in
- * Transaction Elimination and the minimum tile size used in Hierarchical
- * Tiling). Conversely, for a standard 4 bytes-per-pixel format (like
- * RGBA8888), 16 pixels * 4 bytes/pixel = 64 bytes, equal to the cache line
- * size.
+ * The tricky bit is ordering along the space-filling curve:
 *
- * Within each 16x16 block, the bits are reordered according to this pattern:
+ *    | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
 *
- * | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
- *
- * Basically, interleaving the X and Y bits, with XORs thrown in for every
- * adjacent bit pair.
- *
- * This is cheap to implement both encode/decode in both hardware and software.
- * In hardware, lines are simply rerouted to reorder and some XOR gates are
- * thrown in. Software has to be a bit more clever.
- *
- * In software, the trick is to divide the pattern into two lines:
+ * While interleaving bits is trivial in hardware, it is nontrivial in software.
+ * The trick is to divide the pattern up:
 *
 *    | y3 | y3 | y2 | y2 | y1 | y1 | y0 | y0 |
 *  ^ |  0 | x3 |  0 | x2 |  0 | x1 |  0 | x0 |
 *
- * That is, duplicate the bits of the Y and space out the bits of the X. The
- * top line is a function only of Y, so it can be calculated once per row and
- * stored in a register. The bottom line is simply X with the bits spaced out.
- * Spacing out the X is easy enough with a LUT, or by subtracting+ANDing the
- * mask pattern (abusing carry bits).
+ * That is, duplicate the bits of the Y and space out the bits of the X. The top
+ * line is a function only of Y, so it can be calculated once per row and stored
+ * in a register. The bottom line is simply X with the bits spaced out. Spacing
+ * out the X is easy enough with a LUT, or by subtracting+ANDing the mask
+ * pattern (abusing carry bits).
 *
- * This format is also supported on Midgard GPUs, where it *can* be used for
- * textures and framebuffers. That said, in practice it is usually as a
- * fallback layout; Midgard introduces Arm FrameBuffer Compression, which is
- * significantly more efficient than Utgard-style tiling and preferred for both
- * textures and framebuffers, where possible. For unsupported texture types,
- * for instance sRGB textures and framebuffers, this tiling scheme is used at a
- * performance penalty, as AFBC is not compatible.
 */
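The split described in the comment above can be sketched as follows. This is an illustrative sketch only, not the optimized code in this file: the names ``duplicate_y_bits``, ``space_x_bits`` and ``tile_row_indices`` are hypothetical, and it assumes a single 16x16 tile with 8-bit indices. The Y term is computed once per row, and the spaced-out X term is stepped with the subtract-and-AND trick against the mask ``0x55``.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: duplicate each of the low 4 bits of y, giving
 * | y3 | y3 | y2 | y2 | y1 | y1 | y0 | y0 | */
static unsigned
duplicate_y_bits(unsigned y)
{
   unsigned out = 0;

   for (unsigned i = 0; i < 4; ++i)
      out |= ((y >> i) & 1) * (3u << (2 * i));

   return out;
}

/* Hypothetical helper: space out the low 4 bits of x, giving
 * | 0 | x3 | 0 | x2 | 0 | x1 | 0 | x0 | */
static unsigned
space_x_bits(unsigned x)
{
   unsigned out = 0;

   for (unsigned i = 0; i < 4; ++i)
      out |= ((x >> i) & 1) << (2 * i);

   return out;
}

/* Fill indices[x] with the in-tile index of pixel (x, y) for one row. */
static void
tile_row_indices(unsigned y, uint8_t indices[16])
{
   unsigned y_part = duplicate_y_bits(y);   /* depends only on y */
   unsigned x_part = 0;                     /* spaced-out x, starting at x = 0 */

   for (unsigned x = 0; x < 16; ++x) {
      indices[x] = (uint8_t)(y_part ^ x_part);

      /* Advance to the spaced-out form of x + 1. Subtracting the mask
       * makes the carry ripple through the zero gaps; ANDing clears them. */
      x_part = (x_part - 0x55) & 0x55;
   }
}
```

XORing the two lines reproduces the interleaved pattern: the odd bits are ``y_i ^ 0`` and the even bits are ``y_i ^ x_i``. The stepping rule avoids a lookup table in this sketch; a LUT indexed by ``x`` and filled from ``space_x_bits`` would be an equally valid choice.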

 /* Given the lower 4-bits of the Y coordinate, we would like to