In looking at the profile of dEQP, GLES3 was spending 5-10% of its time in
ReadPixels, and almost all of that is b8g8r8a8_unorm8. It's really slow
because we're getting about 47MB/s by doing uncached reads 32 bits at a
time in the code-generated unpack. If we use NEON to generate larger bus
transactions, we can speed things up to 136MB/s. In comparison, raw
ldr/str read/writes with no byte swapping can hit a max of 216MB/sec.
Reviewed-by: Jesse Natalie <jenatali@microsoft.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10014>