diff --git a/docs/gallium/buffermapping.rst b/docs/gallium/buffermapping.rst new file mode 100644 index 00000000000..b876ad971e9 --- /dev/null +++ b/docs/gallium/buffermapping.rst @@ -0,0 +1,414 @@ +Buffer mapping patterns +----------------------- + +There are two main strategies the driver has for CPU access to GL buffer +objects. One is that the GL calls allocate temporary storage and blit to the GPU +at +``glBufferSubData()``/``glBufferData()``/``glFlushMappedBufferRange()``/``glUnmapBuffer()`` +time. This makes the behavior easily match. However, this may be more costly +than direct mapping of the GL BO on some platforms, and is essentially not +available to tiling GPUs (since tiling involves running through the command +stream multiple times). Thus, GL has additional interfaces to help make it so +apps can directly access memory while avoiding implicit blocking on the GPU +rendering from those BOs. + +Rendering engines have a variety of knobs to set on those GL interfaces for data +upload, and as a whole they seem to take just about every path available. Let's +look at some examples to see how they might constrain GL driver buffer upload +behavior. + +Portal 2 +======== + +.. code-block:: console: + + 1030842 glXSwapBuffers(dpy = 0x82a8000, drawable = 20971540) + 1030876 glBufferDataARB(target = GL_ELEMENT_ARRAY_BUFFER, size = 65536, data = NULL, usage = GL_DYNAMIC_DRAW) + 1030877 glBufferSubData(target = GL_ELEMENT_ARRAY_BUFFER, offset = 0, size = 576, data = blob(576)) + 1030896 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 526, count = 252, type = GL_UNSIGNED_SHORT, indices = NULL, basevertex = 0) + 1030915 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 19657, count = 36, type = GL_UNSIGNED_SHORT, indices = 0x1f8, basevertex = 0) + 1030917 glBufferDataARB(target = GL_ARRAY_BUFFER, size = 1572864, data = NULL, usage = GL_DYNAMIC_DRAW) + 1030918 glBufferSubData(target = GL_ARRAY_BUFFER, offset = 0, size = 128, data = blob(128)) + 1030919 glBufferSubData(target = GL_ELEMENT_ARRAY_BUFFER, offset = 576, size = 12, data = blob(12)) + 1030936 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 3, count = 6, type = GL_UNSIGNED_SHORT, indices = 0x240, basevertex = 0) + 1030937 glBufferSubData(target = GL_ARRAY_BUFFER, offset = 128, size = 128, data = blob(128)) + 1030938 glBufferSubData(target = GL_ELEMENT_ARRAY_BUFFER, offset = 588, size = 12, data = blob(12)) + 1030940 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 4, end = 7, count = 6, type = GL_UNSIGNED_SHORT, indices = 0x24c, basevertex = 0) + [... repeated draws at increasing offsets] + 1033097 glXSwapBuffers(dpy = 0x82a8000, drawable = 20971540) + +From this sequence, we can see that it is important that the driver either +implement ``glBufferSubData()`` as a blit from a streaming uploader in sequence with +the ``glDraw*()`` calls (a common behavior for non-tiled GPUs, particularly those with +dedicated memory), or that you: + +1) Track the valid range of the buffer so that you don't have to flush the draws + and synchronize on each following ``glBufferSubData()``. + +2) Reallocate the buffer storage on ``glBufferData`` so that your first + ``glBufferSubData()`` of the frame doesn't stall on the last frame's + rendering completing. + +You can't just empty your valid range on ``glBufferData()`` unless you know that +the GPU access from the previous frame has completed. This pattern of +incrementing ``glBufferSubData()`` offsets interleaved with draws from that data +is common among newer Valve games. + +.. code-block:: console: + [ during setup ] + + 679259 glGenBuffersARB(n = 1, buffers = &1314) + 679260 glBindBufferARB(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 1314) + 679261 glBufferDataARB(target = GL_ELEMENT_ARRAY_BUFFER, size = 3072, data = NULL, usage = GL_STATIC_DRAW) + 679264 glMapBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 0, length = 3072, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT) = 0xd7384000 + 679269 glFlushMappedBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 0, length = 3072) + 679270 glUnmapBuffer(target = GL_ELEMENT_ARRAY_BUFFER) = GL_TRUE + + [... setup of other buffers on this binding point] + + 679343 glBindBufferARB(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 1314) + 679344 glMapBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 0, length = 768, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT) = 0xd7384000 + 679346 glFlushMappedBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 0, length = 768) + 679347 glUnmapBuffer(target = GL_ELEMENT_ARRAY_BUFFER) = GL_TRUE + 679348 glMapBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 768, length = 768, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT) = 0xd7384300 + 679350 glFlushMappedBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 0, length = 768) + 679351 glUnmapBuffer(target = GL_ELEMENT_ARRAY_BUFFER) = GL_TRUE + 679352 glMapBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 1536, length = 768, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT) = 0xd7384600 + 679354 glFlushMappedBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 0, length = 768) + 679355 glUnmapBuffer(target = GL_ELEMENT_ARRAY_BUFFER) = GL_TRUE + 679356 glMapBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 2304, length = 768, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT) = 0xd7384900 + 679358 glFlushMappedBufferRange(target = GL_ELEMENT_ARRAY_BUFFER, offset = 0, length = 768) + 679359 glUnmapBuffer(target = GL_ELEMENT_ARRAY_BUFFER) = GL_TRUE + + [... setup completes and we start drawing later] + + 761845 glBindBufferARB(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 1314) + 761846 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 323, count = 384, type = GL_UNSIGNED_SHORT, indices = NULL, basevertex = 0) + +This suggests that, for non-blitting drivers, resetting your "might be used on +the GPU" range after a stall could save you a bunch of additional GPU stalls +during setup. + +Terraria +======== + +.. code-block:: console: + + 167581 glXSwapBuffers(dpy = 0x3004630, drawable = 25165844) + + 167585 glBufferData(target = GL_ARRAY_BUFFER, size = 196608, data = NULL, usage = GL_STREAM_DRAW) + 167586 glBufferSubData(target = GL_ARRAY_BUFFER, offset = 0, size = 1728, data = blob(1728)) + 167588 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 71, count = 108, type = GL_UNSIGNED_SHORT, indices = NULL, basevertex = 0) + 167589 glBufferData(target = GL_ARRAY_BUFFER, size = 196608, data = NULL, usage = GL_STREAM_DRAW) + 167590 glBufferSubData(target = GL_ARRAY_BUFFER, offset = 0, size = 27456, data = blob(27456)) + 167592 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 7, count = 12, type = GL_UNSIGNED_SHORT, indices = NULL, basevertex = 0) + 167594 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 3, count = 6, type = GL_UNSIGNED_SHORT, indices = NULL, basevertex = 8) + 167596 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 3, count = 6, type = GL_UNSIGNED_SHORT, indices = NULL, basevertex = 12) + [...] + +In this game, we can see ``glBufferData()`` being used on the same array buffer +throughout, to get new storage so that the ``glBufferSubData()`` doesn't cause +synchronization. + +Don't Starve +============ + +.. code-block:: console: + + 7251917 glGenBuffers(n = 1, buffers = &115052) + 7251918 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 115052) + 7251919 glBufferData(target = GL_ARRAY_BUFFER, size = 144, data = blob(144), usage = GL_STREAM_DRAW) + 7251921 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 115052) + 7251928 glDrawArrays(mode = GL_TRIANGLES, first = 0, count = 6) + 7251930 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 114872) + 7251936 glDrawArrays(mode = GL_TRIANGLES, first = 0, count = 18) + 7251938 glGenBuffers(n = 1, buffers = &115053) + 7251939 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 115053) + 7251940 glBufferData(target = GL_ARRAY_BUFFER, size = 144, data = blob(144), usage = GL_STREAM_DRAW) + 7251942 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 115053) + 7251949 glDrawArrays(mode = GL_TRIANGLES, first = 0, count = 6) + 7251973 glXSwapBuffers(dpy = 0x86dd860, drawable = 20971540) + [... drawing next frame] + 7252388 glDeleteBuffers(n = 1, buffers = &115052) + 7252389 glDeleteBuffers(n = 1, buffers = &115053) + 7252390 glXSwapBuffers(dpy = 0x86dd860, drawable = 20971540) + +In this game we have a lot of tiny ``glBufferData()`` calls, suggesting that we +could see working set wins and possibly CPU overhead reduction by packing small +GL buffers in the same BO. Interestingly, the deletes of the temporary buffers +always happen at the end of the next frame. + +Euro Truck Simulator +==================== + +.. code-block:: console: + + [usage of VBO 14,15] + [...] + 885199 glXSwapBuffers(dpy = 0x379a3e0, drawable = 20971527) + 885203 glInvalidateBufferData(buffer = 14) + 885204 glInvalidateBufferData(buffer = 15) + [...] + 889330 glXSwapBuffers(dpy = 0x379a3e0, drawable = 20971527) + 889334 glInvalidateBufferData(buffer = 12) + 889335 glInvalidateBufferData(buffer = 16) + [...] + 893461 glXSwapBuffers(dpy = 0x379a3e0, drawable = 20971527) + 893462 glClientWaitSync(sync = 0x77eee10, flags = 0x0, timeout = 0) = GL_ALREADY_SIGNALED + 893463 glDeleteSync(sync = 0x780a630) + 893464 glFenceSync(condition = GL_SYNC_GPU_COMMANDS_COMPLETE, flags = 0) = 0x78ec730 + 893465 glInvalidateBufferData(buffer = 13) + 893466 glInvalidateBufferData(buffer = 17) + 893505 glBindBuffer(target = GL_COPY_READ_BUFFER, buffer = 14) + 893506 glMapBufferRange(target = GL_COPY_READ_BUFFER, offset = 0, length = 788, access = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7b034efd1000 + 893508 glUnmapBuffer(target = GL_COPY_READ_BUFFER) = GL_TRUE + 893509 glBindBuffer(target = GL_COPY_READ_BUFFER, buffer = 15) + 893510 glMapBufferRange(target = GL_COPY_READ_BUFFER, offset = 0, length = 32, access = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7b034e5df000 + 893512 glUnmapBuffer(target = GL_COPY_READ_BUFFER) = GL_TRUE + 893532 glBindVertexBuffers(first = 0, count = 2, buffers = {10, 15}, offsets = {0, 0}, strides = {52, 16}) + 893552 glDrawElementsInstancedBaseVertex(mode = GL_TRIANGLES, count = 18, type = GL_UNSIGNED_SHORT, indices = 0x13f280, instancecount = 1, basevertex = 25131) + 893609 glDrawArrays(mode = GL_TRIANGLES, first = 0, count = 6) + 893732 glBindVertexBuffers(first = 0, count = 1, buffers = &14, offsets = &0, strides = &48) + 893733 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 14) + 893744 glDrawElementsBaseVertex(mode = GL_TRIANGLES, count = 6, type = GL_UNSIGNED_SHORT, indices = 0xf0, basevertex = 0) + 893759 glDrawElementsBaseVertex(mode = GL_TRIANGLES, count = 24, type = GL_UNSIGNED_SHORT, indices = 0x2e0, basevertex = 6) + 893786 glDrawElementsBaseVertex(mode = GL_TRIANGLES, count = 600, type = GL_UNSIGNED_SHORT, indices = 0xe87b0, basevertex = 21515) + 893822 glDrawArrays(mode = GL_TRIANGLES, first = 0, count = 6) + 893845 glBindBuffer(target = GL_COPY_READ_BUFFER, buffer = 14) + 893846 glMapBufferRange(target = GL_COPY_READ_BUFFER, offset = 788, length = 788, access = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7b034efd1314 + 893848 glUnmapBuffer(target = GL_COPY_READ_BUFFER) = GL_TRUE + 893886 glDrawElementsInstancedBaseVertex(mode = GL_TRIANGLES, count = 18, type = GL_UNSIGNED_SHORT, indices = 0x13f280, instancecount = 1, basevertex = 25131) + 893943 glDrawArrays(mode = GL_TRIANGLES, first = 0, count = 6) + +At the start of this frame, buffer 14 and 15 haven't been used in the previous 2 +frames, and the ``GL_ARB_sync`` fence has ensured that the GPU has at least started +frame n-1 as the CPU starts the current frame. The first map is ``offset = 0, +INVALIDATE_BUFFER | UNSYNCHRONIZED``, which suggests that the driver should +reallocate storage for the mapping even in the ``UNSYNCHRONIZED`` case, except +that the buffer is definitely going to be idle, making reallocation unnecessary +(you may need to empty your valid range, though, to prevent unnecessary batch +flushes). + +Also note the use of a totally unrelated binding point for the mapping of the +vertex array -- you can't effectively use it as a hint for any buffer placement +in memory. The game does also use ``glCopyBufferSubData()``, but only on a +different buffer. + + +Plague Inc +========== + +.. code-block:: console: + + 1640732 glXSwapBuffers(dpy = 0xb218f20, drawable = 23068674) + 1640733 glClientWaitSync(sync = 0xb4141430, flags = 0x0, timeout = 0) = GL_ALREADY_SIGNALED + 1640734 glDeleteSync(sync = 0xb4141430) + 1640735 glFenceSync(condition = GL_SYNC_GPU_COMMANDS_COMPLETE, flags = 0) = 0xb4141430 + + 1640780 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 78) + 1640787 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 79) + 1640788 glDrawElements(mode = GL_TRIANGLES, count = 9636, type = GL_UNSIGNED_SHORT, indices = NULL) + 1640795 glDrawElements(mode = GL_TRIANGLES, count = 9636, type = GL_UNSIGNED_SHORT, indices = NULL) + 1640813 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 1096) + 1640814 glMapBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 67584, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0xbfef4000 + 1640815 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 1091) + 1640816 glMapBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 12, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0xc3998000 + 1640817 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 1096) + 1640819 glFlushMappedBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 352) + 1640820 glUnmapBuffer(target = GL_COPY_WRITE_BUFFER) = GL_TRUE + 1640821 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 1091) + 1640823 glFlushMappedBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 12) + 1640824 glUnmapBuffer(target = GL_COPY_WRITE_BUFFER) = GL_TRUE + 1640825 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 1096) + 1640831 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 1091) + 1640832 glDrawElements(mode = GL_TRIANGLES, count = 6, type = GL_UNSIGNED_SHORT, indices = NULL) + + 1640847 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 1096) + 1640848 glMapBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 352, length = 67584, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0xbfef4160 + 1640849 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 1091) + 1640850 glMapBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 88, length = 12, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0xc3998058 + 1640851 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 1096) + 1640853 glFlushMappedBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 352) + 1640854 glUnmapBuffer(target = GL_COPY_WRITE_BUFFER) = GL_TRUE + 1640855 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 1091) + 1640857 glFlushMappedBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 12) + 1640858 glUnmapBuffer(target = GL_COPY_WRITE_BUFFER) = GL_TRUE + 1640863 glDrawElementsBaseVertex(mode = GL_TRIANGLES, count = 6, type = GL_UNSIGNED_SHORT, indices = 0x58, basevertex = 4) + +At the start of this frame, the VBOs haven't been used in about 6 frames, and +the ``GL_ARB_sync`` fence has ensured that the GPU has started frame n-1. + +Note the use of ``glFlushMappedBufferRange()`` on a small fraction of the size +of the VBO -- it is important that a blitting driver make use of the flush +ranges when in explicit mode. + +Darkest Dungeon +=============== + +.. code-block:: console: + + 938384 glXSwapBuffers(dpy = 0x377fcd0, drawable = 23068692) + + 938385 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 2) + 938386 glBufferData(target = GL_ARRAY_BUFFER, size = 1048576, data = NULL, usage = GL_STREAM_DRAW) + 938511 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 2) + 938512 glMapBufferRange(target = GL_ARRAY_BUFFER, offset = 0, length = 1048576, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7a73fcaa7000 + 938514 glFlushMappedBufferRange(target = GL_ARRAY_BUFFER, offset = 0, length = 512) + 938515 glUnmapBuffer(target = GL_ARRAY_BUFFER) = GL_TRUE + 938523 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 1) + 938524 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 2) + 938525 glDrawElements(mode = GL_TRIANGLES, count = 24, type = GL_UNSIGNED_SHORT, indices = NULL) + 938527 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 2) + 938528 glMapBufferRange(target = GL_ARRAY_BUFFER, offset = 0, length = 1048576, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7a73fcaa7000 + 938530 glFlushMappedBufferRange(target = GL_ARRAY_BUFFER, offset = 512, length = 512) + 938531 glUnmapBuffer(target = GL_ARRAY_BUFFER) = GL_TRUE + 938539 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 1) + 938540 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 2) + 938541 glDrawElements(mode = GL_TRIANGLES, count = 24, type = GL_UNSIGNED_SHORT, indices = 0x30) + [... more maps and draws at increasing offsets] + +Interesting note for this game, after the initial ``glBufferData()`` in the +frame to reallocate the storage, it unsync maps the whole buffer each time, and +just changes which region it flushes. The same GL buffer name is used in every +frame. + +Tabletop Simulator +================== + +.. code-block:: console: + + 1287594 glXSwapBuffers(dpy = 0x3e10810, drawable = 23068692) + 1287595 glClientWaitSync(sync = 0x7abf554e37b0, flags = 0x0, timeout = 0) = GL_ALREADY_SIGNALED + 1287596 glDeleteSync(sync = 0x7abf554e37b0) + 1287597 glFenceSync(condition = GL_SYNC_GPU_COMMANDS_COMPLETE, flags = 0) = 0x7abf56647490 + + 1287614 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 480) + 1287615 glMapBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 384, access = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7abf2e79a000 + 1287642 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 614) + 1287650 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 5) + 1287651 glBufferSubData(target = GL_COPY_WRITE_BUFFER, offset = 0, size = 1088, data = blob(1088)) + 1287652 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 615) + 1287653 glDrawElements(mode = GL_TRIANGLES, count = 1788, type = GL_UNSIGNED_SHORT, indices = NULL) + [... more draw calls] + 1289055 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 480) + 1289057 glFlushMappedBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 384) + 1289058 glUnmapBuffer(target = GL_COPY_WRITE_BUFFER) = GL_TRUE + 1289059 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 480) + 1289066 glDrawArrays(mode = GL_TRIANGLE_STRIP, first = 12, count = 4) + 1289068 glDrawArrays(mode = GL_TRIANGLE_STRIP, first = 8, count = 4) + 1289553 glXSwapBuffers(dpy = 0x3e10810, drawable = 23068692) + +In this app, buffer 480 gets used like this every other frame. The ``GL_ARB_sync`` +fence ensures that frame n-1 has started on the GPU before CPU work starts on +the current frame, so the unsynchronized access to the buffers is safe. + +Hollow Knight +============= + +.. code-block:: console: + + 1873034 glXSwapBuffers(dpy = 0x28609d0, drawable = 23068692) + 1873035 glClientWaitSync(sync = 0x7b1a5ca6e130, flags = 0x0, timeout = 0) = GL_ALREADY_SIGNALED + 1873036 glDeleteSync(sync = 0x7b1a5ca6e130) + 1873037 glFenceSync(condition = GL_SYNC_GPU_COMMANDS_COMPLETE, flags = 0) = 0x7b1a5ca6e130 + 1873038 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 29) + 1873039 glMapBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 8640, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7b1a04c7e000 + 1873040 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 30) + 1873041 glMapBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 720, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7b1a07430000 + 1873065 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 29) + 1873067 glFlushMappedBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 8640) + 1873068 glUnmapBuffer(target = GL_COPY_WRITE_BUFFER) = GL_TRUE + 1873069 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 30) + 1873071 glFlushMappedBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 720) + 1873072 glUnmapBuffer(target = GL_COPY_WRITE_BUFFER) = GL_TRUE + 1873073 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 29) + 1873074 glMapBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 8640, length = 576, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7b1a04c801c0 + 1873075 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 30) + 1873076 glMapBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 720, length = 72, access = GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT | GL_MAP_UNSYNCHRONIZED_BIT) = 0x7b1a074302d0 + 1873077 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 29) + 1873079 glFlushMappedBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 576) + 1873080 glUnmapBuffer(target = GL_COPY_WRITE_BUFFER) = GL_TRUE + 1873081 glBindBuffer(target = GL_COPY_WRITE_BUFFER, buffer = 30) + 1873083 glFlushMappedBufferRange(target = GL_COPY_WRITE_BUFFER, offset = 0, length = 72) + 1873084 glUnmapBuffer(target = GL_COPY_WRITE_BUFFER) = GL_TRUE + 1873085 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 29) + 1873096 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 30) + 1873097 glDrawElementsBaseVertex(mode = GL_TRIANGLES, count = 36, type = GL_UNSIGNED_SHORT, indices = 0x2d0, basevertex = 240) + +In this app, buffer 29/30 get used like this starting from offset 0 every other +frame. The ``GL_ARB_sync`` fence is used to make sure that the GPU has reached the +start of the previous frame before we go unsynchronized writing over the n-2 +frame's buffer. + +Borderlands 2 +============= + +.. code-block:: console: + + 3561998 glFlush() + 3562004 glXSwapBuffers(dpy = 0xbaf0f90, drawable = 23068705) + 3562006 glClientWaitSync(sync = 0x231c2ab0, flags = GL_SYNC_FLUSH_COMMANDS_BIT, timeout = 10000000000) = GL_ALREADY_SIGNALED + 3562007 glDeleteSync(sync = 0x231c2ab0) + 3562008 glFenceSync(condition = GL_SYNC_GPU_COMMANDS_COMPLETE, flags = 0) = 0x231aadc0 + + 3562050 glBindBufferARB(target = GL_ARRAY_BUFFER, buffer = 1193) + 3562051 glMapBufferRange(target = GL_ARRAY_BUFFER, offset = 0, length = 1792, access = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT) = 0xde056000 + 3562053 glUnmapBufferARB(target = GL_ARRAY_BUFFER) = GL_TRUE + 3562054 glBindBufferARB(target = GL_ARRAY_BUFFER, buffer = 1194) + 3562055 glMapBufferRange(target = GL_ARRAY_BUFFER, offset = 0, length = 1280, access = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT) = 0xd9426000 + 3562057 glUnmapBufferARB(target = GL_ARRAY_BUFFER) = GL_TRUE + [... unrelated draws] + 3563051 glBindBufferARB(target = GL_ARRAY_BUFFER, buffer = 1193) + 3563064 glBindBufferARB(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 875) + 3563065 glDrawElementsInstancedARB(mode = GL_TRIANGLES, count = 72, type = GL_UNSIGNED_SHORT, indices = NULL, instancecount = 28) + +The ``GL_ARB_sync`` fence ensures that the GPU has started frame n-1 before the CPU +starts on the current frame. + +This sequence of buffer uploads appears in each frame with the same buffer +names, so you do need to handle the ``GL_MAP_INVALIDATE_BUFFER_BIT`` as a +reallocate if the buffer is GPU-busy (it wasn't in this trace capture) to avoid +stalls on the n-1 frame completing. + +Note that this is just one small buffer. Most of the vertex data goes through a +``glBufferSubData()``/``glDraw*()`` path with the VBO used across multiple +frames, with a ``glBufferData()`` when needing to wrap. + +Buffer mapping conclusions +-------------------------- + +* Non-blitting drivers must track the valid range of a freshly allocated buffer + as it gets uploaded in ``pipe_transfer_map()`` and avoid stalling on the GPU + when mapping an undefined portion of the buffer when ``glBufferSubData()`` is + interleaved with drawing. + +* Non-blitting drivers must reallocate storage on ``glBufferData(NULL)`` so that + the following ``glBufferSubData()`` won't stall. That ``glBufferData(NULL)`` + call will appear in the driver as an ``invalidate_resource()`` call if + ``PIPE_CAP_INVALIDATE_BUFFER`` is available. (If that flag is not set, then + mesa/st will create a new pipe_resource for you). Storage reallocation may be + skipped if you for some reason know that the buffer is idle, in which case you + can just empty the valid region. + +* Blitting drivers must use the ``transfer_flush_region()`` region + instead of the mapped range when ``PIPE_MAP_FLUSH_EXPLICIT`` is set, to avoid + blitting too much data. (When that bit is unset, you just blit the whole + mapped range at unmap time.) + +* Buffer valid range tracking in non-blitting drivers must use the + ``transfer_flush_region()`` region instead of the mapped range when + ``PIPE_MAP_FLUSH_EXPLICIT`` is set, to avoid excess stalls. + +* Buffer valid range tracking doesn't need to be fancy, "number of bytes + valid starting from 0" is sufficient for all examples found. + +* Use the ``pipe_debug_callback`` to report stalls on buffer mapping to ease + debug. + +* Buffer binding points are not useful for tuning buffer placement (See all the + ``PIPE_COPY_WRITE_BUFFER`` instances), you have to track the actual usage + history of a GL BO name. mesa/st does this for optimizing its state updates + on reallocation in the ``!PIPE_CAP_INVALIDATE_BUFFER`` case, and if you set + ``PIPE_CAP_INVALIDATE_BUFFER`` then you have to flag your own internal state + updates (VBO addresses, XFB addresses, texture buffer addresses, etc.) on + reallocation based on usage history. diff --git a/docs/gallium/index.rst b/docs/gallium/index.rst index 82dae274a54..6656bd6cf13 100644 --- a/docs/gallium/index.rst +++ b/docs/gallium/index.rst @@ -14,6 +14,7 @@ Contents: format context cso + buffermapping distro postprocess glossary