Buffer lifetime tracking and zero-copy XShm buffers on GNU/Linux #539
jholveck wants to merge 2 commits into
Summary
=======

I’ve spoken at length about the importance of avoiding copies. This PR eliminates the remaining (CPU-side) copy: copying from the OS-supplied buffer to a Python bytes / bytearray object.

We introduce buffer lifetime tracking in MSS so backends can safely reclaim or reuse screenshot buffers when downstream consumers are truly done with them. On GNU/Linux, the XShmGetImage backend now uses that mechanism to enable zero-copy screenshot buffers on Python 3.12 and newer.

Benchmark
---------

I ran a benchmark on my home computer (Ryzen 7 2700X, B-450 chipset, DDR4-2133, RTX 3090). I captured 1000 iterations of 3840×2160 screenshots as quickly as possible while forcing all pixel data to be read (using a NumPy sum). I ran A/B tests with the new feature enabled and disabled, taking the best of three runs for each.

Capture time decreased from 22.64 ms to 18.59 ms per frame. Put differently, this is from 44 FPS to 54 FPS. That is approximately 18% less time per frame, or about 22% more frames per second.

Why
===

Previously, backends had to copy screenshot data into fresh Python-owned buffers to avoid reusing memory that might still be referenced by NumPy, Pillow, or other buffer consumers. That copy cost is significant for large captures and high frame-rate use cases.

What changed
============

Internal infrastructure
-----------------------

This change keeps MSS user-facing behavior the same while improving backend memory handling and performance.

- Added new buffer-finalization plumbing that lets backends attach a finalizer to screenshot buffer ownership.
- Updated core typing/contracts so grab can return generic buffer-compatible objects. (The user-facing contracts were updated in BoboTiG#521 and others, but this updates the internal contracts.)
- Expanded documentation and release notes to explain direct buffer behavior and platform/version scope.
- Updated packaging tests and test dependencies to include new buffer and integration test coverage.

XShmGetImage backend
--------------------

- Reworked the GNU/Linux XShmGetImage backend to use a reusable SHM slot pool with dynamic growth and finalizer-driven slot return.
- Added shutdown/cleanup safeguards so slot destruction and connection shutdown are coordinated safely, including finalizer interactions.
- Kept fallback behavior intact when MIT-SHM is unavailable or unsuitable.

Behavior by runtime
===================

- Python 3.12+ on GNU/Linux XShm backend:
  - zero-copy buffer exposure from SHM-backed storage
  - SHM slot is released when downstream buffer users release it
- Pre-3.12:
  - copy-based behavior is retained
  - finalization happens immediately after copy

Testing
=======

- Added focused unit tests for buffer-finalizer semantics, including fast and slow paths and downstream memoryview trees.
- Added GNU/Linux backend lifecycle tests covering:
  - release on normal finalization
  - failure while wrapping finalizing buffers
  - finalization after close
  - pre-3.12 immediate-finalization behavior
  - dynamic SHM pool growth failure behavior
  - threaded release during close to validate shutdown race protections

Notes for maintainers
=====================

- This PR is intentionally backend-agnostic at the plumbing layer, with initial zero-copy adoption in GNU/Linux XShmGetImage.
- The design keeps the existing user-facing API, while making buffer lifetime explicit for backend resource management.
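For readers who want a concrete picture of finalizer-driven slot return, here is a minimal sketch. It is not the code in this PR: `SlotPool`, `LeasedBuffer`, and `grab` are hypothetical names, a plain `bytearray` stands in for an SHM segment, and the sketch assumes the lifetime tracking is built on the Python 3.12 buffer hooks `__buffer__` / `__release_buffer__` (PEP 688), which would explain the 3.12 requirement. On older interpreters it falls back to copy-then-release, mirroring the "Behavior by runtime" list.

```python
import sys


class SlotPool:
    """Hypothetical pool of reusable buffers (stand-ins for SHM slots)."""

    def __init__(self, slot_size: int) -> None:
        self.slot_size = slot_size
        self.free: list[bytearray] = []
        self.created = 0

    def acquire(self) -> bytearray:
        if self.free:
            return self.free.pop()
        self.created += 1  # dynamic growth: allocate only when nothing is free
        return bytearray(self.slot_size)

    def release(self, slot: bytearray) -> None:
        self.free.append(slot)


class LeasedBuffer:
    """Exports a slot through the buffer protocol and returns the slot
    to its pool once the last export has been released (Python 3.12+)."""

    def __init__(self, slot: bytearray, on_release) -> None:
        self._slot = slot
        self._on_release = on_release
        self._exports = 0

    def __buffer__(self, flags: int) -> memoryview:
        if self._slot is None:
            raise BufferError("slot was already returned to the pool")
        self._exports += 1
        return memoryview(self._slot)

    def __release_buffer__(self, view: memoryview) -> None:
        view.release()
        self._exports -= 1
        if self._exports == 0:
            slot, self._slot = self._slot, None
            self._on_release(slot)  # the "finalizer": hand the slot back


def grab(pool: SlotPool, pixels: bytes):
    """Hypothetical capture: fill a slot, then expose it to the caller."""
    slot = pool.acquire()
    slot[: len(pixels)] = pixels  # stand-in for the server filling the segment
    if sys.version_info >= (3, 12):
        return LeasedBuffer(slot, pool.release)  # zero-copy path
    data = bytes(slot)  # pre-3.12: copy, then finalize immediately
    pool.release(slot)
    return data
```

On the zero-copy path, a consumer that slices the exported `memoryview` keeps the slot leased until every derived view has been released, which is the "downstream memoryview trees" case mentioned under Testing. A lease that is never exported is not reclaimed in this sketch; a real backend would need to handle that case.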
That's so good! 🚀
Thank you! @halldorfannar has been unavailable, but is back in action now. I'd like him to take a look at this before you commit it, but I think it's in good shape. I think it's going to be pretty easy to use this for Windows GDI. I haven't yet looked at the macOS side; it's probably not hard to speed it up with this too, but I haven't done macOS development in many years. Once we commit this part, we might open an issue to see if other contributors want to tackle it.
halldorfannar left a comment
Good job!
I had one minor nit about a docstring (see comments), but I also had a bigger issue that I feel we should document somewhere: the memory requirements of the new fast path. Since by default we allocate two color buffers of the virtual monitor size, this can be a substantial amount of memory. For example, I typically run a 4k monitor and a laptop screen, and these are offset from each other, creating a rectangle that is much larger than the sum of these two screens. Depending on how people organize their code, we may end up with even more buffers than just two. I think we should mention this extra memory requirement somewhere, so users are aware.
I don't mind this approach, especially as a start and a way to do this with minimal changes to the existing API. I just feel we need to surface it. In the future we can look at ways for users to give us hints as they initialize the library, so we can use less memory. But that is future music.
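To put rough numbers on that concern, here is a small back-of-the-envelope sketch. The monitor layout is hypothetical (a 3840×2160 display with a 1920×1080 laptop panel placed diagonally below and to the right), and it assumes 4 bytes per pixel and two buffers, as described above.

```python
from typing import NamedTuple


class Monitor(NamedTuple):
    left: int
    top: int
    width: int
    height: int


def bounding_box(monitors: list[Monitor]) -> Monitor:
    """Smallest rectangle that covers every monitor (the virtual monitor)."""
    left = min(m.left for m in monitors)
    top = min(m.top for m in monitors)
    right = max(m.left + m.width for m in monitors)
    bottom = max(m.top + m.height for m in monitors)
    return Monitor(left, top, right - left, bottom - top)


def buffer_bytes(m: Monitor, bytes_per_pixel: int = 4) -> int:
    """Bytes needed for one full-size buffer (assumes 32-bit pixels)."""
    return m.width * m.height * bytes_per_pixel


# Hypothetical layout: the laptop panel sits diagonally off the 4k display.
monitors = [Monitor(0, 0, 3840, 2160), Monitor(3840, 2160, 1920, 1080)]
virtual = bounding_box(monitors)

sum_of_screens = sum(buffer_bytes(m) for m in monitors)  # 41,472,000 bytes
one_buffer = buffer_bytes(virtual)                       # 74,649,600 bytes
two_buffers = 2 * one_buffer                             # 149,299,200 bytes
```

In this layout a single virtual-monitor buffer is already about 1.8 times the size of the two screens combined, and two of them come to roughly 142 MiB.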
* By the finalizer, if the slot is released after the MSS object
  is closed

If the connection is being closed (rather than just falling back
I don't feel this part of the doc string correctly reflects what the code does. This sounds more correct:
If the connection is not already being closed, we explicitly tell the server
that we're done with the memory region by calling shm_detach. Conversely, during
connection close, we skip explicit detach and let the server clean up the SHM
resources when the connection is closed.
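The control flow that wording describes could be sketched roughly as follows. This is only an illustration of the suggested docstring, not the backend's actual code: `FakeConnection`, its `closing` flag, and `destroy_slot` are hypothetical names, and `shm_detach` here is a recording stub rather than a real X request.

```python
from dataclasses import dataclass, field


@dataclass
class FakeConnection:
    """Hypothetical stand-in for an X connection; it only records requests."""

    closing: bool = False
    requests: list = field(default_factory=list)

    def shm_detach(self, shmseg: int) -> None:
        self.requests.append(("shm_detach", shmseg))


def destroy_slot(conn: FakeConnection, shmseg: int) -> bool:
    """Return True if an explicit detach was sent to the server."""
    if conn.closing:
        # During connection close, skip the explicit detach and let the
        # server clean up the SHM resources when the connection goes away.
        return False
    # Otherwise, tell the server explicitly that we are done with the region.
    conn.shm_detach(shmseg)
    return True
```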
Changes proposed in this PR
Fixes #424 (I accidentally said this was a duplicate of #476, but it's actually separate)
May be relevant to #222, as it lays the groundwork to lower CPU usage. However, this PR doesn't affect Windows.
./check.sh passed