Make the CUDA caller-stream guard a shared extension/cuda library (#20158)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20158
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 144 Pending
As of commit 833360e with merge base 6ca98b3. This comment was automatically generated by Dr. CI and updates every 15 minutes.

@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108023495.
Pull request overview
This PR extracts the caller-selected CUDA stream handshake (`CallerStreamGuard` / `getCallerStream()`) into a new standalone `extension/cuda` library and wires CUDA backends to honor that caller stream (ensuring a single process-wide thread-local by building/linking as a shared library).
Changes:
- Add a new `extension/cuda` shared library (`extension_cuda`) that exports `CallerStreamGuard` / `getCallerStream()` for cross-backend stream selection.
- Update the CUDA backend runtime and SlimTensor CUDA paths to run work (and blocking memcpy semantics) on the caller-selected stream when present, with safe restoration of prior stream state.
- Extend CUDA stream guard utilities with non-creating “peek” + “clear” APIs and add unit tests for the new stream handshake and peek/clear behavior.
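
As a rough illustration of the third bullet, here is a minimal sketch of what non-creating "peek" and "clear" operations over a thread-local stream registry might look like, and how they allow prior stream state to be restored exactly. Everything in it is an assumption for illustration: `StreamHandle` stands in for `cudaStream_t` so the sketch builds without the CUDA toolkit, and the `peek_sketch` namespace, `registry()`, `setCurrentStream()`, and `ScopedStreamOverride` are hypothetical names, not the PR's code. Only the ideas of peeking without creating a stream and clearing an entry come from the summary above.

```cpp
#include <cassert>
#include <optional>
#include <unordered_map>

// Hypothetical stand-in for cudaStream_t (an opaque pointer type in CUDA).
using StreamHandle = void*;

namespace peek_sketch {

// One registry per thread, keyed by device index (assumed layout).
inline std::unordered_map<int, StreamHandle>& registry() {
  thread_local std::unordered_map<int, StreamHandle> streams;
  return streams;
}

inline void setCurrentStream(int device, StreamHandle stream) {
  registry()[device] = stream;
}

// Non-creating lookup: reports "nothing set" instead of making a stream.
inline std::optional<StreamHandle> peekCurrentStream(int device) {
  const auto& streams = registry();
  const auto it = streams.find(device);
  if (it == streams.end()) {
    return std::nullopt;
  }
  return it->second;
}

// Drops the entry so a later peek sees "nothing set" again.
inline void clearCurrentStream(int device) {
  registry().erase(device);
}

// Installs a stream for one scope and puts back exactly what was there before:
// the previous stream if one was set, otherwise no entry at all.
class ScopedStreamOverride {
 public:
  ScopedStreamOverride(int device, StreamHandle stream)
      : device_(device), previous_(peekCurrentStream(device)) {
    setCurrentStream(device, stream);
  }
  ~ScopedStreamOverride() {
    if (previous_) {
      setCurrentStream(device_, *previous_);
    } else {
      clearCurrentStream(device_);
    }
  }
  ScopedStreamOverride(const ScopedStreamOverride&) = delete;
  ScopedStreamOverride& operator=(const ScopedStreamOverride&) = delete;

 private:
  int device_;
  std::optional<StreamHandle> previous_;
};

} // namespace peek_sketch
```

The reason for a non-creating peek in this sketch is that a "get or create" lookup cannot tell "no stream was set" apart from "a stream was set", so the override could not leave the registry empty again on exit.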
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| extension/cuda/targets.bzl | Adds Buck target for the new caller-stream library. |
| extension/cuda/TARGETS | Hooks common targets for fbcode builds. |
| extension/cuda/BUCK | Hooks common targets for OSS/xplat builds. |
| extension/cuda/export.h | Defines symbol visibility/export macros for extension_cuda. |
| extension/cuda/CMakeLists.txt | Builds and installs extension_cuda as a SHARED library. |
| extension/cuda/caller_stream.h | Public API for CallerStreamGuard / getCallerStream(). |
| extension/cuda/caller_stream.cpp | Implements the thread-local caller stream handshake. |
| CMakeLists.txt | Adds extension/cuda subdir + installs headers; registers extension_cuda as an extension. |
| backends/cuda/runtime/TARGETS | Links CUDA runtime backend against the new caller-stream target. |
| backends/cuda/runtime/cuda_backend.cpp | Uses getCallerStream() to select/restore execution stream; blocks caller-stream + CUDA graph together. |
| backends/cuda/CMakeLists.txt | Links aoti_cuda_backend against extension_cuda. |
| backends/aoti/slim/cuda/test/test_cuda_stream_guard.cpp | Adds tests for CallerStreamGuard and for peek/clear stream registry behavior. |
| backends/aoti/slim/cuda/test/targets.bzl | Adds test dependency on the new caller-stream target. |
| backends/aoti/slim/cuda/guard.h | Declares peekCurrentCUDAStream() and clearCurrentCUDAStream(). |
| backends/aoti/slim/cuda/guard.cpp | Implements peekCurrentCUDAStream() and clearCurrentCUDAStream() over thread-local registry. |
| backends/aoti/slim/core/targets.bzl | Adds dependency on caller-stream library for SlimTensor core. |
| backends/aoti/slim/core/storage.h | Uses caller stream to keep memcpy semantics confined (green context) when a caller stream is active. |
| backends/aoti/CMakeLists.txt | Links SlimTensor interface to extension_cuda when CUDA is enabled. |
…torch#20158)

Summary:
Move the caller-stream handshake (`CallerStreamGuard` + `getCallerStream()`) out of the CUDA backend's `backends/aoti/slim/cuda/guard` into a standalone `extension/cuda` library, and build that library as SHARED so several CUDA backends can share one caller-selected stream.

The handshake is a process-wide thread-local: the caller records the stream it wants, and each backend reads it. That only works if there is exactly one copy of the thread-local in the process. If the library were static and linked into two shared objects (for example the CUDA backend and a TensorRT delegate, each whole-archived for backend registration), each shared object would get its own copy, so the caller would write one and the backend would read the other and silently ignore the caller's stream. Building `extension_cuda` as SHARED gives one definition that every consumer references. It must be linked PUBLIC and never whole-archived.

The two public functions are exported through a visibility macro (`extension/cuda/export.h`, mirroring `backends/aoti/export.h`) while the thread-local stays internal to the library. The C++ API is used directly: `getCallerStream()` returns `std::optional<cudaStream_t>`, a trivially copyable pointer and bool that does not depend on the libstdc++ CXX11 ABI, so no C ABI is needed. The header is installed so an external project (such as a TensorRT delegate) can include it.

Differential Revision: D108023495
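
The handshake described in this summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `extension/cuda/caller_stream.cpp`: `StreamHandle` stands in for `cudaStream_t` so the sketch builds without the CUDA toolkit, the `handshake_sketch` namespace, `detail::slot()`, and `selectStream()` are hypothetical, and restoring the previous value in the destructor is an assumed design choice. The names `CallerStreamGuard` and `getCallerStream()` and the `std::optional` return type are taken from the summary.

```cpp
#include <cassert>
#include <optional>
#include <type_traits>

// Hypothetical stand-in for cudaStream_t (an opaque pointer type in CUDA).
using StreamHandle = void*;

// The summary relies on the return type being trivially copyable. This holds
// for std::optional of a pointer on mainstream C++17 standard libraries.
static_assert(std::is_trivially_copyable_v<std::optional<StreamHandle>>);

namespace handshake_sketch {

namespace detail {
// The single thread-local slot. In this header-only sketch it is inline; the
// PR instead keeps it inside one shared library so that every consumer in the
// process reads and writes the same copy.
inline std::optional<StreamHandle>& slot() {
  thread_local std::optional<StreamHandle> stream;
  return stream;
}
} // namespace detail

// Backend side: read the stream the caller asked for, if any.
inline std::optional<StreamHandle> getCallerStream() {
  return detail::slot();
}

// Caller side: record a stream for the lifetime of the guard, then restore
// whatever was recorded before (assumed behaviour, so guards can nest).
class CallerStreamGuard {
 public:
  explicit CallerStreamGuard(StreamHandle stream) : previous_(detail::slot()) {
    detail::slot() = stream;
  }
  ~CallerStreamGuard() {
    detail::slot() = previous_;
  }
  CallerStreamGuard(const CallerStreamGuard&) = delete;
  CallerStreamGuard& operator=(const CallerStreamGuard&) = delete;

 private:
  std::optional<StreamHandle> previous_;
};

// Hypothetical backend helper: prefer the caller's stream, else fall back to
// the backend's own stream.
inline StreamHandle selectStream(StreamHandle backend_default) {
  return getCallerStream().value_or(backend_default);
}

} // namespace handshake_sketch
```

If two shared objects each carried their own copy of `detail::slot()`, the guard in one would write a slot that `getCallerStream()` in the other never reads, and `selectStream()` would silently return the fallback. That is the failure mode the SHARED build is meant to rule out.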
32eadf3 to 6eaeaa8
6eaeaa8 to 4a14aa5
15c98ef to 4a14aa5
4a14aa5 to 90c4d03
90c4d03 to 46c7e3e
46c7e3e to a3a974e
a3a974e to 833360e
extension/cuda/CMakeLists.txt applied ${_common_compile_options} PUBLIC without a $<COMPILE_LANGUAGE:CXX> guard. After #20158 wired extension_cuda into slimtensor's INTERFACE (backends/aoti/CMakeLists.txt), that option (/wd4996 on MSVC) propagates transitively into the aoti_cuda_shims .cu compile. nvcc receives a bare /wd4996 and treats it as a second input, failing with 'nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified'. That breaks the Windows CUDA build and install, so executorchConfig.cmake is never produced and the e2e model runners fail at find_package(executorch).
Guard the options with $<COMPILE_LANGUAGE:CXX>, matching every other target in the CUDA stack (backends/cuda, backends/aoti).
extension/cuda feeds the CUDA build (it is linked into slimtensor / aoti_cuda_backend), so changes there can break the Windows CUDA build (as #20158 did) but were not triggering this workflow. Add extension/cuda to the pull_request paths and the per-job changed-files conditions.
…20184) Guard extension_cuda's ${_common_compile_options} with $<COMPILE_LANGUAGE:CXX> so the MSVC /wd4996 flag no longer leaks (via slimtensor INTERFACE, added in #20158) into the aoti_cuda_shims .cu nvcc compile, which failed with 'nvcc fatal: A single input file is required'. Also run the cuda-windows workflow on extension/cuda changes. Verified: Windows CUDA e2e 5/6 green (was 0/6).