
Make the CUDA caller-stream guard a shared extension/cuda library (#20158)

Merged: shoumikhin merged 1 commit into pytorch:main from shoumikhin:export-D108023495 on Jun 10, 2026

shoumikhin (Contributor) commented Jun 9, 2026

Summary:

Move the caller-stream handshake (CallerStreamGuard + getCallerStream()) out of the CUDA backend's backends/aoti/slim/cuda/guard into a standalone extension/cuda library, and build that library as SHARED so several CUDA backends can share one caller-selected stream.

The handshake is a thread-local variable shared process-wide: the caller records the stream it wants, and each backend reads it. That only works if the process contains exactly one copy of the thread-local. If the library were static and linked into two shared objects (for example the CUDA backend and a TensorRT delegate, each whole-archived for backend registration), each shared object would get its own copy. The caller would then write one copy while the backend read the other, and the backend would silently ignore the caller's stream. Building extension_cuda as SHARED gives one definition that every consumer references. It must be linked PUBLIC and never whole-archived.
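
The handshake described above can be pictured with a minimal sketch. It is not the PR's implementation: `cudaStream_t` is replaced by a stand-in so the code compiles without the CUDA toolkit, the `sketch` namespace and `g_caller_stream` are hypothetical names, and the choice to restore the previous value on destruction is an assumption about how `CallerStreamGuard` behaves.

```cpp
#include <cassert>
#include <optional>

// Stand-in so the sketch builds without CUDA; real code would include
// <cuda_runtime.h> and use its cudaStream_t instead.
struct FakeStream {};
using cudaStream_t = FakeStream*;

namespace sketch {
namespace {
// One slot per thread. The point of the PR is that this variable must be
// defined in exactly one binary in the process (assumed: the shared library).
thread_local std::optional<cudaStream_t> g_caller_stream;
} // namespace

// Backend side: read the stream the caller asked for, if any.
std::optional<cudaStream_t> getCallerStream() {
  return g_caller_stream;
}

// Caller side: publish a stream for the lifetime of the guard.
// Restoring the previous value (rather than clearing) is an assumption.
class CallerStreamGuard {
 public:
  explicit CallerStreamGuard(cudaStream_t stream)
      : stream_(stream), previous_(g_caller_stream) {
    g_caller_stream = stream_;
  }
  ~CallerStreamGuard() {
    // Guards are expected to be destroyed in reverse order of creation.
    assert(g_caller_stream.has_value() && *g_caller_stream == stream_);
    g_caller_stream = previous_;
  }
  CallerStreamGuard(const CallerStreamGuard&) = delete;
  CallerStreamGuard& operator=(const CallerStreamGuard&) = delete;

 private:
  cudaStream_t stream_;
  std::optional<cudaStream_t> previous_;
};
} // namespace sketch
```

If this translation unit were compiled into two separate shared objects, each would carry its own `g_caller_stream`, which is the failure mode the paragraph above warns about.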

The two public functions are exported through a visibility macro (extension/cuda/export.h, mirroring backends/aoti/export.h) while the thread-local stays internal to the library. The C++ API is used directly: getCallerStream() returns std::optional<cudaStream_t>, a trivially copyable pointer and bool that does not depend on the libstdc++ CXX11 ABI, so no C ABI is needed. The header is installed so an external project (such as a TensorRT delegate) can include it.
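
A rough sketch of the export pattern and the ABI argument follows. The macro name `SKETCH_EXPORT`, the `export_sketch` namespace, and the `setCallerStream()` helper are hypothetical; the real macro lives in `extension/cuda/export.h` and may be spelled and structured differently. `cudaStream_t` is again a stand-in so the code builds without CUDA.

```cpp
#include <optional>
#include <type_traits>

// Normally defined by the build system only while compiling the library itself.
#define SKETCH_BUILDING_LIBRARY 1

// Hypothetical visibility macro in the usual cross-platform shape.
#if defined(_WIN32)
#if defined(SKETCH_BUILDING_LIBRARY)
#define SKETCH_EXPORT __declspec(dllexport)
#else
#define SKETCH_EXPORT __declspec(dllimport)
#endif
#else
#define SKETCH_EXPORT __attribute__((visibility("default")))
#endif

// Stand-in for the CUDA handle type, which is a plain pointer.
struct FakeStream {};
using cudaStream_t = FakeStream*;

namespace export_sketch {

// Only the functions are exported; consumers never touch the storage directly.
SKETCH_EXPORT std::optional<cudaStream_t> getCallerStream();
SKETCH_EXPORT void setCallerStream(std::optional<cudaStream_t> stream);

// The property the summary relies on: an optional of a pointer carries no
// std::string-like members, so it can cross the library boundary by value.
static_assert(
    std::is_trivially_copyable_v<std::optional<cudaStream_t>>,
    "optional<pointer> is expected to be trivially copyable");

namespace {
// Internal linkage: the thread-local itself is not part of the exported API.
thread_local std::optional<cudaStream_t> g_slot;
} // namespace

std::optional<cudaStream_t> getCallerStream() {
  return g_slot;
}

void setCallerStream(std::optional<cudaStream_t> stream) {
  g_slot = stream;
}

} // namespace export_sketch
```

Keeping the variable behind exported accessor functions means every consumer resolves to the single definition inside the shared library.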

Differential Revision: D108023495

Copilot AI review requested due to automatic review settings June 9, 2026 16:09
pytorch-bot Bot commented Jun 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20158

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 144 Pending

As of commit 833360e with merge base 6ca98b3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 9, 2026
meta-codesync Bot (Contributor) commented Jun 9, 2026

@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108023495.

Copilot AI (Contributor) left a comment

Pull request overview

This PR extracts the caller-selected CUDA stream handshake (CallerStreamGuard / getCallerStream()) into a new standalone extension/cuda library and wires CUDA backends to honor that caller stream (ensuring a single process-wide thread-local by building/linking as a shared library).

Changes:

  • Add a new extension/cuda shared library (extension_cuda) that exports CallerStreamGuard / getCallerStream() for cross-backend stream selection.
  • Update the CUDA backend runtime and SlimTensor CUDA paths to run work (and blocking memcpy semantics) on the caller-selected stream when present, with safe restoration of prior stream state.
  • Extend CUDA stream guard utilities with non-creating “peek” + “clear” APIs and add unit tests for the new stream handshake and peek/clear behavior.
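
The backend-side behavior in the bullets above (use the caller stream when present, then restore the prior stream state) might look roughly like the sketch below. All names here are hypothetical: `peekCurrentStream`, `setCurrentStream`, `clearCurrentStream`, and `ScopedExecutionStream` only imitate the idea behind `peekCurrentCUDAStream()` and `clearCurrentCUDAStream()`, whose real signatures are not shown in this conversation. `cudaStream_t` is a stand-in type.

```cpp
#include <cassert>
#include <optional>
#include <unordered_map>

// Stand-in so the sketch builds without the CUDA toolkit.
struct FakeStream {};
using cudaStream_t = FakeStream*;

namespace backend_sketch {

// Hypothetical per-thread registry: device index -> current stream.
thread_local std::unordered_map<int, cudaStream_t> g_current_streams;

// Non-creating lookup: reports what is registered, never allocates a stream.
std::optional<cudaStream_t> peekCurrentStream(int device) {
  auto it = g_current_streams.find(device);
  if (it == g_current_streams.end()) {
    return std::nullopt;
  }
  return it->second;
}

void setCurrentStream(int device, cudaStream_t stream) {
  g_current_streams[device] = stream;
}

// Removes the entry so a later peek reports "nothing registered".
void clearCurrentStream(int device) {
  g_current_streams.erase(device);
}

// Switches to the caller stream if one was provided, and puts the registry
// back exactly as it was (including "empty") when the scope ends.
class ScopedExecutionStream {
 public:
  ScopedExecutionStream(int device, std::optional<cudaStream_t> caller)
      : device_(device),
        active_(caller.has_value()),
        previous_(peekCurrentStream(device)) {
    assert(device >= 0);
    if (active_) {
      setCurrentStream(device_, *caller);
    }
  }
  ~ScopedExecutionStream() {
    if (!active_) {
      return; // No caller stream: nothing was changed.
    }
    if (previous_.has_value()) {
      setCurrentStream(device_, *previous_);
    } else {
      clearCurrentStream(device_);
    }
  }
  ScopedExecutionStream(const ScopedExecutionStream&) = delete;
  ScopedExecutionStream& operator=(const ScopedExecutionStream&) = delete;

 private:
  int device_;
  bool active_;
  std::optional<cudaStream_t> previous_;
};

} // namespace backend_sketch
```

The peek/clear pair matters for restoration: without a non-creating lookup, merely checking the prior state could register a new stream, and without clear there would be no way to return to the "nothing registered" state.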

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| extension/cuda/targets.bzl | Adds Buck target for the new caller-stream library. |
| extension/cuda/TARGETS | Hooks common targets for fbcode builds. |
| extension/cuda/BUCK | Hooks common targets for OSS/xplat builds. |
| extension/cuda/export.h | Defines symbol visibility/export macros for extension_cuda. |
| extension/cuda/CMakeLists.txt | Builds and installs extension_cuda as a SHARED library. |
| extension/cuda/caller_stream.h | Public API for CallerStreamGuard / getCallerStream(). |
| extension/cuda/caller_stream.cpp | Implements the thread-local caller stream handshake. |
| CMakeLists.txt | Adds extension/cuda subdir + installs headers; registers extension_cuda as an extension. |
| backends/cuda/runtime/TARGETS | Links CUDA runtime backend against the new caller-stream target. |
| backends/cuda/runtime/cuda_backend.cpp | Uses getCallerStream() to select/restore execution stream; blocks caller-stream + CUDA graph together. |
| backends/cuda/CMakeLists.txt | Links aoti_cuda_backend against extension_cuda. |
| backends/aoti/slim/cuda/test/test_cuda_stream_guard.cpp | Adds tests for CallerStreamGuard and for peek/clear stream registry behavior. |
| backends/aoti/slim/cuda/test/targets.bzl | Adds test dependency on the new caller-stream target. |
| backends/aoti/slim/cuda/guard.h | Declares peekCurrentCUDAStream() and clearCurrentCUDAStream(). |
| backends/aoti/slim/cuda/guard.cpp | Implements peekCurrentCUDAStream() and clearCurrentCUDAStream() over thread-local registry. |
| backends/aoti/slim/core/targets.bzl | Adds dependency on caller-stream library for SlimTensor core. |
| backends/aoti/slim/core/storage.h | Uses caller stream to keep memcpy semantics confined (green context) when a caller stream is active. |
| backends/aoti/CMakeLists.txt | Links SlimTensor interface to extension_cuda when CUDA is enabled. |


Comment thread extension/cuda/targets.bzl
github-actions Bot commented Jun 9, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot changed the title from "Make the CUDA caller-stream guard a shared extension/cuda library" to "Make the CUDA caller-stream guard a shared extension/cuda library (#20158)" Jun 9, 2026
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 9, 2026
@shoumikhin shoumikhin force-pushed the export-D108023495 branch from 32eadf3 to 6eaeaa8 Compare June 9, 2026 16:41
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 9, 2026
@shoumikhin shoumikhin force-pushed the export-D108023495 branch from 6eaeaa8 to 4a14aa5 Compare June 9, 2026 17:16
Copilot AI review requested due to automatic review settings June 9, 2026 17:16
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 9, 2026
@shoumikhin shoumikhin force-pushed the export-D108023495 branch 2 times, most recently from 15c98ef to 4a14aa5 Compare June 9, 2026 17:16

Copilot AI (Contributor) left a comment

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

Comment thread extension/cuda/CMakeLists.txt Outdated
Comment thread extension/cuda/CMakeLists.txt
Comment thread extension/cuda/caller_stream.h Outdated
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 9, 2026
Copilot AI review requested due to automatic review settings June 9, 2026 17:29
@shoumikhin shoumikhin force-pushed the export-D108023495 branch from 4a14aa5 to 90c4d03 Compare June 9, 2026 17:29

Copilot AI (Contributor) left a comment

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

Comment thread extension/cuda/targets.bzl
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 9, 2026
@shoumikhin shoumikhin force-pushed the export-D108023495 branch from 90c4d03 to 46c7e3e Compare June 9, 2026 17:49
Copilot AI review requested due to automatic review settings June 10, 2026 04:28
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 10, 2026

Copilot AI (Contributor) left a comment

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

Comment thread extension/cuda/targets.bzl Outdated
@shoumikhin shoumikhin merged commit 0b13b6a into pytorch:main Jun 10, 2026
227 of 237 checks passed
shoumikhin added a commit that referenced this pull request Jun 10, 2026
extension/cuda/CMakeLists.txt applied ${_common_compile_options} PUBLIC without a $<COMPILE_LANGUAGE:CXX> guard. After #20158 wired extension_cuda into slimtensor's INTERFACE (backends/aoti/CMakeLists.txt), that option (/wd4996 on MSVC) propagates transitively into the aoti_cuda_shims .cu compile. nvcc receives a bare /wd4996 and treats it as a second input, failing with 'nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified'. That breaks the Windows CUDA build and install, so executorchConfig.cmake is never produced and the e2e model runners fail at find_package(executorch).

Guard the options with $<COMPILE_LANGUAGE:CXX>, matching every other target in the CUDA stack (backends/cuda, backends/aoti).
shoumikhin added a commit that referenced this pull request Jun 10, 2026
extension/cuda feeds the CUDA build (it is linked into slimtensor / aoti_cuda_backend), so changes there can break the Windows CUDA build (as #20158 did) but were not triggering this workflow. Add extension/cuda to the pull_request paths and the per-job changed-files conditions.
shoumikhin added a commit that referenced this pull request Jun 10, 2026
…20184)

Guard extension_cuda's ${_common_compile_options} with $<COMPILE_LANGUAGE:CXX> so the MSVC /wd4996 flag no longer leaks (via slimtensor INTERFACE, added in #20158) into the aoti_cuda_shims .cu nvcc compile, which failed with 'nvcc fatal: A single input file is required'. Also run the cuda-windows workflow on extension/cuda changes. Verified: Windows CUDA e2e 5/6 green (was 0/6).

Labels

ciflow/cuda, CLA Signed, meta-exported


3 participants