Make the CUDA caller-stream guard a shared extension/cuda library (#20158) by shoumikhin · Pull Request #20158 · pytorch/executorch

shoumikhin · 2026-06-09T16:09:05Z

Summary:

Move the caller-stream handshake (CallerStreamGuard + getCallerStream()) out of the CUDA backend's backends/aoti/slim/cuda/guard into a standalone extension/cuda library, and build that library as SHARED so several CUDA backends can share one caller-selected stream.

The handshake is a process-wide thread-local: the caller records the stream it wants, and each backend reads it. That only works if there is exactly one copy of the thread-local in the process. If the library were static and linked into two shared objects (for example the CUDA backend and a TensorRT delegate, each whole-archived for backend registration), each shared object would get its own copy, so the caller would write one and the backend would read the other and silently ignore the caller's stream. Building extension_cuda as SHARED gives one definition that every consumer references. It must be linked PUBLIC and never whole-archived.
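The handshake described above can be pictured with a minimal single-file sketch. It is not the code from `extension/cuda/caller_stream.cpp`: the `sketch` namespace, the `handshakeRoundTrips()` helper, and the restore-previous-value behavior of the guard are assumptions made for illustration, and `cudaStream_t` is a stand-in declaration so the example builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Stand-in so the sketch builds without the CUDA toolkit. The CUDA runtime
// headers declare cudaStream_t as a pointer to the opaque CUstream_st.
struct CUstream_st;
using cudaStream_t = CUstream_st*;

namespace sketch {
namespace detail {
// The per-thread slot. It is only "the" slot if this definition lives in
// exactly one shared object in the process.
thread_local std::optional<cudaStream_t> caller_stream;
} // namespace detail

// Backend side: read the stream the caller recorded, if any.
inline std::optional<cudaStream_t> getCallerStream() {
  return detail::caller_stream;
}

// Caller side: record a stream for the lifetime of the guard.
// Restoring the previous value on destruction is an assumption of this sketch.
class CallerStreamGuard {
 public:
  explicit CallerStreamGuard(cudaStream_t stream)
      : previous_(detail::caller_stream) {
    detail::caller_stream = stream;
  }
  ~CallerStreamGuard() { detail::caller_stream = previous_; }
  CallerStreamGuard(const CallerStreamGuard&) = delete;
  CallerStreamGuard& operator=(const CallerStreamGuard&) = delete;

 private:
  std::optional<cudaStream_t> previous_;
};

// Hypothetical check: a reader sees the stream only while the guard is alive.
inline bool handshakeRoundTrips() {
  const auto fake = reinterpret_cast<cudaStream_t>(std::uintptr_t{0x1});
  if (getCallerStream().has_value()) return false;
  {
    CallerStreamGuard guard(fake);
    const auto seen = getCallerStream();
    if (!seen.has_value() || *seen != fake) return false;
  }
  return !getCallerStream().has_value();
}
} // namespace sketch
```

If `detail::caller_stream` were compiled into two shared objects, a guard constructed through one copy would leave the other copy empty, which is the silent failure the summary describes.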

The two public functions are exported through a visibility macro (extension/cuda/export.h, mirroring backends/aoti/export.h) while the thread-local stays internal to the library. The C++ API is used directly: getCallerStream() returns std::optional<cudaStream_t>, a trivially copyable pointer and bool that does not depend on the libstdc++ CXX11 ABI, so no C ABI is needed. The header is installed so an external project (such as a TensorRT delegate) can include it.
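As a rough illustration of the two points above, the sketch below pairs a visibility macro with a compile-time check on the return type. The macro names `SKETCH_CUDA_API` and `SKETCH_CUDA_BUILDING` are hypothetical (the real names in `extension/cuda/export.h` are not shown here), `cudaStream_t` is a stand-in declaration so the file builds without the CUDA toolkit, and the non-Windows branch assumes a GCC- or Clang-compatible compiler.

```cpp
#include <cassert>
#include <optional>
#include <type_traits>

// Stand-in for the CUDA runtime's opaque stream handle type.
struct CUstream_st;
using cudaStream_t = CUstream_st*;

// Normally set only by the library's own build; defined here because this
// single file plays the role of the library.
#define SKETCH_CUDA_BUILDING 1

// Hypothetical export macro in the usual dllexport / default-visibility shape.
#if defined(_WIN32)
#if defined(SKETCH_CUDA_BUILDING)
#define SKETCH_CUDA_API __declspec(dllexport)
#else
#define SKETCH_CUDA_API __declspec(dllimport)
#endif
#else
#define SKETCH_CUDA_API __attribute__((visibility("default")))
#endif

namespace sketch {

// The return type is a pointer plus an engaged flag, so it can be checked
// at compile time to be trivially copyable.
static_assert(
    std::is_trivially_copyable_v<std::optional<cudaStream_t>>,
    "optional<cudaStream_t> is expected to be trivially copyable");

// Exported entry point: the only way consumers reach the slot.
SKETCH_CUDA_API std::optional<cudaStream_t> getCallerStream();

namespace {
// Internal linkage: the thread-local itself is never exported.
thread_local std::optional<cudaStream_t> caller_stream;
} // namespace

std::optional<cudaStream_t> getCallerStream() {
  return caller_stream;
}

} // namespace sketch
```

Keeping the variable behind exported functions means consumers never bind to the thread-local symbol directly, so only the function symbols need to resolve across shared-object boundaries.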

Differential Revision: D108023495

pytorch-bot · 2026-06-09T16:09:09Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20158

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 144 Pending

As of commit 833360e with merge base 6ca98b3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-06-09T16:09:15Z

@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108023495.

Copilot

Pull request overview

This PR extracts the caller-selected CUDA stream handshake (CallerStreamGuard / getCallerStream()) into a new standalone extension/cuda library and wires CUDA backends to honor that caller stream (ensuring a single process-wide thread-local by building/linking as a shared library).

Changes:

- Add a new extension/cuda shared library (extension_cuda) that exports CallerStreamGuard / getCallerStream() for cross-backend stream selection.
- Update the CUDA backend runtime and SlimTensor CUDA paths to run work (and blocking memcpy semantics) on the caller-selected stream when present, with safe restoration of prior stream state.
- Extend CUDA stream guard utilities with non-creating “peek” + “clear” APIs and add unit tests for the new stream handshake and peek/clear behavior.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| extension/cuda/targets.bzl | Adds Buck target for the new caller-stream library. |
| extension/cuda/TARGETS | Hooks common targets for fbcode builds. |
| extension/cuda/BUCK | Hooks common targets for OSS/xplat builds. |
| extension/cuda/export.h | Defines symbol visibility/export macros for `extension_cuda`. |
| extension/cuda/CMakeLists.txt | Builds and installs `extension_cuda` as a SHARED library. |
| extension/cuda/caller_stream.h | Public API for `CallerStreamGuard` / `getCallerStream()`. |
| extension/cuda/caller_stream.cpp | Implements the thread-local caller stream handshake. |
| CMakeLists.txt | Adds `extension/cuda` subdir + installs headers; registers `extension_cuda` as an extension. |
| backends/cuda/runtime/TARGETS | Links CUDA runtime backend against the new caller-stream target. |
| backends/cuda/runtime/cuda_backend.cpp | Uses `getCallerStream()` to select/restore execution stream; blocks caller-stream + CUDA graph together. |
| backends/cuda/CMakeLists.txt | Links `aoti_cuda_backend` against `extension_cuda`. |
| backends/aoti/slim/cuda/test/test_cuda_stream_guard.cpp | Adds tests for `CallerStreamGuard` and for peek/clear stream registry behavior. |
| backends/aoti/slim/cuda/test/targets.bzl | Adds test dependency on the new caller-stream target. |
| backends/aoti/slim/cuda/guard.h | Declares `peekCurrentCUDAStream()` and `clearCurrentCUDAStream()`. |
| backends/aoti/slim/cuda/guard.cpp | Implements `peekCurrentCUDAStream()` and `clearCurrentCUDAStream()` over thread-local registry. |
| backends/aoti/slim/core/targets.bzl | Adds dependency on caller-stream library for SlimTensor core. |
| backends/aoti/slim/core/storage.h | Uses caller stream to keep `memcpy` semantics confined (green context) when a caller stream is active. |
| backends/aoti/CMakeLists.txt | Links SlimTensor interface to `extension_cuda` when CUDA is enabled. |

github-actions · 2026-06-09T16:17:19Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

extension/cuda/CMakeLists.txt applied ${_common_compile_options} PUBLIC without a $<COMPILE_LANGUAGE:CXX> guard. After #20158 wired extension_cuda into slimtensor's INTERFACE (backends/aoti/CMakeLists.txt), that option (/wd4996 on MSVC) propagates transitively into the aoti_cuda_shims .cu compile. nvcc receives a bare /wd4996 and treats it as a second input, failing with 'nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified'. That breaks the Windows CUDA build and install, so executorchConfig.cmake is never produced and the e2e model runners fail at find_package(executorch). Guard the options with $<COMPILE_LANGUAGE:CXX>, matching every other target in the CUDA stack (backends/cuda, backends/aoti).

extension/cuda feeds the CUDA build (it is linked into slimtensor / aoti_cuda_backend), so changes there can break the Windows CUDA build (as #20158 did) but were not triggering this workflow. Add extension/cuda to the pull_request paths and the per-job changed-files conditions.

…20184) Guard extension_cuda's ${_common_compile_options} with $<COMPILE_LANGUAGE:CXX> so the MSVC /wd4996 flag no longer leaks (via slimtensor INTERFACE, added in #20158) into the aoti_cuda_shims .cu nvcc compile, which failed with 'nvcc fatal: A single input file is required'. Also run the cuda-windows workflow on extension/cuda changes. Verified: Windows CUDA e2e 5/6 green (was 0/6).

Copilot AI review requested due to automatic review settings June 9, 2026 16:09

shoumikhin requested review from kirklandsign and larryliu0820 as code owners June 9, 2026 16:09

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 9, 2026

meta-codesync Bot added the meta-exported label Jun 9, 2026

Copilot started reviewing on behalf of shoumikhin June 9, 2026 16:09 View session

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Comment thread extension/cuda/targets.bzl

meta-codesync Bot changed the title ~~Make the CUDA caller-stream guard a shared extension/cuda library~~ Make the CUDA caller-stream guard a shared extension/cuda library (#20158) Jun 9, 2026

shoumikhin force-pushed the export-D108023495 branch from 32eadf3 to 6eaeaa8 Compare June 9, 2026 16:41

shoumikhin force-pushed the export-D108023495 branch from 6eaeaa8 to 4a14aa5 Compare June 9, 2026 17:16

Copilot AI review requested due to automatic review settings June 9, 2026 17:16

shoumikhin force-pushed the export-D108023495 branch 2 times, most recently from 15c98ef to 4a14aa5 Compare June 9, 2026 17:16

Copilot started reviewing on behalf of shoumikhin June 9, 2026 17:17 View session

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Comment thread extension/cuda/CMakeLists.txt Outdated

Comment thread extension/cuda/CMakeLists.txt

Comment thread extension/cuda/caller_stream.h Outdated

Copilot AI review requested due to automatic review settings June 9, 2026 17:29

shoumikhin force-pushed the export-D108023495 branch from 4a14aa5 to 90c4d03 Compare June 9, 2026 17:29

Copilot started reviewing on behalf of shoumikhin June 9, 2026 17:29 View session

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Comment thread extension/cuda/targets.bzl

shoumikhin force-pushed the export-D108023495 branch from 90c4d03 to 46c7e3e Compare June 9, 2026 17:49

shoumikhin added the ciflow/cuda label Jun 9, 2026

Copilot AI review requested due to automatic review settings June 10, 2026 04:28

shoumikhin force-pushed the export-D108023495 branch from 46c7e3e to a3a974e Compare June 10, 2026 04:28

Copilot started reviewing on behalf of shoumikhin June 10, 2026 04:28 View session

Copilot AI reviewed Jun 10, 2026

View reviewed changes

Comment thread extension/cuda/targets.bzl Outdated

shoumikhin force-pushed the export-D108023495 branch from a3a974e to 833360e Compare June 10, 2026 04:46

Gasoonjia approved these changes Jun 10, 2026

View reviewed changes

shoumikhin merged commit 0b13b6a into pytorch:main Jun 10, 2026
227 of 237 checks passed

shoumikhin mentioned this pull request Jun 10, 2026

Fix Windows CUDA build: guard extension_cuda compile options to CXX #20184

Merged

shoumikhin merged 1 commit into pytorch:main from shoumikhin:export-D108023495

3 participants
