[ET Device Support] CUDA-native Qwen 3.5 MoE inference with device tensor pipeline#18788

Open
Gasoonjia wants to merge 7 commits into
gh/gasoonjia/164/basefrom
gh/gasoonjia/164/head

Conversation

@Gasoonjia

@Gasoonjia Gasoonjia commented Apr 9, 2026

Contributor

Stack from ghstack (oldest at bottom):

Integrate the ET device tensor pipeline into the Qwen 3.5 MoE model to
eliminate unnecessary H2D/D2H copies during inference.

  • Export: Multi-method export (forward + sample) with device memory
    planning enabled and method-level H2D/D2H skipping.
  • Runner: Custom CUDA-native inference loop that keeps logits on GPU
    between forward and sample, reuses CUDA tensors across iterations,
    and only copies the 8-byte token ID back to CPU for EOS checking.
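To make the runner's claim concrete, here is a rough sketch of the decode loop as a pure-Python simulation that counts the bytes crossing the host/device boundary. It is not the ExecuTorch runner or its API: `forward`, `sample`, `decode`, `TransferLog`, the vocabulary size, and the EOS ID are all hypothetical stand-ins, and the token upload for the next step is not modeled.

```python
import struct

VOCAB = 1000   # hypothetical vocabulary size, not the real model's
EOS_ID = 7     # hypothetical end-of-sequence token ID


class TransferLog:
    """Counts bytes that cross the simulated host/device boundary."""

    def __init__(self):
        self.h2d_bytes = 0
        self.d2h_bytes = 0


def forward(token_id):
    # Stand-in for the exported `forward` method: returns "device" logits.
    # Toy rule: the next token is always token_id + 1 (mod VOCAB).
    logits = [0.0] * VOCAB
    logits[(token_id + 1) % VOCAB] = 1.0
    return logits


def sample(logits):
    # Stand-in for the exported `sample` method: greedy argmax.
    return max(range(len(logits)), key=logits.__getitem__)


def decode(first_token, max_new_tokens, keep_logits_on_device):
    log = TransferLog()
    token = first_token
    out = []
    for _ in range(max_new_tokens):
        logits = forward(token)
        if not keep_logits_on_device:
            # Baseline: float32 logits go device -> host -> device
            # between `forward` and `sample`.
            nbytes = 4 * len(logits)
            log.d2h_bytes += nbytes
            log.h2d_bytes += nbytes
        token = sample(logits)
        # Only the int64 token ID (8 bytes) is copied back for the EOS check.
        log.d2h_bytes += struct.calcsize("<q")
        out.append(token)
        if token == EOS_ID:
            break
    return out, log
```

With these toy numbers, four decode steps move 32 bytes device-to-host when logits stay on the device, versus 16,032 bytes device-to-host plus 16,000 bytes host-to-device in the baseline. The real saving depends on the actual vocabulary size and logits dtype, which this PR description does not state.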

Differential Revision: D100133933

@pytorch-bot

pytorch-bot Bot commented Apr 9, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18788

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Unrelated Failure, 1 Unclassified Failure

As of commit e6785b1 with merge base b47e588 (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

  • Test CUDA Builds / unittest-cuda / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
    examples/models/qwen3_5_moe/test_sampler.py::TestSampler::test_output_shape_and_dtype

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Apr 9, 2026
Gasoonjia added a commit that referenced this pull request Apr 9, 2026
…nsor pipeline

Differential Revision: [D100133933](https://our.internmc.facebook.com/intern/diff/D100133933/)

ghstack-source-id: 364764771
Pull Request resolved: #18788
@github-actions

github-actions Bot commented Apr 9, 2026


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Gasoonjia added a commit that referenced this pull request Apr 9, 2026
…nsor pipeline

Pull Request resolved: #18788

ghstack-source-id: 364908062
@exported-using-ghexport

Differential Revision: [D100133933](https://our.internmc.facebook.com/intern/diff/D100133933/)
[ghstack-poisoned]
Gasoonjia added a commit that referenced this pull request May 27, 2026
…nsor pipeline

Pull Request resolved: #18788

ghstack-source-id: 386793196
@exported-using-ghexport

Differential Revision: [D100133933](https://our.internmc.facebook.com/intern/diff/D100133933/)
Gasoonjia added a commit that referenced this pull request May 28, 2026
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)
(oldest at bottom):
* #18788
* #18762
* #19828
* #19582
* #18760
* #18730
* __->__ #18729
* #18761

Implement C++ runtime kernels for device copy ops using DeviceAllocator:
- h2d_copy_out: infers device from out tensor, calls
  DeviceAllocator::copy_host_to_device
- d2h_copy_out: infers device from self tensor, calls
  DeviceAllocator::copy_device_to_host
- Registered via EXECUTORCH_LIBRARY macro
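The device-inference rule in the list above can be modeled in a few lines. This is a Python toy, not the C++ kernels from that commit: the `Tensor` class, the `ALLOCATORS` registry, and the argument order of the copy methods are assumptions made for illustration, and only the names `h2d_copy_out`, `d2h_copy_out`, `copy_host_to_device`, and `copy_device_to_host` come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Tensor:
    device: str                       # "cpu" or a device name such as "cuda"
    data: list = field(default_factory=list)


class DeviceAllocator:
    """Toy stand-in for a per-device allocator with two copy entry points."""

    def __init__(self, name):
        self.name = name
        self.calls = []               # records which copy direction ran

    def copy_host_to_device(self, dst, src):
        self.calls.append("h2d")
        dst.data = list(src.data)

    def copy_device_to_host(self, dst, src):
        self.calls.append("d2h")
        dst.data = list(src.data)


# Hypothetical registry keyed by device name.
ALLOCATORS = {"cuda": DeviceAllocator("cuda")}


def h2d_copy_out(self_tensor, out):
    # The destination lives on the device, so the device comes from `out`.
    ALLOCATORS[out.device].copy_host_to_device(out, self_tensor)
    return out


def d2h_copy_out(self_tensor, out):
    # The source lives on the device, so the device comes from `self`.
    ALLOCATORS[self_tensor.device].copy_device_to_host(out, self_tensor)
    return out
```

The point of the asymmetry is that exactly one side of each copy is a device tensor, so that side is the only one that can identify which allocator to use.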

Differential Revision:
[D99636776](https://our.internmc.facebook.com/intern/diff/D99636776/)
[ghstack-poisoned]
Gasoonjia added a commit that referenced this pull request Jun 11, 2026
…nsor pipeline

Pull Request resolved: #18788

ghstack-source-id: 392289733
@exported-using-ghexport

Differential Revision: [D100133933](https://our.internmc.facebook.com/intern/diff/D100133933/)
[ghstack-poisoned]
Gasoonjia added a commit that referenced this pull request Jun 11, 2026
…nsor pipeline

Pull Request resolved: #18788

ghstack-source-id: 392547561
@exported-using-ghexport

Differential Revision: [D100133933](https://our.internmc.facebook.com/intern/diff/D100133933/)
[ghstack-poisoned]
Gasoonjia added a commit that referenced this pull request Jun 12, 2026
…nsor pipeline

Pull Request resolved: #18788

ghstack-source-id: 392736326
@exported-using-ghexport

Differential Revision: [D100133933](https://our.internmc.facebook.com/intern/diff/D100133933/)
[ghstack-poisoned]
Gasoonjia added a commit that referenced this pull request Jun 12, 2026
…nsor pipeline

Pull Request resolved: #18788

ghstack-source-id: 392754115
@exported-using-ghexport

Differential Revision: [D100133933](https://our.internmc.facebook.com/intern/diff/D100133933/)
Gasoonjia added a commit that referenced this pull request Jun 12, 2026
…nsor pipeline

Pull Request resolved: #18788

ghstack-source-id: 392892796
@exported-using-ghexport

Differential Revision: [D100133933](https://our.internmc.facebook.com/intern/diff/D100133933/)
Gasoonjia added a commit that referenced this pull request Jun 12, 2026
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0)
(oldest at bottom):
* #18788
* __->__ #20214
* #19828
* #19582
* #18760
* #18730
* #18761

This diff turns on on-device memory planning by default, so that every
delegate that enables on-device memory planning will use it.

Also updates the CUDA backend to remove H2D/D2H copies and extra caches.

Differential Revision:
[D107597774](https://our.internmc.facebook.com/intern/diff/D107597774/)

Labels

ciflow/cuda, CLA Signed, fb-exported, meta-exported
