executorch: derive the TensorRT delegate target_device from the engine's real device index by shoumikhin · Pull Request #4329 · pytorch/TensorRT

shoumikhin · 2026-06-09T19:40:36Z

What

TensorRTPartitioner hardcoded CompileSpec("target_device", b"cuda:0") for every partition, so an engine built for cuda:N shipped a .pte whose delegate-boundary tensors were labeled cuda:0. ExecuTorch's PropagateDevicePass serializes that into extra_tensor_info.device_type/device_index.

This derives target_device per export from the engine's real device index, reusing the backend's own _get_engine_info_from_edge_program and _parse_device_id so the index cannot drift from the runtime blob. It falls back to cuda:0 when the program does not have exactly one engine node (for example, multiple TRT partitions) or when the index is unreadable; per-partition multi-GPU labeling is left to a follow-up. An explicit caller-provided target_device is used verbatim.
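
The selection order described above (explicit override first, then the single engine's device index, then the cuda:0 fallback) can be sketched roughly as follows. This is an illustration only, not the code in this PR: resolve_target_device and its arguments are hypothetical names, the CompileSpec dataclass is a local stand-in for ExecuTorch's key/bytes compile spec, and the list of device indices stands in for whatever the backend's engine-info helpers return.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

DEFAULT_TARGET_DEVICE = b"cuda:0"


@dataclass(frozen=True)
class CompileSpec:
    """Local stand-in for ExecuTorch's CompileSpec (a key and a bytes value)."""

    key: str
    value: bytes


def resolve_target_device(
    user_specs: Sequence[CompileSpec],
    engine_device_indices: Sequence[Optional[int]],
) -> bytes:
    """Pick the target_device label for one export (hypothetical helper).

    engine_device_indices holds one entry per TensorRT engine node found in
    the program; None marks an index that could not be read.
    """
    # 1. An explicit caller-provided target_device wins and is used verbatim.
    for spec in user_specs:
        if spec.key == "target_device":
            return spec.value

    # 2. Exactly one engine with a readable index: label that device.
    if len(engine_device_indices) == 1:
        index = engine_device_indices[0]
        if isinstance(index, int) and index >= 0:
            return f"cuda:{index}".encode()

    # 3. Zero or several engines, or an unreadable index: previous default.
    return DEFAULT_TARGET_DEVICE
```

Under these assumptions, a single engine built for GPU 2 yields b"cuda:2", two engines yield b"cuda:0", and an explicit b"cuda:3" spec is returned untouched regardless of the engine list.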

It also documents that target_device is AOT-only metadata (the runtime selects the GPU from the engine blob).

Why now

The runtime already runs on the correct GPU (it reads the engine blob's device_id), so this is a latent metadata bug today. ExecuTorch is moving device-aware memory planning toward the default, and that planning reads device_index to place buffers; this makes the metadata correct before it becomes load-bearing.

Risk

Low. The common single-GPU cuda:0 export is byte-identical (the derived value is still cuda:0); the only behavior change is that a single-engine non-zero-GPU export now labels the GPU the engine was actually built for. Any extraction failure falls back to the previous cuda:0. The explicit-override path is untouched.

Test plan

Single-engine export on cuda:0: .pte unchanged.
Single-engine export on a non-zero GPU: delegate-boundary tensors record that device index.
Explicit CompileSpec("target_device", b"cuda:3"): used verbatim for every partition.
Fallback: a program without exactly one engine node yields cuda:0.

Draft for review; pairs with the runtime PR #4328 and a device-memory-planning readiness test to follow.

…e's real device index

TensorRTPartitioner hardcoded target_device=cuda:0 for every partition, so a cuda:N engine shipped a .pte whose delegate-boundary tensors were labeled cuda:0. The runtime still ran on the correct GPU (it reads the device from the engine blob), but ExecuTorch's device-aware memory planning reads this metadata to place buffers, so the label needs to be correct once that planning becomes the default.

Derive target_device per export from the engine's real device index, reusing the backend's own _get_engine_info_from_edge_program + _parse_device_id so the index cannot drift from the runtime blob. Fall back to cuda:0 when the program does not have exactly one engine node (multiple TRT partitions) or the index is unreadable; per-partition multi-GPU labeling is left to a follow-up. An explicit caller-provided target_device is used verbatim, unchanged.

Also document that target_device is AOT-only metadata: the runtime selects the GPU from the serialized engine blob, not from this value.

Covers the per-export device derivation: target_device taken from the engine's real device index, a cuda:0 fallback when the engine info is unreadable (for example multiple TRT partitions), and an explicit caller-provided target_device used verbatim. CPU-only unit test that monkeypatches the capability partitioner and engine-info extraction, matching the existing tests/py/dynamo/executorch style.
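
A CPU-only test in the style described above might look roughly like this. It is a sketch under stated assumptions, not the test added by this PR: EngineInfo and derive_target_device are hypothetical stand-ins for the backend's engine-info extraction and the partitioner's derivation, and only unittest.mock from the standard library is used for the patching.

```python
from unittest import mock


class EngineInfo:
    """Hypothetical stand-in for the backend's engine-info extraction."""

    @staticmethod
    def device_index(program):
        # The real extraction needs a serialized TensorRT engine, so this
        # stand-in fails unless a test patches it.
        raise RuntimeError("no TensorRT engine available on this machine")


def derive_target_device(program) -> bytes:
    """Label the export with the engine's device, or cuda:0 on any failure."""
    try:
        index = EngineInfo.device_index(program)
    except Exception:
        return b"cuda:0"
    return f"cuda:{index}".encode()


def test_uses_engine_device_index():
    # Patch the extraction so no GPU or engine is needed.
    with mock.patch.object(EngineInfo, "device_index", return_value=2):
        assert derive_target_device(object()) == b"cuda:2"


def test_falls_back_when_extraction_fails():
    # An unreadable index must leave the previous cuda:0 label in place.
    with mock.patch.object(
        EngineInfo, "device_index", side_effect=ValueError("unreadable")
    ):
        assert derive_target_device(object()) == b"cuda:0"
```

Patching the extraction point rather than building an engine keeps the test runnable on machines without a GPU, which is the property the commit message calls out.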

meta-cla Bot added the cla signed label Jun 9, 2026

github-actions Bot added the component: api [Python] Issues re: Python API label Jun 9, 2026

github-actions Bot requested a review from zewenli98 June 9, 2026 19:41

github-actions Bot added the component: tests Issues re: Tests label Jun 9, 2026

shoumikhin force-pushed the et-trt-partitioner-device branch from af53034 to 3371ec1 Compare June 9, 2026 20:05

shoumikhin force-pushed the et-trt-partitioner-device branch from 3371ec1 to f9892b2 Compare June 9, 2026 20:23

shoumikhin force-pushed the et-trt-partitioner-device branch from f9892b2 to 7d40e29 Compare June 9, 2026 20:35

shoumikhin marked this pull request as ready for review June 9, 2026 20:55

lanluo-nvidia merged commit 440a430 into pytorch:main Jun 10, 2026
126 of 150 checks passed

lanluo-nvidia self-assigned this Jun 10, 2026

lanluo-nvidia merged 2 commits into pytorch:main from shoumikhin:et-trt-partitioner-device
