
executorch: derive the TensorRT delegate target_device from the engine's real device index #4329

Merged
lanluo-nvidia merged 2 commits into pytorch:main from shoumikhin:et-trt-partitioner-device on Jun 10, 2026

@shoumikhin

What

TensorRTPartitioner hardcoded CompileSpec("target_device", b"cuda:0") for every partition, so an engine built for cuda:N shipped a .pte whose delegate-boundary tensors were labeled cuda:0. ExecuTorch's PropagateDevicePass serializes that into extra_tensor_info.device_type/device_index.

This derives target_device per export from the engine's real device index, reusing the backend's own _get_engine_info_from_edge_program and _parse_device_id so the index cannot drift from the runtime blob. It falls back to cuda:0 when the program does not have exactly one engine node (for example, multiple TRT partitions) or the index is unreadable; per-partition multi-GPU labeling is left to a follow-up. An explicit caller-provided target_device is still used verbatim; that path is unchanged.
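The selection rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `CompileSpec` here is a local stand-in for ExecuTorch's key/bytes pair, and `derive_target_device` and `resolve_compile_specs` are hypothetical names. The real change obtains the index through _get_engine_info_from_edge_program and _parse_device_id, whose signatures are not shown in this description, so the sketch takes the already-parsed indices as input.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class CompileSpec:
    """Local stand-in (assumption) for ExecuTorch's CompileSpec: a key and a bytes value."""
    key: str
    value: bytes


DEFAULT_TARGET_DEVICE = b"cuda:0"


def derive_target_device(engine_device_ids: Sequence[Optional[int]]) -> bytes:
    """Hypothetical helper: one entry per TRT engine node, None if unreadable."""
    # Zero or several engine nodes: no single answer, so keep the old default.
    if len(engine_device_ids) != 1:
        return DEFAULT_TARGET_DEVICE
    device_id = engine_device_ids[0]
    # An unreadable or nonsensical index also falls back to the old default.
    if isinstance(device_id, bool) or not isinstance(device_id, int) or device_id < 0:
        return DEFAULT_TARGET_DEVICE
    return f"cuda:{device_id}".encode()


def resolve_compile_specs(
    user_specs: Sequence[CompileSpec],
    engine_device_ids: Sequence[Optional[int]],
) -> List[CompileSpec]:
    """Hypothetical helper: an explicit target_device wins, otherwise derive one."""
    if any(spec.key == "target_device" for spec in user_specs):
        return list(user_specs)
    derived = CompileSpec("target_device", derive_target_device(engine_device_ids))
    return list(user_specs) + [derived]


print(derive_target_device([3]))      # b'cuda:3'
print(derive_target_device([1, 2]))   # b'cuda:0'
print(derive_target_device([None]))   # b'cuda:0'
```

Every failure path returns the previous constant, which is why the common cuda:0 export stays unchanged in this sketch.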

It also documents that target_device is AOT-only metadata (the runtime selects the GPU from the engine blob).

Why now

The runtime already runs on the correct GPU (it reads the engine blob's device_id), so this is a latent metadata bug today. ExecuTorch is moving device-aware memory planning toward the default, and that planning reads device_index to place buffers; this makes the metadata correct before it becomes load-bearing.
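To illustrate why the label matters once planning depends on it, here is a toy model of a planner that sizes one arena per device from tensor metadata. It is not ExecuTorch's memory planner; `TensorInfo` and `plan_arena_sizes` are hypothetical names, and only the device_type/device_index fields mirror the metadata mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical record of a delegate-boundary tensor's serialized metadata."""
    name: str
    nbytes: int
    device_type: str
    device_index: int


def plan_arena_sizes(tensors: Iterable[TensorInfo]) -> Dict[Tuple[str, int], int]:
    """Toy planner: total bytes to reserve per (device_type, device_index)."""
    sizes: Dict[Tuple[str, int], int] = defaultdict(int)
    for tensor in tensors:
        sizes[(tensor.device_type, tensor.device_index)] += tensor.nbytes
    return dict(sizes)


# Engine built for cuda:2, but the boundary tensors carry the old hardcoded label.
mislabeled = [TensorInfo("x", 4096, "cuda", 0), TensorInfo("y", 1024, "cuda", 0)]
# Same tensors with the label derived from the engine's device index.
corrected = [TensorInfo("x", 4096, "cuda", 2), TensorInfo("y", 1024, "cuda", 2)]

print(plan_arena_sizes(mislabeled))  # {('cuda', 0): 5120}
print(plan_arena_sizes(corrected))   # {('cuda', 2): 5120}
```

In this model the mislabeled export reserves its buffers on GPU 0 while the engine executes on GPU 2, which is the mismatch the metadata fix is meant to prevent.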

Risk

Low. The common single-GPU cuda:0 export is byte-identical (the derived value is still cuda:0); the only behavior change is that a single-engine non-zero-GPU export now labels the GPU the engine was actually built for. Any extraction failure falls back to the previous cuda:0. The explicit-override path is untouched.

Test plan

  • Single-engine export on cuda:0: .pte unchanged.
  • Single-engine export on a non-zero GPU: delegate-boundary tensors record that device index.
  • Explicit CompileSpec("target_device", b"cuda:3"): used verbatim for every partition.
  • Fallback: a program without exactly one engine node yields cuda:0.

Draft for review; pairs with the runtime PR #4328 and a device-memory-planning readiness test to follow.

@meta-cla meta-cla Bot added the cla signed label Jun 9, 2026
@github-actions github-actions Bot added the component: api [Python] Issues re: Python API label Jun 9, 2026
@github-actions github-actions Bot requested a review from zewenli98 June 9, 2026 19:41
@github-actions github-actions Bot added the component: tests Issues re: Tests label Jun 9, 2026
@shoumikhin shoumikhin force-pushed the et-trt-partitioner-device branch from af53034 to 3371ec1 Compare June 9, 2026 20:05
…e's real device index

TensorRTPartitioner hardcoded target_device=cuda:0 for every partition, so a cuda:N engine shipped a .pte whose delegate-boundary tensors were labeled cuda:0. The runtime still ran on the correct GPU (it reads the device from the engine blob), but ExecuTorch's device-aware memory planning reads this metadata to place buffers, so the label needs to be correct once that planning becomes the default.

Derive target_device per export from the engine's real device index, reusing the backend's own _get_engine_info_from_edge_program + _parse_device_id so the index cannot drift from the runtime blob. Fall back to cuda:0 when the program does not have exactly one engine node (multiple TRT partitions) or the index is unreadable; per-partition multi-GPU labeling is left to a follow-up. An explicit caller-provided target_device is used verbatim, unchanged.

Also document that target_device is AOT-only metadata: the runtime selects the GPU from the serialized engine blob, not from this value.
@shoumikhin shoumikhin force-pushed the et-trt-partitioner-device branch from 3371ec1 to f9892b2 Compare June 9, 2026 20:23
Covers the per-export device derivation: target_device taken from the engine's
real device index, a cuda:0 fallback when the engine info is unreadable (for
example multiple TRT partitions), and an explicit caller-provided target_device
used verbatim. CPU-only unit test that monkeypatches the capability partitioner
and engine-info extraction, matching the existing tests/py/dynamo/executorch
style.
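A CPU-only test in the style this commit message describes might look like the sketch below. It is an assumption-laden illustration rather than the test added by this PR: `backend`, `get_engine_info`, and `pick_target_device` are hypothetical stand-ins, and the patched function replaces the engine-info extraction so that no GPU or TensorRT engine is needed.

```python
import types
from unittest import mock


def _needs_real_engine(edge_program):
    # Without patching, extraction cannot succeed on a CPU-only machine.
    raise RuntimeError("requires a real TensorRT engine")


# Hypothetical stand-in for the backend module that owns engine-info extraction.
backend = types.SimpleNamespace(get_engine_info=_needs_real_engine)


def pick_target_device(edge_program) -> bytes:
    """Hypothetical function under test: derive the label, else fall back to cuda:0."""
    try:
        info = backend.get_engine_info(edge_program)
        return f"cuda:{int(info['device_id'])}".encode()
    except Exception:
        return b"cuda:0"


def test_device_taken_from_engine_info():
    # Patch the extraction to report an engine built for GPU 2.
    with mock.patch.object(backend, "get_engine_info", return_value={"device_id": 2}):
        assert pick_target_device(object()) == b"cuda:2"


def test_fallback_when_engine_info_unreadable():
    # Simulate a program with more than one engine node by raising.
    with mock.patch.object(backend, "get_engine_info", side_effect=ValueError("two engines")):
        assert pick_target_device(object()) == b"cuda:0"


test_device_taken_from_engine_info()
test_fallback_when_engine_info_unreadable()
```

Patching the extraction step keeps the test independent of the hardware it runs on, so the derivation and fallback branches can both be exercised in CI without a GPU.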
@shoumikhin shoumikhin force-pushed the et-trt-partitioner-device branch from f9892b2 to 7d40e29 Compare June 9, 2026 20:35
@shoumikhin shoumikhin marked this pull request as ready for review June 9, 2026 20:55
@lanluo-nvidia lanluo-nvidia merged commit 440a430 into pytorch:main Jun 10, 2026
126 of 150 checks passed
@lanluo-nvidia lanluo-nvidia self-assigned this Jun 10, 2026
