executorch: derive the TensorRT delegate target_device from the engine's real device index #4329
Merged
lanluo-nvidia merged 2 commits on Jun 10, 2026
af53034 to 3371ec1
…e's real device index

TensorRTPartitioner hardcoded target_device=cuda:0 for every partition, so a cuda:N engine shipped a .pte whose delegate-boundary tensors were labeled cuda:0. The runtime still ran on the correct GPU (it reads the device from the engine blob), but ExecuTorch's device-aware memory planning reads this metadata to place buffers, so the label must be correct before that planning becomes the default.

Derive target_device per export from the engine's real device index, reusing the backend's own _get_engine_info_from_edge_program + _parse_device_id so the index cannot drift from the runtime blob. Fall back to cuda:0 when the program does not have exactly one engine node (multiple TRT partitions) or the index is unreadable; per-partition multi-GPU labeling is left to a follow-up. An explicit caller-provided target_device is used verbatim.

Also document that target_device is AOT-only metadata: the runtime selects the GPU from the serialized engine blob, not from this value.
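The selection rule in this commit message can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the PR: `derive_target_device` and its `engine_device_ids` argument are hypothetical stand-ins for the result of the backend's `_get_engine_info_from_edge_program` + `_parse_device_id` path, whose real signatures are not shown on this page.

```python
from typing import Optional, Sequence

DEFAULT_TARGET_DEVICE = b"cuda:0"


def derive_target_device(
    engine_device_ids: Sequence[Optional[int]],
    explicit: Optional[bytes] = None,
) -> bytes:
    """Pick the target_device label for one export (hypothetical helper)."""
    # An explicit caller-provided value is used verbatim.
    if explicit is not None:
        return explicit
    # Only the single-engine case is derived; zero or multiple engine
    # nodes (multiple TRT partitions) fall back to the default.
    if len(engine_device_ids) != 1:
        return DEFAULT_TARGET_DEVICE
    device_id = engine_device_ids[0]
    # An unreadable index is modeled here as None or a negative value.
    if device_id is None or device_id < 0:
        return DEFAULT_TARGET_DEVICE
    return f"cuda:{device_id}".encode()


print(derive_target_device([3]))             # b'cuda:3'
print(derive_target_device([0, 1]))          # b'cuda:0'
print(derive_target_device([None]))          # b'cuda:0'
print(derive_target_device([2], b"cuda:5"))  # b'cuda:5'
```

Checking the explicit override first keeps that path independent of engine-info extraction, so it stays unchanged.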
3371ec1 to f9892b2
Covers the per-export device derivation: target_device taken from the engine's real device index, a cuda:0 fallback when the engine info is unreadable (for example multiple TRT partitions), and an explicit caller-provided target_device used verbatim. The test is CPU-only: it monkeypatches the capability partitioner and engine-info extraction, matching the existing tests/py/dynamo/executorch style.
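The shape of such a CPU-only test can be sketched with the standard library alone. Everything named below is an assumption made for illustration: `backend.get_engine_device_ids` is a hypothetical stand-in for the engine-info extraction, and `pick_target_device` is a toy version of the rule under test, not the partitioner's code. The actual tests in tests/py/dynamo/executorch may be structured differently.

```python
import types
from unittest import mock

# Hypothetical stand-in for the backend module; the real extraction helper
# and its return type are not shown on this page.
backend = types.SimpleNamespace(get_engine_device_ids=lambda program: [0])


def pick_target_device(program, explicit=None):
    """Toy version of the rule under test (not the partitioner's code)."""
    if explicit is not None:
        return explicit
    try:
        ids = backend.get_engine_device_ids(program)
    except Exception:
        return b"cuda:0"  # unreadable engine info falls back
    if len(ids) != 1:
        return b"cuda:0"  # multiple TRT partitions fall back
    return f"cuda:{ids[0]}".encode()


def test_derived_from_engine():
    with mock.patch.object(backend, "get_engine_device_ids", return_value=[3]):
        assert pick_target_device(object()) == b"cuda:3"


def test_fallback_when_unreadable():
    with mock.patch.object(
        backend, "get_engine_device_ids", side_effect=ValueError("unreadable")
    ):
        assert pick_target_device(object()) == b"cuda:0"


def test_explicit_used_verbatim():
    with mock.patch.object(backend, "get_engine_device_ids", return_value=[3]):
        assert pick_target_device(object(), explicit=b"cuda:1") == b"cuda:1"


if __name__ == "__main__":
    for test in (
        test_derived_from_engine,
        test_fallback_when_unreadable,
        test_explicit_used_verbatim,
    ):
        test()
    print("ok")
```

Because the extraction is patched, no GPU or TensorRT engine is needed to exercise the three cases.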
f9892b2 to 7d40e29
What
`TensorRTPartitioner` hardcoded `CompileSpec("target_device", b"cuda:0")` for every partition, so an engine built for `cuda:N` shipped a `.pte` whose delegate-boundary tensors were labeled `cuda:0`. ExecuTorch's `PropagateDevicePass` serializes that into `extra_tensor_info.device_type`/`device_index`.

This derives `target_device` per export from the engine's real device index, reusing the backend's own `_get_engine_info_from_edge_program` + `_parse_device_id` so the index cannot drift from the runtime blob. It falls back to `cuda:0` when the program does not have exactly one engine node (multiple TRT partitions) or the index is unreadable; per-partition multi-GPU labeling is left to a follow-up. An explicit caller-provided `target_device` is used verbatim, unchanged.

It also documents that `target_device` is AOT-only metadata (the runtime selects the GPU from the engine blob).

Why now
The runtime already runs on the correct GPU (it reads the engine blob's `device_id`), so this is a latent metadata bug today. ExecuTorch is moving device-aware memory planning toward the default, and that planning reads `device_index` to place buffers; this makes the metadata correct before it becomes load-bearing.

Risk
Low. The common single-GPU `cuda:0` export is byte-identical (the derived value is still `cuda:0`); the only behavior change is that a single-engine non-zero-GPU export now labels the GPU the engine was actually built for. Any extraction failure falls back to the previous `cuda:0`. The explicit-override path is untouched.

Test plan
- `cuda:0`: `.pte` unchanged.
- `CompileSpec("target_device", b"cuda:3")`: used verbatim for every partition.
- Unreadable engine info (for example multiple TRT partitions): falls back to `cuda:0`.

Draft for review; pairs with the runtime PR #4328 and a device-memory-planning readiness test to follow.
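To make the metadata discussed under What concrete, the sketch below splits a `target_device` label into a device type and a device index, the two fields that end up in `extra_tensor_info`. It is illustrative only: `split_device_label` is a hypothetical helper, and treating a bare type as index 0 is an assumption of this sketch, not a description of how `PropagateDevicePass` parses the value.

```python
from typing import Tuple


def split_device_label(label: bytes) -> Tuple[str, int]:
    """Split a 'type:index' label into its two parts (illustrative only)."""
    device_type, _, index = label.decode().partition(":")
    # Assumption of this sketch: a bare type with no index means index 0.
    return device_type, (int(index) if index else 0)


print(split_device_label(b"cuda:0"))  # ('cuda', 0)
print(split_device_label(b"cuda:3"))  # ('cuda', 3)
```

With the old hardcoded value, an engine built for GPU 3 would have carried index 0 in this metadata; with the derived value it carries index 3.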