executorch: log once when the TensorRT delegate stages host I/O; document the device-handling contract #4328
…the device-handling contract

The ExecuTorch TensorRT delegate binds device-resident I/O pointers straight into the execution context (zero-copy). When an I/O tensor is host memory it silently stages through a per-call device allocation plus cudaMemcpyAsync, with no signal. That is an invisible performance cliff, and a silent regression hazard once device-aware memory planning is expected to keep delegate I/O on device.

Emit a single Info log the first time an engine stages host I/O. Also document, at execute()'s header, the delegate's deliberate device-handling contract (runtime pointer-sniffing instead of reading the AOT device metadata, self-managed host/device staging, and the engine-baked device_id as the runtime source of truth) so a future change does not "reconcile" it into breakage.
2660c91 to 9e64259
@narendasan @lanluo-nvidia, would appreciate a review when you have a chance. This adds a one-shot … The red CI is the current repo-wide Python-3.10 / Windows infra outage …
What
Two small, self-contained changes to the ExecuTorch TensorRT delegate runtime (cpp/.../executorch/TensorRTBackend.{h,cpp}):

1. Observability for the zero-copy path. execute() binds device-resident I/O straight into the execution context (zero-copy) and silently stages host-resident I/O through a per-call device allocation + cudaMemcpyAsync, with no signal. This adds a one-shot ET_LOG(Info) (guarded by a new staged_warned flag on EngineHandle, read/written under the existing mu lock) the first time an engine stages host I/O, so a caller intending device-resident, zero-copy I/O can tell when they have fallen off the fast path. Purely additive.

2. Document the device-handling contract. A WHY-only comment at execute()'s header records three intentional choices that diverge from the CUDA/AOTI delegate: (a) the runtime ignores the AOT target_device metadata and sniffs pointers at runtime (so asserting device_type would wrongly reject host inputs today), (b) it stages H2D/D2H itself rather than via ExecuTorch device-copy ops / a DeviceAllocator, and (c) the engine-baked device_id is the runtime source of truth for the GPU.

Why now
ExecuTorch is moving on-device memory planning toward the default. Once that lands, device-aware planning is meant to keep delegate I/O on device; a regression that reinserts a host op between two GPU delegates would silently restore copies. The one-shot log makes that visible, and the contract comment keeps the intentional divergences from being "fixed" into breakage.
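
To make the regression concrete, here is a minimal model of the per-tensor binding decision, written for this description and not taken from the delegate source. `MemKind`, `IoTensor`, `BindPlan`, and `plan_bindings` are hypothetical names; `MemKind` stands in for whatever the runtime pointer query reports (a call such as `cudaPointerGetAttributes` is one way to obtain it, and whether the delegate uses exactly that call is an assumption).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the result of a runtime pointer query;
// not a type from the delegate.
enum class MemKind { Host, Device };

struct IoTensor {
  MemKind kind;        // where the buffer currently lives
  std::size_t nbytes;  // payload size in bytes
};

struct BindPlan {
  std::size_t zero_copy = 0;     // tensors bound directly into the context
  std::size_t staged = 0;        // tensors needing a device temp plus a copy
  std::size_t staged_bytes = 0;  // bytes copied on every call
};

// Device pointers bind directly; host pointers are staged through a
// per-call device buffer, which is the slow path the log reports.
BindPlan plan_bindings(const std::vector<IoTensor>& io) {
  BindPlan plan;
  for (const IoTensor& t : io) {
    if (t.kind == MemKind::Device) {
      ++plan.zero_copy;
    } else {
      ++plan.staged;
      plan.staged_bytes += t.nbytes;
    }
  }
  return plan;
}
```

In this model, a host op inserted between two GPU delegates turns the downstream delegate's inputs from `MemKind::Device` into `MemKind::Host`, so `staged` becomes nonzero without any change to the delegate itself.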
Risk
Low. (1) is additive logging guarded by a flag under the existing lock; (2) is a comment. No fast-path or control-flow change.
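
The flag-under-lock guard in (1) can be sketched roughly as below. This is an illustration under assumptions, not the patch itself: `EngineHandle` is reduced to the two members named above, `should_log_staging` is a hypothetical helper, and the sketch takes the lock itself, whereas the real execute() may already hold it at that point.

```cpp
#include <mutex>

// Hypothetical reduction of the handle: only the mutex and the new flag.
struct EngineHandle {
  std::mutex mu;
  bool staged_warned = false;
};

// Returns true exactly once per handle: on the first call that staged
// host I/O. The caller would emit the Info log when this returns true.
bool should_log_staging(EngineHandle& h, bool staged_this_call) {
  if (!staged_this_call) {
    return false;  // zero-copy call: nothing to report, flag untouched
  }
  std::lock_guard<std::mutex> lock(h.mu);
  if (h.staged_warned) {
    return false;  // already reported for this engine
  }
  h.staged_warned = true;
  return true;
}
```

A call site would look something like `if (should_log_staging(handle, staged)) ET_LOG(Info, ...);`. Because the flag lives on the handle, each engine reports at most once, and calls that stay on the zero-copy path never touch the flag.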
Test plan
execute().

Draft for review; part of a small device-support readiness series (a partitioner-side target_device fix + a readiness test follow).