Add TensorRT weight streaming support to the ExecuTorch delegate #4336
shoumikhin wants to merge 1 commit into
Good call, and I agree the budget should be a runtime setting. Here is what the PR does today, and a change that gives you exactly what you are asking for.
What happens today
So the only thing fixed at export is the explicit number, and you are right that a fixed byte count is really a per-deployment choice.
Proposed change (a real load-time API)
Why keep the export-time value too
ExecuTorch's Python and Android load paths do not expose backend options yet (only C++ [...]
Changing it after load (the "or later" part)
This is doable as a follow-up, not in this PR. TensorRT requires destroying and recreating the execution context to change the budget, and ExecuTorch's post-load [...]
One question so I build the right thing
Will these large models be served mainly from the C++ runtime or from Python? If C++, the load-time option covers it and we can treat the baked value as just a default. If Python, we need to keep the baked value until ExecuTorch exposes backend options to Python, which I am happy to help add upstream.
Apply a TensorRT weight streaming budget in the delegate init(), after the engine
is deserialized and before the execution context is created (the budget cannot be
changed once a context exists). When the engine was built with weight streaming,
the delegate applies TensorRT's automatic budget by default, mirroring the PyTorch
runtimes.
An explicit budget can be set two ways, in order of precedence: a load-time
ExecuTorch backend option ("weight_streaming_budget" runtime spec passed to
Module::load), or the same key baked into the .pte at export via
torch_tensorrt.save(output_format="executorch", weight_streaming_budget=N). The
load-time option lets a deployment size the budget for its own GPU without
re-exporting; the export-time value is the default and the only channel for
loaders that cannot pass backend options yet (Python, Android). The value is a
non-negative decimal byte count, string-encoded on the wire.
Non-streamable engines make no budget call, so existing programs are unchanged.
The one intended behavior change is that a streamable engine now applies the
automatic budget on load, which enables running models whose weights exceed GPU
memory.
Ref pytorch#4334
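The precedence and gating described in this commit message can be sketched as a small pure function. This is a minimal illustration, not the delegate's actual code: `BudgetDecision` and `resolve_budget` are hypothetical names, and the streamable size is passed in as a plain integer instead of being queried from a TensorRT engine.

```cpp
#include <cassert>   // used only by the usage checks
#include <cstdint>
#include <optional>

// Hypothetical result type: what the delegate would do with the budget.
struct BudgetDecision {
    bool apply;           // false when the engine has no streamable weights
    bool is_explicit;     // true when a caller-supplied byte count is used
    std::uint64_t bytes;  // meaningful only when is_explicit is true
};

// Hypothetical helper mirroring the described order:
// load-time option, then the value baked at export, then automatic.
inline BudgetDecision resolve_budget(
    std::uint64_t streamable_weights_size,
    std::optional<std::uint64_t> load_time_option,
    std::optional<std::uint64_t> baked_value) {
    // Non-streamable engines make no budget call at all.
    if (streamable_weights_size == 0) {
        return {false, false, 0};
    }
    if (load_time_option.has_value()) {
        return {true, true, *load_time_option};
    }
    if (baked_value.has_value()) {
        return {true, true, *baked_value};
    }
    // Neither source is set: fall back to the automatic budget.
    return {true, false, 0};
}
```

Keeping the decision separate from the TensorRT call means the precedence can be unit tested without a GPU. The actual budget call would still have to happen before the execution context is created.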
cehongwang left a comment
Overall OK. Some minor comments
      return Error::InvalidProgram;
    }
    is_explicit = true;
  }
What happens if len = 0? Please give the user a warning in that case.
resolved_compile_specs = _resolve_executorch_compile_specs(
    exp_program,
    list(executorch_compile_specs),
    kwargs.get("weight_streaming_budget"),
Here we expect users to set the total budget, right? If there is a graph break, is there a correct mapping to the per-engine budget?
Summary
Add TensorRT weight streaming support to the Torch-TensorRT ExecuTorch delegate, so a model whose weights do not all fit in GPU memory can run when exported to an ExecuTorch program. Ref #4334.
Torch-TensorRT already builds a weight streamable engine when you compile with enable_weight_streaming=True, but the ExecuTorch delegate never set a budget on the engine at load time, so large models could not stream. This change sets the budget in the delegate init(), after the engine is deserialized and before the execution context is created, which is the same pattern the other Torch-TensorRT runtimes already use.
How it works
By default the delegate applies TensorRT's automatic budget, computed at load time from the free memory on the actual GPU, gated on getStreamableWeightsSize() > 0. So an engine built with enable_weight_streaming=True runs out of the box and adapts to the deploy device. Nothing is baked into the .pte for this default case.
An explicit budget is a non-negative number of bytes and can be set two ways, in order of precedence:
- A load-time backend option weight_streaming_budget, passed by the caller via Module::load(LoadBackendOptionsMap) and read in init() with BackendInitContext::get_runtime_spec. This lets a deployment size the budget for its own GPU without re-exporting. It is the same load-time pattern CoreML and XNNPACK use.
- The same key baked into the .pte via torch_tensorrt.save(output_format="executorch", weight_streaming_budget=N). This is used when no load-time option is given, and it is the only channel for loaders that cannot pass backend options yet (the ExecuTorch Python and Android runtimes).
Resolution order in the delegate is: load-time option, then the baked value, then automatic. The value is a decimal string on the wire because ExecuTorch's typed integer option is only 32 bit and a byte budget can exceed 2 GB.
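To illustrate why the budget travels as a decimal string, here is a minimal sketch of parsing such a string into a 64-bit byte count with std::from_chars. It is not the PR's WeightStreamingBudget parser: `parse_budget_bytes` is a hypothetical name, and rejecting an empty string is an assumption of this sketch, not confirmed behavior of the delegate.

```cpp
#include <cassert>   // used only by the usage checks
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: accepts only a non-negative decimal integer that
// fits in 64 bits. Returns std::nullopt for anything else.
inline std::optional<std::uint64_t> parse_budget_bytes(std::string_view text) {
    // Assumption of this sketch: an empty value is treated as invalid.
    if (text.empty()) {
        return std::nullopt;
    }
    std::uint64_t value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    // For unsigned targets std::from_chars does not accept a sign and does
    // not skip whitespace, so "-1", "+1" and " 1" all fail to parse.
    const auto [ptr, ec] = std::from_chars(first, last, value, 10);
    // Reject overflow, non-digit input, and trailing characters.
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;
    }
    return value;
}
```

A value such as "4294967296" (4 GiB) parses here but would not fit in a signed 32-bit option, which is the limitation the string encoding works around.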
Changes
- TensorRTBackend::init applies the budget via setWeightStreamingBudgetV2 before creating the execution context, gated on getStreamableWeightsSize() > 0. It resolves the budget as load-time runtime spec, then baked compile spec, then automatic.
- WeightStreamingBudget parser (cpp/include and cpp/src), unit tested without a GPU. It uses std::from_chars and accepts only a non-negative decimal integer.
- save(..., weight_streaming_budget=...) writes the export-time default; validation lives in _compile.py. Passing the budget through compile_specs is rejected in favor of the keyword argument.
Requirements
The load-time override uses ExecuTorch's BackendInitContext::get_runtime_spec and LoadBackendOptionsMap. The export-time default and the automatic budget work without it.
Backward compatibility
- [...] .pte.
- [...] .pte format or the engine blob.
Edge cases
- [...] Error::InvalidProgram).
- [...] weight_streaming_budget as None for multi-engine models.
Status and validation
Follow-ups
- [...] LoadBackendOptionsMap in ExecuTorch's Python (and Android) runtime bindings so non-C++ loaders can also set the budget at load. Until then the export-time default covers them.
- [...] set_option API. This needs the execution context to be destroyed and recreated, like TRTEngine::set_device_memory_budget, so it is deferred.