Add LFM2.5-VL export with CUDA/AOTI backend #18823
vincentzed wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18823
Note: Links to docs will display an error until the docs builds have been completed. ❗ There is 1 currently active SEV. If your PR is affected, please view it below:
Hi @vincentzed! Thank you for your pull request and welcome to our community. Action Required: in order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. Process: in order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations; afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
This PR needs a
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Export LFM2.5-VL (450M and 1.6B) as a multi-method PTE with three methods: vision_encoder, token_embedding, and text_decoder, all delegated to the CUDA/AOTI backend.

Key changes:
- examples/models/lfm2_5_vl/: new model, weight converter, and export script for LFM2.5-VL on CUDA
- examples/models/lfm2/short_conv.py: dual state management: state-as-IO for CUDA/AOTI (via attn_options["conv_states"]) with a register_buffer fallback for XNNPACK/portable backends
- examples/models/llama/llama_transformer.py: pass layer_idx to ShortConvBlock for per-layer conv-state keying
- exir/emit/_emitter.py: copy CUDA tensor storage to CPU before the ctypes pointer read to prevent a segfault during serialization

Tested on NVIDIA B300: 333-400 decode tok/s, 435-454 prefill tok/s, with coherent generation on text-only and vision-language prompts. Also compatible with the llama_main C++ runner.
e61e728 to baf48bb
Hello @Gasoonjia. I realize there is no CC list. Could you help review this, or point me to the right person? Thanks in advance!
Pull request overview
Adds an ExecuTorch export path for LiquidAI’s LFM2.5-VL models targeting the CUDA/AOTI backend, including a new multi-method PTE export pipeline and the required model/runtime adaptations (notably conv-state handling for hybrid conv/attention layers).
Changes:
- Introduces examples/models/lfm2_5_vl/ (model wrapper, HF->ET weight remap, export script, and configs) to export vision_encoder, token_embedding, and text_decoder methods.
- Updates LFM2 short-conv blocks to support "state-as-IO" via attn_options["conv_states"] and adds layer_idx wiring from the transformer constructor.
- Fixes constant serialization in the emitter by copying non-CPU storages to CPU before reading bytes.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| exir/emit/_emitter.py | Ensures constant tensor storage is moved to CPU before byte-serialization when storage is on a non-CPU device. |
| examples/models/llama/llama_transformer.py | Passes layer_idx into ShortConvBlock for per-layer conv state mapping. |
| examples/models/lfm2/short_conv.py | Refactors short conv to support explicit conv-state IO for AOTI and implements a manual depthwise conv path. |
| examples/models/lfm2_5_vl/model.py | Adds an ExecuTorch-friendly LFM2.5-VL wrapper integrating HF vision tower + ET text transformer. |
| examples/models/lfm2_5_vl/export_lfm2_5_vl.py | Adds CUDA/AOTI multi-method export pipeline producing PTE + PTD blob. |
| examples/models/lfm2_5_vl/convert_weights.py | Adds a HF->ET key remapping utility for text-decoder weights. |
| examples/models/lfm2_5_vl/config/lfm2_5_vl_450m_config.json | ModelArgs config for 450M variant. |
| examples/models/lfm2_5_vl/config/lfm2_5_vl_1_6b_config.json | ModelArgs config for 1.6B variant. |
| examples/models/lfm2_5_vl/__init__.py | Exposes Lfm2p5VlModel and convert_weights from the new package. |
```python
if attn_options is not None and "conv_states" in attn_options:
    if conv_state is not None:
        conv_state.copy_(new_conv_state)
    states = dict(attn_options["conv_states"])
    states[self.layer_idx] = new_conv_state
    update["conv_states"] = states
```
When attn_options contains conv_states, this block mutates the provided state via conv_state.copy_(...) but then stores new_conv_state (a freshly allocated tensor from cat) back into the returned conv_states dict. In Transformer._forward_layers, that returned dict is merged into attn_options_, so the next layer call will read a non-static tensor and can break the intended AOTI "static address" state path. Also, dict(attn_options["conv_states"]) will throw if the key exists but the value is None. Consider: (1) reading conv_states = attn_options.get("conv_states") and ensuring it’s a dict before copying, and (2) if conv_state is provided, keep that same tensor in the returned mapping (after the in-place update) rather than replacing it with new_conv_state.
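The suggested fix can be sketched in plain Python, with lists standing in for tensors (in-place `copy_` becomes slice assignment); the helper name `merge_conv_state` is hypothetical, not from the PR:

```python
def merge_conv_state(attn_options, layer_idx, conv_state, new_conv_state):
    # (1) Read via .get() and verify the value is a dict, so a None value
    #     cannot crash dict(...); (2) after the in-place update, keep the
    #     SAME tensor object in the returned mapping, preserving the static
    #     address the AOTI path relies on.
    states = attn_options.get("conv_states") if attn_options else None
    if not isinstance(states, dict):
        return {}
    out = dict(states)
    if conv_state is not None:
        conv_state[:] = new_conv_state  # in-place, like Tensor.copy_
        out[layer_idx] = conv_state     # same object, not new_conv_state
    else:
        out[layer_idx] = new_conv_state
    return {"conv_states": out}
```

The key property is that the object stored under `layer_idx` is identical to the one the caller passed in, so the next layer reads the statically-addressed state rather than a freshly allocated tensor.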
```python
# Manual depthwise conv — Triton has no template for nn.Conv1d
# with groups=dim and dynamic sequence length.
w = self.conv.weight[:, 0, :]
conv_out = Bx[..., :-2] * w[:, 0:1] + Bx[..., 1:-1] * w[:, 1:2] + Bx[..., 2:] * w[:, 2:3]
y = self.out_proj((C * conv_out).transpose(-1, -2).contiguous())
```

```python
def reset_cache(self):
    self.conv_state.zero_()
```
ShortConv.forward implements the convolution manually using self.conv.weight, but it ignores self.conv.bias when bias=True. This makes the bias argument silently incorrect. Either add the bias term to conv_out or enforce bias=False (e.g., via an assertion and/or by removing the parameter) to avoid surprising behavior.
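A minimal pure-Python model of the kernel-3 tap sum shows where the bias term belongs; the helper name and the list-based single-channel tensors are illustrative only:

```python
def depthwise_conv3(x_padded, w, bias=0.0):
    # Mirrors Bx[..., :-2]*w[:, 0:1] + Bx[..., 1:-1]*w[:, 1:2] + Bx[..., 2:]*w[:, 2:3]
    # for one channel, with the (previously missing) bias added per output
    # position — exactly where self.conv.bias would be applied.
    return [
        x_padded[i] * w[0] + x_padded[i + 1] * w[1] + x_padded[i + 2] * w[2] + bias
        for i in range(len(x_padded) - 2)
    ]
```

With `bias=0.0` this reduces to the current behavior, which is why the bug is silent when checkpoints happen to carry zero (or no) bias.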
```python
orig = embeddings.position_embedding.weight.data
sqrt_n = int(math.sqrt(orig.shape[0]))
grid = orig.reshape(sqrt_n, sqrt_n, -1).permute(2, 0, 1).unsqueeze(0)
resized = F.interpolate(
```
sqrt_n = int(math.sqrt(orig.shape[0])) is used to reshape positional embeddings into a square grid, but this will silently truncate when orig.shape[0] is not a perfect square and then fail or mis-reshape. It would be safer to assert sqrt_n * sqrt_n == orig.shape[0] (or handle the non-square case explicitly) before reshape.
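A guard along those lines can use `math.isqrt` for an exact integer check instead of truncating float `sqrt` (the helper name is hypothetical):

```python
import math

def square_grid_side(num_pos: int) -> int:
    # int(math.sqrt(n)) silently truncates when n is not a perfect square;
    # math.isqrt plus an equality check fails loudly before the reshape.
    side = math.isqrt(num_pos)
    if side * side != num_pos:
        raise ValueError(
            f"{num_pos} position embeddings do not form a square grid"
        )
    return side
```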
```python
def image_embedding(self, nchw_pixels: torch.Tensor) -> torch.Tensor:
    """[B, 3, 512, 512] float32 pixels in [0, 255] -> [B, 256, D]."""
    x = (nchw_pixels / 255.0 - 0.5) / 0.5
    x = x.unfold(2, PATCH_SIZE, PATCH_SIZE).unfold(3, PATCH_SIZE, PATCH_SIZE)
    x = x.permute(0, 2, 3, 4, 5, 1).reshape(1, FIXED_H * FIXED_W, PATCH_SIZE * PATCH_SIZE * 3)
```
image_embedding hard-codes batch size 1 via .reshape(1, ...) and later returns projected.reshape(1, ...), but the docstring and type hints imply it supports [B, ...]. If batch size is intentionally fixed to 1, consider asserting nchw_pixels.shape[0] == 1 and updating the docstring; otherwise, preserve B through the reshapes so the method behaves correctly for B>1.
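Shape-wise, the fix is to thread B through every reshape. A plain-Python shape calculator (hypothetical name; the 32-pixel patch size is inferred from the 512x512 input and 256-token docstring, not stated in the diff) makes the intended contract explicit:

```python
def patch_tokens_shape(b, c, h, w, patch):
    # The double unfold turns [B, C, H, W] into (H//p) * (W//p) patches of
    # p*p*c values each; using b (not the literal 1) keeps batches > 1 correct.
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    num_patches = (h // patch) * (w // patch)
    return (b, num_patches, patch * patch * c)
```

For the fixed 512x512 input this reproduces the documented 256-token layout for any B, whereas the hard-coded `reshape(1, ...)` only matches when B == 1.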
```python
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    return torch.export._trace._export(
        _Decoder(lfm2.text_model, dim, conv_indices, dtype=dtype, device=device),
        (example_emb, example_pos),
        dynamic_shapes=({1: token_dim}, {0: token_dim}),
        strict=False,
        prefer_deferred_runtime_asserts_over_guards=True,
    )
```
This uses the private API torch.export._trace._export, which is not stable and may break across PyTorch versions. If possible, prefer the public torch.export.export(...) API; otherwise, consider isolating this behind a small helper with a clear comment/version guard so failures are easier to diagnose when PyTorch internals change.
Summary
Add LFM2.5-VL (450M and 1.6B) as a multi-method PTE with three methods: vision_encoder, token_embedding, and text_decoder, for CUDA/AOTI. LFM was not supported on CUDA before. This PR originally targeted XNNPACK, then was reworked for CUDA; I have not had a chance to test with XNNPACK. It probably also needs CI/unit tests for the pipeline.

Context: on very small (<500M) models, llama.cpp and ExecuTorch both deliver good performance for low-latency use cases (vs. SGLang, a higher-overhead framework, at concurrency=1). This is a first step toward a unified benchmark and rigorous measurement of such overhead.
HW: NVIDIA B300, torch 2.11, CUDA 13.0
Results: 333-400 decode tok/s, 435-454 prefill tok/s via the llama_main C++ runner.

Key implementation details
- Conv state is passed as IO through attn_options["conv_states"] for AOTI compatibility; as before, register_buffer is still used for XNNPACK.
- The state tensor is marked with mark_static_address (same as transformers' StaticCache for Gemma3) so AOTI can trace it.
- The depthwise convolution is computed manually instead of via nn.Conv1d(groups=dim): Triton has no template for depthwise conv1d with dynamic seq_len (or at least I was not able to get it working correctly). If there is an alternative, I would appreciate pointers to its implementation; I did not find one in the repo.

Prefill sweep (B300, bf16)
Sample outputs (llama_main)
Test plan
Export (multi-method PTE)
Run with llama_main (single-method PTE)
export_single_method.py
Python inference runner
run_lfm2vl.py
Benchmark
bench_lfm2vl.py
Verification status
Known limitations / future work
- torch._inductor private API is used to fix an nvcc/Triton PTX mismatch. Fragile; should be fixed upstream in PyTorch (relevant code: _nvcc_arch_as_compile_option maps 103→100f, should be 103→103a).
- exir/emit/_emitter.py copies CUDA tensor storage to CPU before the ctypes.data_ptr() read. This is a general fix, not LFM2.5-VL-specific, and should be upstreamed as a standalone PR.
- getattr pattern: _Decoder and _LlamaCompatModel use register_buffer(f"conv_state_{idx}") + getattr(self, f"conv_state_{idx}"). Works, but violates the "no dynamic setattr/getattr" style guideline; could use a list-based approach instead.
- Allocation at max_batch_size consumes significant memory; batch=2048 OOMs during AOTI autotuning.
- llava_main integration: the multi-method PTE (vision_encoder + token_embedding + text_decoder) follows the LLaVA runner pattern but hasn't been tested with the actual llava_main C++ binary.
- lm_head weight-tying assumption: convert_weights.py assumes lm_head.weight == tok_embeddings.weight (tied embeddings). If a future checkpoint untied them, the lm_head weights would be silently ignored.
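That last assumption could fail loudly rather than silently. A sketch of such a guard, where the HF-style key names are assumptions and plain equality stands in for torch.equal on real tensors:

```python
def check_tied_lm_head(state_dict):
    # Hypothetical guard for convert_weights.py: refuse to silently drop
    # lm_head.weight when it is no longer tied to the token embeddings.
    lm_head = state_dict.get("lm_head.weight")
    tok_emb = state_dict.get("model.embed_tokens.weight")
    if lm_head is not None and lm_head != tok_emb:  # torch.equal in real code
        raise ValueError(
            "lm_head.weight is untied from tok_embeddings; remap it instead of dropping it"
        )
    return tok_emb
```

A checkpoint with a missing or identical lm_head passes through unchanged; an untied one raises instead of producing a model with wrong logits.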