[TRTLLM-12507][feat] Multi-lora for routed experts by brb-nv · Pull Request #14763 · NVIDIA/TensorRT-LLM

brb-nv · 2026-05-29T21:18:37Z

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Introduces two pure-Python helper modules used by the routed-expert MoE LoRA path, plus their unit tests. No kernel or runtime changes.

* `tensorrt_llm/_torch/peft/lora/moe_layout.py` -- normalizes per-expert LoRA adapter layouts (shared-outer vs per-expert, fc1/gated/fc2 arrangement) into the strided/expanded buffers the fused MoE kernel consumes. Pure tensor manipulation; depends only on `torch`.
* `tensorrt_llm/_torch/peft/lora/validation.py` -- validates that a LoraConfig + adapter directory contains the right module set / rank / shape combination for routed-expert MoE before the engine attaches it. Depends only on `tensorrt_llm.lora_helper.LoraConfig`.

The helpers are landed first so the kernel-integration MR can reference them as a dependency. They have no runtime callers in this MR; the call sites are introduced in the follow-up "feat: add routed-expert LoRA support to fused MoE (eager)" MR. Tests run on CPU only, no GPU dependency.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

MVP scope:
- Cutlass MoE backend only; bf16/fp16 activations, unquantized weights.
- Per-expert and shared-outer adapters (shared-outer via load-time replication; native kernel flag is a Phase 6 follow-up).
- Multi-LoRA with per-request adapter dispatch.
- Slot-indexed input layout for CUDA-graph compatibility is plumbed end-to-end. Graph capture itself is rejected at the op boundary until the kernel-side host-sync removal lands (see F7 in the dev notes); slot-indexed eager execution is supported and tested.

C++ thop op (cpp/tensorrt_llm/thop/moeOp.cpp):
- New optional kwargs on runMoe: per-request and slot-indexed (fc1, fc2, gated) lora_ranks / lora_weight_ptrs, plus host_request_types, host_context_lengths, token_to_slot, and lora_max_low_rank.
- LoraImpl cache keyed on (hidden, inter, dtype, max_rank); cublas wrapper and the kernel's memcpy event are constructed lazily.
- buildMoeLoraParams dispatches between per-request and slot-indexed.
- Workspace extended to allocate the cuBLAS grouped-gemm scratch.
- Rejections: NVFP4 base, min-latency mode, alltoall, mixed per-request + slot-indexed inputs, and CUDA-graph capture under isCapturing(stream).

Python wiring:
- torch_custom_ops.py: fused_moe schema extended with both LoRA modes.
- fused_moe_cutlass.py: CutlassFusedMoE pulls per-layer lora_params and packs them via _extract_moe_lora_tensors; CUDA-graph branch reads stable pinned-host tables from CudaGraphLoraParams.
- modeling_qwen_moe.py threads lora_params into the routed-expert call.
- create_moe.py rejects MoE LoRA on non-Cutlass backends and with quantization via the new validator.

LoRA infrastructure:
- peft/lora/validation.py: MoE-LoRA target validator.
- peft/lora/moe_layout.py: per-expert and shared-outer adapter helpers for tests.
- peft/lora/cuda_graph_lora_params.py: token_to_slot_host buffer and get_moe_slot_inputs accessor (stable per-(layer, module) tables).
- peft/lora/layer.py: MoeLoraLayer sentinel for CudaGraphLoraManager.

Tests (tests/unittest/_torch/lora/):
- CPU: test_moe_lora_validator.py, test_moe_layout.py; extended test_lora.py with CudaGraphLoraParams MoE coverage.
- GPU: test_moe_lora_op.py - per-expert, shared-outer, min-latency rejection, slot-indexed-matches-per-request, multi-LoRA mixed batch, incomplete-input and mode-mixing rejections. The CUDA-graph capture+replay test is skipped pending the Phase 6 kernel patch.

Docs:
- docs/source/features/lora.md: new Routed-Expert MoE LoRA section.
- docs/source/_dev_notes/moe-lora-preflight.md: internal design memo (verified facts F1-F7 and per-phase implementation log).

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
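
The dispatch and rejection rules described in the commit above can be sketched in plain Python. This is only an illustration, not the code in moeOp.cpp: the function name `select_lora_mode`, the `<module>_ranks` / `<module>_ptrs` dictionary keys, and the exact completeness rule (ranks and pointers must arrive together, and slot-indexed inputs need `token_to_slot`) are assumptions made for the example.

```python
# Hypothetical sketch of per-request vs slot-indexed dispatch.
# Key names and the completeness rule are assumptions, not the real op's API.
MODULES = ("fc1", "fc2", "gated")


def _present(inputs, module):
    """Return True if a module is supplied; raise if only ranks or only ptrs are given."""
    ranks = inputs.get(f"{module}_ranks")
    ptrs = inputs.get(f"{module}_ptrs")
    if (ranks is None) != (ptrs is None):
        raise ValueError(f"incomplete LoRA input for {module}")
    return ranks is not None


def select_lora_mode(per_request, slot_indexed, token_to_slot=None):
    """Pick 'none', 'per_request', or 'slot_indexed'; reject mixed or incomplete input."""
    # List comprehensions (not generators) so every module is validated.
    has_req = any([_present(per_request, m) for m in MODULES])
    has_slot = any([_present(slot_indexed, m) for m in MODULES])
    if has_req and has_slot:
        raise ValueError("per-request and slot-indexed LoRA inputs cannot be mixed")
    if has_slot:
        if token_to_slot is None:
            raise ValueError("slot-indexed LoRA inputs require token_to_slot")
        return "slot_indexed"
    return "per_request" if has_req else "none"
```

Usage: `select_lora_mode({"fc1_ranks": [8], "fc1_ptrs": [0]}, {})` selects the per-request path, while passing both dictionaries populated raises `ValueError`, mirroring the mode-mixing rejection listed above.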

Replaces load-time replication of shared-outer MoE LoRA adapters with a kernel-side offset gate. When a side is marked shared, setupLoraWorkspace zero-offsets that side's per-token pointer, so the adapter is read as a single unreplicated buffer (e.g. A: [rank, in_dim], B: [out_dim, rank]) instead of the default [num_experts, ...] layout. Math is bit-identical to the replicated path; memory is reduced by (num_experts - 1) copies of the shared matrix.

This is Phase 6a in docs/source/_dev_notes/moe-lora-preflight.md. The sibling Phase 6b (GPU-side per-token pointer expansion that lifts the host-sync constraint and unblocks CUDA-graph capture) remains a separate follow-up; the related test stays skipped with an updated reference.

Kernel (cpp/tensorrt_llm/kernels/cutlass_kernels/...):
- Six new bool fields on LoraParams (default false): fc1_shared_a/b, fc2_shared_a/b, gated_shared_a/b. Mirrored in the internal Cutlass header.
- setupLoraWorkspace reads the flags and gates the `weight_index * dim * lora_rank` offset for each (module, side) pair. Behavior with all flags false is identical to the previous code.

Op (cpp/tensorrt_llm/thop/moeOp.cpp):
- Six new optional bool kwargs on runMoe, threaded through buildMoeLoraParams to the returned LoraParams.

Python (tensorrt_llm/_torch/):
- custom_ops/torch_custom_ops.py: fused_moe schema (real + register_fake) extended with the six bool kwargs.
- modules/fused_moe/fused_moe_cutlass.py: _resolve_moe_shared_flags reads lora_params["moe_shared_flags"] and threads the dict through both _extract_moe_lora_tensors paths (per-request and CUDA-graph).
- peft/lora/moe_layout.py: new make_native_shared_lora helper returning unreplicated shapes plus matching shared flags; expand_native_shared_for_reference broadcasts back to [E, ...] for the eager reference.

Tests (tests/unittest/_torch/lora/):
- test_moe_layout.py: native_shared shapes, seed reproducibility, expand-for-reference equivalence, and a fp32 bit-identity check between the native-shared reference delta and the replicated one.
- test_moe_lora_op.py:
  - test_moe_native_shared_outer_matches_replicated_bitidentical: runs the kernel twice (native with fc1_shared_a=True, fc2_shared_b=True, gated_shared_a=True vs. replicated baseline with all flags false) on the same logical adapter and asserts rtol=0, atol=0 on the output.
  - test_moe_native_shared_outer_differs_from_no_lora: sanity smoke for the native path alone (no replication on any side).

Docs:
- features/lora.md: user-facing section now describes both encodings (load-time replication and native shared-outer) and how to opt in.
- _dev_notes/moe-lora-preflight.md: Phase 6a marked done with the full implementation log; Phase 6b kept as a follow-up.

Loader plumbing (lora_layout.json sidecar parser in lora_manager.py) deferred to a separate commit -- the kernel and op already accept native adapters via the boolean flags; the loader change is the missing piece for end-to-end model wiring.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
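
To make the offset gate concrete, here is a minimal pure-Python sketch of the idea. It is not the setupLoraWorkspace code: the helper names are hypothetical, buffers are flat Python lists rather than device memory, and `weight_index` is assumed to identify the expert whose adapter a token reads.

```python
# Hypothetical model of the shared-side offset gate; names are illustrative only.

def adapter_offset(weight_index, dim, lora_rank, shared):
    """Element offset of one adapter matrix inside a (module, side) buffer."""
    # A shared side always reads from offset 0, i.e. the single unreplicated copy.
    return 0 if shared else weight_index * dim * lora_rank


def read_adapter(buffer, weight_index, dim, lora_rank, shared):
    """Return the flat slice of the buffer that the given expert would read."""
    start = adapter_offset(weight_index, dim, lora_rank, shared)
    return buffer[start:start + dim * lora_rank]


num_experts, dim, lora_rank = 4, 3, 2

# One logical adapter matrix, stored once (native shared) or num_experts times (replicated).
shared_buffer = [float(i) for i in range(dim * lora_rank)]
replicated_buffer = shared_buffer * num_experts

# Every expert sees identical values on both paths.
same_values = all(
    read_adapter(replicated_buffer, e, dim, lora_rank, shared=False)
    == read_adapter(shared_buffer, e, dim, lora_rank, shared=True)
    for e in range(num_experts)
)

# Elements saved: (num_experts - 1) copies of the shared matrix.
elements_saved = len(replicated_buffer) - len(shared_buffer)
```

With these toy sizes, `elements_saved` equals `(num_experts - 1) * dim * lora_rank`, which is the memory reduction the commit message describes.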

…ion tests

Adds an external-reference correctness suite for the fused MoE LoRA path, complementing the existing same-kernel comparison tests in this module. The earlier tests (device-vs-host, slot-vs-per-request, native-vs-replicated) put the same kernel on both sides of the diff and so cannot detect a reorder or mis-application of LoRA deltas; these new tests put the op output against a hand-written PyTorch fp32 reference.

`reference_swiglu_moe_lora` in `tensorrt_llm/_torch/peft/lora/moe_layout.py` is the reference: top-k-routed SwiGLU MoE with three independent LoRA modules (moe_h_to_4h / moe_gate / moe_4h_to_h), all computed in fp32 and then aggregated with topk-score weighting.

Tests added in `tests/unittest/_torch/lora/test_moe_lora_op.py`:

* test_moe_lora_eager_matches_pytorch_reference: reference-parity baseline with all three LoRA modules active and distinct.
* test_moe_no_lora_eager_matches_pytorch_reference: probe with no LoRA at all, confirming the base FC1+SwiGLU+FC2+finalize pipeline is healthy.
* test_moe_zero_lora_eager_matches_pytorch_reference: probe with the LoRA pipeline fully active but adapter weights identically zero, isolating the pointer-expand / setupLoraWorkspace plumbing from the LoRA delta computation itself.
* test_moe_only_{fc1,gated,fc2}_lora_eager_matches_pytorch_reference: per-module probes activating exactly one LoRA module.
* test_moe_pair_{fc1_gated,fc1_fc2,gated_fc2}_distinct_lora_eager_matches_pytorch_reference: pairwise probes with two modules active and the third zero.
* test_moe_all_three_lora_with_gated_aliased_to_fc1_eager_matches_pytorch_reference: alias probe with all three modules active and gated aliased to fc1.

LoRA adapters are scaled by `_LORA_REFERENCE_SCALE = 0.25` (applied to both A and B, so dW = B@A scales as the square) so legitimate output magnitudes stay in the O(1)-O(10) range at the test shape. Without this scaling, N(0,1) adapter draws produce LoRA deltas many orders of magnitude larger than the base weights and the SwiGLU+FC2 path amplifies the intermediates to magnitudes ~1e5; at that scale bf16 reduction-order noise on a small fraction of lanes routinely exceeds an `atol=1.0` budget without there being any kernel correctness bug.

Each test prints op_max_mag / ref_max_mag / max_abs_diff / rel_err on stdout (captured by pytest on failure), and uses `torch.testing.assert_close(..., rtol=5e-2, atol=1.0)` as the PASS/FAIL gate.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
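
Two pieces of arithmetic in the commit above are easy to check by hand: scaling both A and B by 0.25 scales dW = B@A by 0.25 squared, and `torch.testing.assert_close` accepts an element when |actual - expected| <= atol + rtol * |expected|. The sketch below reproduces both in plain Python so it runs without torch. The toy matrices and helper names are illustrative assumptions and are unrelated to the shapes used in the real tests.

```python
# Illustrative only: toy matrices and helper names are not taken from the test suite.
LORA_REFERENCE_SCALE = 0.25  # value of _LORA_REFERENCE_SCALE quoted in the commit message


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]


def scaled(matrix, factor):
    """Multiply every entry of a matrix by a scalar."""
    return [[factor * x for x in row] for row in matrix]


def within_tolerance(actual, expected, rtol=5e-2, atol=1.0):
    """Elementwise closeness rule documented for torch.testing.assert_close."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Toy adapter: A is [rank, in_dim], B is [out_dim, rank].
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[1.0, 0.5], [2.0, -1.0]]

delta = matmul(B, A)
delta_scaled = matmul(scaled(B, LORA_REFERENCE_SCALE), scaled(A, LORA_REFERENCE_SCALE))

# Scaling both factors by s scales the product by s * s.
expected_delta_scaled = scaled(delta, LORA_REFERENCE_SCALE ** 2)
```

Under the quoted tolerances, `within_tolerance(10.4, 10.0)` passes because the budget is 1.0 + 0.05 * 10.0 = 1.5, while `within_tolerance(2.0, 0.0)` fails because only the absolute term of 1.0 is available when the reference value is zero.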

brb-nv added 4 commits May 29, 2026 12:34

github-actions Bot assigned brb-nv May 29, 2026

brb-nv mentioned this pull request May 29, 2026

[TRTLLM-12507][feat] Add MoE LoRA layout and validation helpers #14764

Open

1 task

brb-nv wants to merge 4 commits into NVIDIA:main from brb-nv:user/brb/moe-lora-feature
