[TRTLLM-12507][feat] Multi-lora for routed experts#14763
Draft
brb-nv wants to merge 4 commits into
Introduces two pure-Python helper modules used by the routed-expert MoE LoRA path, plus their unit tests. No kernel or runtime changes.
* `tensorrt_llm/_torch/peft/lora/moe_layout.py` -- normalizes per-expert LoRA adapter layouts (shared-outer vs per-expert, fc1/gated/fc2 arrangement) into the strided/expanded buffers the fused MoE kernel consumes. Pure tensor manipulation; depends only on `torch`.
* `tensorrt_llm/_torch/peft/lora/validation.py` -- validates that a LoraConfig + adapter directory contains the right module set / rank / shape combination for routed-expert MoE before the engine attaches it. Depends only on `tensorrt_llm.lora_helper.LoraConfig`.
The helpers are landed first so the kernel-integration MR can reference them as a dependency. They have no runtime callers in this MR; the call sites are introduced in the follow-up "feat: add routed-expert LoRA support to fused MoE (eager)" MR.
Tests run on CPU only, no GPU dependency.
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
MVP scope:
- Cutlass MoE backend only; bf16/fp16 activations, unquantized weights.
- Per-expert and shared-outer adapters (shared-outer via load-time replication; native kernel flag is a Phase 6 follow-up).
- Multi-LoRA with per-request adapter dispatch.
- Slot-indexed input layout for CUDA-graph compatibility is plumbed end-to-end. Graph capture itself is rejected at the op boundary until the kernel-side host-sync removal lands (see F7 in the dev notes); slot-indexed eager execution is supported and tested.
C++ thop op (cpp/tensorrt_llm/thop/moeOp.cpp):
- New optional kwargs on runMoe: per-request and slot-indexed (fc1, fc2, gated) lora_ranks / lora_weight_ptrs, plus host_request_types, host_context_lengths, token_to_slot, and lora_max_low_rank.
- LoraImpl cache keyed on (hidden, inter, dtype, max_rank); cublas wrapper and the kernel's memcpy event are constructed lazily.
- buildMoeLoraParams dispatches between per-request and slot-indexed.
- Workspace extended to allocate the cuBLAS grouped-gemm scratch.
- Rejections: NVFP4 base, min-latency mode, alltoall, mixed per-request + slot-indexed inputs, and CUDA-graph capture under isCapturing(stream).
Python wiring:
- torch_custom_ops.py: fused_moe schema extended with both LoRA modes.
- fused_moe_cutlass.py: CutlassFusedMoE pulls per-layer lora_params and packs them via _extract_moe_lora_tensors; CUDA-graph branch reads stable pinned-host tables from CudaGraphLoraParams.
- modeling_qwen_moe.py threads lora_params into the routed-expert call.
- create_moe.py rejects MoE LoRA on non-Cutlass backends and with quantization via the new validator.
LoRA infrastructure:
- peft/lora/validation.py: MoE-LoRA target validator.
- peft/lora/moe_layout.py: per-expert and shared-outer adapter helpers for tests.
- peft/lora/cuda_graph_lora_params.py: token_to_slot_host buffer and get_moe_slot_inputs accessor (stable per-(layer, module) tables).
- peft/lora/layer.py: MoeLoraLayer sentinel for CudaGraphLoraManager.
Tests (tests/unittest/_torch/lora/):
- CPU: test_moe_lora_validator.py, test_moe_layout.py; extended test_lora.py with CudaGraphLoraParams MoE coverage.
- GPU: test_moe_lora_op.py - per-expert, shared-outer, min-latency rejection, slot-indexed-matches-per-request, multi-LoRA mixed batch, incomplete-input and mode-mixing rejections. The CUDA-graph capture+replay test is skipped pending the Phase 6 kernel patch.
Docs:
- docs/source/features/lora.md: new Routed-Expert MoE LoRA section.
- docs/source/_dev_notes/moe-lora-preflight.md: internal design memo (verified facts F1-F7 and per-phase implementation log).
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
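Illustrative note on the dispatch and rejection rules in the commit above: the sketch below shows one way the "per-request vs. slot-indexed" choice and the incomplete-input / mode-mixing rejections could be expressed. It is not the code in moeOp.cpp; the function names, the grouping of kwargs into two sequences, and the error messages are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def _group_state(group: Sequence[Optional[object]]) -> str:
    """Classify a kwarg group as 'empty', 'partial', or 'full'."""
    present = [item is not None for item in group]
    if not any(present):
        return "empty"
    return "full" if all(present) else "partial"


def select_moe_lora_mode(per_request: Sequence[Optional[object]],
                         slot_indexed: Sequence[Optional[object]]) -> str:
    """Hypothetical helper: pick the LoRA input mode or reject the call.

    Each argument stands in for one group of optional kwargs
    (ranks, weight pointers, ...); None means "not supplied".
    """
    pr = _group_state(per_request)
    si = _group_state(slot_indexed)
    if "partial" in (pr, si):
        # Some but not all inputs of a mode were supplied.
        raise ValueError("incomplete MoE LoRA inputs")
    if pr == "full" and si == "full":
        # Both modes supplied at once.
        raise ValueError("per-request and slot-indexed inputs cannot be mixed")
    if pr == "full":
        return "per_request"
    if si == "full":
        return "slot_indexed"
    return "none"
```

Validating at the boundary, before any workspace is built, keeps every rejection path free of side effects, which appears to be the intent of rejecting "at the op boundary".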
Replaces load-time replication of shared-outer MoE LoRA adapters with a
kernel-side offset gate. When a side is marked shared, setupLoraWorkspace
zero-offsets that side's per-token pointer, so the adapter is read as a
single unreplicated buffer (e.g. A: [rank, in_dim], B: [out_dim, rank])
instead of the default [num_experts, ...] layout. Math is bit-identical
to the replicated path; memory is reduced by (num_experts - 1) copies of
the shared matrix.
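As a worked example of the memory claim above, the following sketch computes the bytes saved for one shared matrix. The function name and the concrete sizes (8 experts, rank 16, dimension 4096, 2-byte bf16 elements) are illustrative assumptions, not values taken from this change.

```python
def shared_outer_bytes_saved(num_experts: int, rank: int, dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes saved by storing one [rank, dim] matrix instead of num_experts copies."""
    one_copy = rank * dim * bytes_per_elem
    replicated = num_experts * one_copy
    # The saving is exactly (num_experts - 1) copies of the shared matrix.
    return replicated - one_copy


# Assumed example sizes: 8 experts, rank 16, dim 4096, bf16 (2 bytes).
saved = shared_outer_bytes_saved(num_experts=8, rank=16, dim=4096)
print(saved)  # 917504 bytes, i.e. 7 copies of a 131072-byte matrix
```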
This is Phase 6a in docs/source/_dev_notes/moe-lora-preflight.md. The
sibling Phase 6b (GPU-side per-token pointer expansion that lifts the
host-sync constraint and unblocks CUDA-graph capture) remains a separate
follow-up; the related test stays skipped with an updated reference.
Kernel (cpp/tensorrt_llm/kernels/cutlass_kernels/...):
- Six new bool fields on LoraParams (default false): fc1_shared_a/b,
fc2_shared_a/b, gated_shared_a/b. Mirrored in the internal Cutlass
header.
- setupLoraWorkspace reads the flags and gates the
`weight_index * dim * lora_rank` offset for each (module, side) pair.
Behavior with all flags false is identical to the previous code.
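The offset gate described above can be sketched in plain Python. This is a model of the arithmetic only, not the CUDA/C++ code in setupLoraWorkspace; the helper names and the flat-list representation of the adapter buffer are assumptions.

```python
def adapter_offset(weight_index: int, dim: int, lora_rank: int, shared: bool) -> int:
    """Element offset of one expert's adapter inside a flat buffer.

    With the shared flag set, the offset is gated to zero so every expert
    reads the same unreplicated matrix.
    """
    return 0 if shared else weight_index * dim * lora_rank


def read_adapter(flat, weight_index, dim, lora_rank, shared):
    """Return the dim * lora_rank elements one expert would read."""
    start = adapter_offset(weight_index, dim, lora_rank, shared)
    return flat[start:start + dim * lora_rank]


num_experts, dim, lora_rank = 4, 3, 2
shared_buf = [float(i) for i in range(dim * lora_rank)]  # one copy
replicated_buf = shared_buf * num_experts                # [num_experts, ...] layout

for expert in range(num_experts):
    native = read_adapter(shared_buf, expert, dim, lora_rank, shared=True)
    baseline = read_adapter(replicated_buf, expert, dim, lora_rank, shared=False)
    assert native == baseline  # same values reach the GEMM on both paths
```

Because both paths hand identical values to the same downstream computation, the result is expected to match exactly, which is what the rtol=0, atol=0 test asserts.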
Op (cpp/tensorrt_llm/thop/moeOp.cpp):
- Six new optional bool kwargs on runMoe, threaded through
buildMoeLoraParams to the returned LoraParams.
Python (tensorrt_llm/_torch/):
- custom_ops/torch_custom_ops.py: fused_moe schema (real + register_fake)
extended with the six bool kwargs.
- modules/fused_moe/fused_moe_cutlass.py: _resolve_moe_shared_flags
reads lora_params["moe_shared_flags"] and threads the dict through
both _extract_moe_lora_tensors paths (per-request and CUDA-graph).
- peft/lora/moe_layout.py: new make_native_shared_lora helper returning
unreplicated shapes plus matching shared flags;
expand_native_shared_for_reference broadcasts back to [E, ...] for the
eager reference.
Tests (tests/unittest/_torch/lora/):
- test_moe_layout.py: native_shared shapes, seed reproducibility,
expand-for-reference equivalence, and a fp32 bit-identity check
between the native-shared reference delta and the replicated one.
- test_moe_lora_op.py:
- test_moe_native_shared_outer_matches_replicated_bitidentical:
runs the kernel twice (native with fc1_shared_a=True,
fc2_shared_b=True, gated_shared_a=True vs. replicated baseline with
all flags false) on the same logical adapter and asserts
rtol=0, atol=0 on the output.
- test_moe_native_shared_outer_differs_from_no_lora: sanity smoke for
the native path alone (no replication on any side).
Docs:
- features/lora.md: user-facing section now describes both encodings
(load-time replication and native shared-outer) and how to opt in.
- _dev_notes/moe-lora-preflight.md: Phase 6a marked done with the full
implementation log; Phase 6b kept as a follow-up. Loader plumbing
(lora_layout.json sidecar parser in lora_manager.py) deferred to a
separate commit -- the kernel and op already accept native adapters
via the boolean flags; the loader change is the missing piece for
end-to-end model wiring.
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
…ion tests
Adds an external-reference correctness suite for the fused MoE LoRA path,
complementing the existing same-kernel comparison tests in this module.
The earlier tests (device-vs-host, slot-vs-per-request, native-vs-replicated)
put the same kernel on both sides of the diff and so cannot detect a
reorder or mis-application of LoRA deltas; these new tests compare the op
output against a hand-written PyTorch fp32 reference.
`reference_swiglu_moe_lora` in `tensorrt_llm/_torch/peft/lora/moe_layout.py`
is the reference: top-k-routed SwiGLU MoE with three independent LoRA
modules (moe_h_to_4h / moe_gate / moe_4h_to_h), all computed in fp32
and then aggregated with topk-score weighting.
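For orientation, a dependency-free, single-token sketch of the kind of computation such a reference performs is shown below. It is not `reference_swiglu_moe_lora`: the function names, the dict layout, and in particular the mapping of moe_h_to_4h to the up projection, moe_gate to the SiLU-activated gate projection, and moe_4h_to_h to the down projection are assumptions. Python floats are fp64 here, whereas the actual reference computes in fp32 with PyTorch.

```python
import math


def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lora_linear(W, x, A=None, B=None):
    """y = W x + B (A x), with A: [rank, in] and B: [out, rank]."""
    y = matvec(W, x)
    if A is not None and B is not None:
        delta = matvec(B, matvec(A, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y


def silu(v):
    return v / (1.0 + math.exp(-v))


def swiglu_moe_lora_token(x, experts, topk_ids, topk_scores, lora=None):
    """Top-k routed SwiGLU MoE for one token, with optional per-expert LoRA.

    experts[e] holds 'up', 'gate', 'down' matrices; lora[e] maps a module
    name to an (A, B) pair. The module-to-projection mapping is assumed.
    """
    none = (None, None)
    out = [0.0] * len(experts[topk_ids[0]]["down"])
    for e, score in zip(topk_ids, topk_scores):
        w = experts[e]
        adapters = lora.get(e, {}) if lora is not None else {}
        up = lora_linear(w["up"], x, *adapters.get("moe_h_to_4h", none))
        gate = lora_linear(w["gate"], x, *adapters.get("moe_gate", none))
        hidden = [silu(g) * u for g, u in zip(gate, up)]
        y = lora_linear(w["down"], hidden, *adapters.get("moe_4h_to_h", none))
        # Aggregate expert outputs with top-k score weighting.
        out = [o + score * yi for o, yi in zip(out, y)]
    return out
```

Keeping the three LoRA modules as independent additive deltas is what lets a reference of this shape detect a swapped or mis-applied module, which a same-kernel comparison cannot.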
Tests added in `tests/unittest/_torch/lora/test_moe_lora_op.py`:
* test_moe_lora_eager_matches_pytorch_reference: reference-parity baseline
with all three LoRA modules active and distinct.
* test_moe_no_lora_eager_matches_pytorch_reference: probe with no LoRA at
all, confirming the base FC1+SwiGLU+FC2+finalize pipeline is healthy.
* test_moe_zero_lora_eager_matches_pytorch_reference: probe with the LoRA
pipeline fully active but adapter weights identically zero, isolating
the pointer-expand / setupLoraWorkspace plumbing from the LoRA delta
computation itself.
* test_moe_only_{fc1,gated,fc2}_lora_eager_matches_pytorch_reference:
per-module probes activating exactly one LoRA module.
* test_moe_pair_{fc1_gated,fc1_fc2,gated_fc2}_distinct_lora_eager_matches_pytorch_reference:
pairwise probes with two modules active and the third zero.
* test_moe_all_three_lora_with_gated_aliased_to_fc1_eager_matches_pytorch_reference:
alias probe with all three modules active and gated aliased to fc1.
LoRA adapters are scaled by `_LORA_REFERENCE_SCALE = 0.25` (applied to
both A and B, so dW = B@A scales as the square) so legitimate output
magnitudes stay in the O(1)-O(10) range at the test shape. Without this
scaling, N(0,1) adapter draws produce LoRA deltas many orders of magnitude
larger than the base weights and the SwiGLU+FC2 path amplifies the
intermediates to magnitudes ~1e5; at that scale bf16 reduction-order noise
on a small fraction of lanes routinely exceeds an `atol=1.0` budget without
there being any kernel correctness bug.
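The "scales as the square" remark can be checked with a tiny example: scaling both A and B by s scales dW = B@A by s^2, so 0.25 on each factor gives 0.0625 on the delta. The matrices below are arbitrary illustrative values, and `matmul` / `scale` are helper names introduced here.

```python
SCALE = 0.25  # same value as _LORA_REFERENCE_SCALE in the description


def matmul(B, A):
    """Plain-Python matrix product of B [out, rank] and A [rank, in]."""
    cols = list(zip(*A))
    return [[sum(b * a for b, a in zip(row, col)) for col in cols] for row in B]


def scale(M, s):
    return [[s * v for v in row] for row in M]


A = [[1.0, 2.0], [3.0, 4.0]]   # [rank, in]
B = [[1.0, 0.0], [0.5, 2.0]]   # [out, rank]

dW = matmul(B, A)                                       # [[1.0, 2.0], [6.5, 9.0]]
dW_scaled = matmul(scale(B, SCALE), scale(A, SCALE))

# Each entry of the scaled delta is SCALE**2 = 0.0625 times the original.
for row, row_scaled in zip(dW, dW_scaled):
    for v, v_scaled in zip(row, row_scaled):
        assert v_scaled == SCALE ** 2 * v
```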
Each test prints op_max_mag / ref_max_mag / max_abs_diff / rel_err on
stdout (captured by pytest on failure), and uses
`torch.testing.assert_close(..., rtol=5e-2, atol=1.0)` as the PASS/FAIL
gate.
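For readers unfamiliar with the gate, `torch.testing.assert_close` accepts an element when |actual - expected| <= atol + rtol * |expected|. The sketch below restates that criterion and the printed diagnostics in plain Python; the helper names are hypothetical, and the definition of rel_err used here (max_abs_diff divided by ref_max_mag) is an assumption, since the description does not define it.

```python
def within_tolerance(actual, expected, rtol=5e-2, atol=1.0):
    """Elementwise closeness test: |a - e| <= atol + rtol * |e|."""
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


def diff_metrics(op, ref):
    """Diagnostics in the spirit of the values the tests print."""
    op_max_mag = max(abs(v) for v in op)
    ref_max_mag = max(abs(v) for v in ref)
    max_abs_diff = max(abs(a - e) for a, e in zip(op, ref))
    # Assumed definition of rel_err; guarded against a zero reference.
    rel_err = max_abs_diff / max(ref_max_mag, 1e-12)
    return op_max_mag, ref_max_mag, max_abs_diff, rel_err


print(within_tolerance([10.0], [9.0]))    # True: 1.0 <= 1.0 + 0.05 * 9.0
print(within_tolerance([12.0], [10.0]))   # False: 2.0 > 1.0 + 0.05 * 10.0
```

With outputs kept in the O(1)-O(10) range, the atol=1.0 term dominates the budget, which is why the adapter scaling described above matters for the gate to be meaningful.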
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@coderabbitai summary
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either `api-compatible` or `api-breaking`. For `api-breaking`, include `BREAKING` in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.