[TRTLLM-12507][feat] Multi-lora for routed experts #14763

Draft
brb-nv wants to merge 4 commits into NVIDIA:main from brb-nv:user/brb/moe-lora-feature


@brb-nv brb-nv commented May 29, 2026

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

brb-nv added 4 commits May 29, 2026 12:34

Introduces two pure-Python helper modules used by the routed-expert MoE
LoRA path, plus their unit tests. No kernel or runtime changes.

* `tensorrt_llm/_torch/peft/lora/moe_layout.py` -- normalizes per-expert
  LoRA adapter layouts (shared-outer vs per-expert, fc1/gated/fc2
  arrangement) into the strided/expanded buffers the fused MoE kernel
  consumes. Pure tensor manipulation; depends only on `torch`.

* `tensorrt_llm/_torch/peft/lora/validation.py` -- validates that a
  LoraConfig + adapter directory contains the right module set / rank /
  shape combination for routed-expert MoE before the engine attaches it.
  Depends only on `tensorrt_llm.lora_helper.LoraConfig`.
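
The kind of pre-attach check described above might look roughly like the sketch below. It is an illustration only: `AdapterSpec`, `validate_moe_lora`, and the `max_rank` argument are hypothetical names, not the API of `validation.py`, and the allowed module set is assumed from the module names that appear in the reference helper of a later commit.

```python
from dataclasses import dataclass

# Assumed routed-expert LoRA targets (names borrowed from a later commit in
# this PR; the real validator may accept a different set).
MOE_LORA_MODULES = frozenset({"moe_h_to_4h", "moe_gate", "moe_4h_to_h"})


@dataclass(frozen=True)
class AdapterSpec:
    """Hypothetical stand-in for what a validator reads from an adapter."""
    modules: frozenset
    rank: int


def validate_moe_lora(spec: AdapterSpec, max_rank: int) -> None:
    """Raise ValueError if the adapter should not be attached (sketch only)."""
    if not spec.modules:
        raise ValueError("adapter targets no routed-expert modules")
    unknown = spec.modules - MOE_LORA_MODULES
    if unknown:
        raise ValueError(f"unsupported MoE LoRA modules: {sorted(unknown)}")
    if not 0 < spec.rank <= max_rank:
        raise ValueError(f"rank {spec.rank} is outside (0, {max_rank}]")
```

Failing before the engine attaches the adapter keeps the error close to the configuration that caused it, rather than surfacing later as a shape mismatch inside the kernel.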

The helpers are landed first so the kernel-integration MR can reference
them as a dependency. They have no runtime callers in this MR; the call
sites are introduced in the follow-up "feat: add routed-expert LoRA
support to fused MoE (eager)" MR.

Tests run on CPU only, no GPU dependency.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

MVP scope:
- Cutlass MoE backend only; bf16/fp16 activations, unquantized weights.
- Per-expert and shared-outer adapters (shared-outer via load-time
  replication; native kernel flag is a Phase 6 follow-up).
- Multi-LoRA with per-request adapter dispatch.
- Slot-indexed input layout for CUDA-graph compatibility is plumbed
  end-to-end. Graph capture itself is rejected at the op boundary until
  the kernel-side host-sync removal lands (see F7 in the dev notes);
  slot-indexed eager execution is supported and tested.

C++ thop op (cpp/tensorrt_llm/thop/moeOp.cpp):
- New optional kwargs on runMoe: per-request and slot-indexed (fc1, fc2,
  gated) lora_ranks / lora_weight_ptrs, plus host_request_types,
  host_context_lengths, token_to_slot, and lora_max_low_rank.
- LoraImpl cache keyed on (hidden, inter, dtype, max_rank); cublas
  wrapper and the kernel's memcpy event are constructed lazily.
- buildMoeLoraParams dispatches between per-request and slot-indexed.
- Workspace extended to allocate the cuBLAS grouped-gemm scratch.
- Rejections: NVFP4 base, min-latency mode, alltoall, mixed per-request
  + slot-indexed inputs, and CUDA-graph capture under isCapturing(stream).
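
The dispatch and rejection rules listed above can be pictured with a small Python sketch. The real logic lives in C++ (`buildMoeLoraParams` in `moeOp.cpp`); the function name `select_lora_mode`, its arguments, and the returned strings are hypothetical and only mirror the behavior described in this commit message.

```python
from typing import Optional, Sequence


def select_lora_mode(per_request: Optional[Sequence],
                     slot_indexed: Optional[Sequence],
                     is_capturing: bool = False) -> str:
    """Pick the LoRA input mode, rejecting unsupported combinations (sketch)."""
    has_lora = per_request is not None or slot_indexed is not None
    # Graph capture is rejected at the op boundary while LoRA is active.
    if is_capturing and has_lora:
        raise RuntimeError("MoE LoRA is not supported under CUDA-graph capture")
    # The two input layouts are mutually exclusive.
    if per_request is not None and slot_indexed is not None:
        raise ValueError("per-request and slot-indexed LoRA inputs cannot be mixed")
    if per_request is not None:
        return "per_request"
    if slot_indexed is not None:
        return "slot_indexed"
    return "none"
```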

Python wiring:
- torch_custom_ops.py: fused_moe schema extended with both LoRA modes.
- fused_moe_cutlass.py: CutlassFusedMoE pulls per-layer lora_params and
  packs them via _extract_moe_lora_tensors; CUDA-graph branch reads
  stable pinned-host tables from CudaGraphLoraParams.
- modeling_qwen_moe.py threads lora_params into the routed-expert call.
- create_moe.py rejects MoE LoRA on non-Cutlass backends and with
  quantization via the new validator.

LoRA infrastructure:
- peft/lora/validation.py: MoE-LoRA target validator.
- peft/lora/moe_layout.py: per-expert and shared-outer adapter helpers
  for tests.
- peft/lora/cuda_graph_lora_params.py: token_to_slot_host buffer and
  get_moe_slot_inputs accessor (stable per-(layer, module) tables).
- peft/lora/layer.py: MoeLoraLayer sentinel for CudaGraphLoraManager.
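
One way to read the slot-indexed layout is that each request is assigned an adapter slot and every token inherits its request's slot, so the per-slot tables can stay at fixed addresses. The sketch below illustrates that reading under stated assumptions: `build_token_to_slot` and its arguments are hypothetical, and the actual `token_to_slot_host` buffer is a pinned-host tensor rather than a Python list.

```python
def build_token_to_slot(request_slots, tokens_per_request):
    """Expand one slot id per request into one slot id per token (sketch)."""
    if len(request_slots) != len(tokens_per_request):
        raise ValueError("need exactly one slot id per request")
    token_to_slot = []
    for slot, n_tokens in zip(request_slots, tokens_per_request):
        token_to_slot.extend([slot] * n_tokens)
    return token_to_slot


# Two requests: the first uses slot 2 with 3 tokens, the second slot 0 with 1.
table = build_token_to_slot([2, 0], [3, 1])
```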

Tests (tests/unittest/_torch/lora/):
- CPU: test_moe_lora_validator.py, test_moe_layout.py; extended
  test_lora.py with CudaGraphLoraParams MoE coverage.
- GPU: test_moe_lora_op.py - per-expert, shared-outer, min-latency
  rejection, slot-indexed-matches-per-request, multi-LoRA mixed batch,
  incomplete-input and mode-mixing rejections. The CUDA-graph
  capture+replay test is skipped pending the Phase 6 kernel patch.

Docs:
- docs/source/features/lora.md: new Routed-Expert MoE LoRA section.
- docs/source/_dev_notes/moe-lora-preflight.md: internal design memo
  (verified facts F1-F7 and per-phase implementation log).

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

Replaces load-time replication of shared-outer MoE LoRA adapters with a
kernel-side offset gate. When a side is marked shared, setupLoraWorkspace
zero-offsets that side's per-token pointer, so the adapter is read as a
single unreplicated buffer (e.g. A: [rank, in_dim], B: [out_dim, rank])
instead of the default [num_experts, ...] layout. Math is bit-identical
to the replicated path; memory is reduced by (num_experts - 1) copies of
the shared matrix.
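
The memory claim can be checked with simple element counting. The helper below is a hypothetical illustration, not code from this PR, and the sizes in the example are made up for the arithmetic.

```python
def adapter_elements(num_experts, rank, in_dim, out_dim,
                     shared_a=False, shared_b=False):
    """Count stored elements for one LoRA module across all experts."""
    a_copies = 1 if shared_a else num_experts   # A: [rank, in_dim]
    b_copies = 1 if shared_b else num_experts   # B: [out_dim, rank]
    return a_copies * rank * in_dim + b_copies * out_dim * rank


# Illustrative sizes only.
replicated = adapter_elements(8, 16, 1024, 4096)
native = adapter_elements(8, 16, 1024, 4096, shared_a=True)
saved = replicated - native   # (num_experts - 1) copies of A
```

With 8 experts and a shared A of shape [16, 1024], the native encoding stores 7 fewer copies of A, which matches the (num_experts - 1) factor stated above.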

This is Phase 6a in docs/source/_dev_notes/moe-lora-preflight.md. The
sibling Phase 6b (GPU-side per-token pointer expansion that lifts the
host-sync constraint and unblocks CUDA-graph capture) remains a separate
follow-up; the related test stays skipped with an updated reference.

Kernel (cpp/tensorrt_llm/kernels/cutlass_kernels/...):
- Six new bool fields on LoraParams (default false): fc1_shared_a/b,
  fc2_shared_a/b, gated_shared_a/b. Mirrored in the internal Cutlass
  header.
- setupLoraWorkspace reads the flags and gates the
  `weight_index * dim * lora_rank` offset for each (module, side) pair.
  Behavior with all flags false is identical to the previous code.
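
The offset gate itself is small enough to model in a few lines. This is a Python model of the indexing idea only; `lora_weight_offset` and `read_side` are hypothetical names, and a flat list stands in for the device buffer that `setupLoraWorkspace` actually addresses.

```python
def lora_weight_offset(weight_index, dim, lora_rank, shared):
    """Element offset of one expert's matrix within one adapter side."""
    return 0 if shared else weight_index * dim * lora_rank


def read_side(buffer, weight_index, dim, lora_rank, shared):
    """Return the dim * lora_rank elements the kernel would read."""
    start = lora_weight_offset(weight_index, dim, lora_rank, shared)
    return buffer[start:start + dim * lora_rank]


# A shared matrix with dim=3, rank=2, and its load-time replicated form.
shared_matrix = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75]
num_experts = 4
replicated = shared_matrix * num_experts

# Every expert reads the same values through either encoding.
same = all(
    read_side(replicated, e, 3, 2, shared=False)
    == read_side(shared_matrix, e, 3, 2, shared=True)
    for e in range(num_experts)
)
```

Because both encodings hand the kernel the same values, the bit-identical result claimed for the native path follows from the indexing alone.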

Op (cpp/tensorrt_llm/thop/moeOp.cpp):
- Six new optional bool kwargs on runMoe, threaded through
  buildMoeLoraParams to the returned LoraParams.

Python (tensorrt_llm/_torch/):
- custom_ops/torch_custom_ops.py: fused_moe schema (real + register_fake)
  extended with the six bool kwargs.
- modules/fused_moe/fused_moe_cutlass.py: _resolve_moe_shared_flags
  reads lora_params["moe_shared_flags"] and threads the dict through
  both _extract_moe_lora_tensors paths (per-request and CUDA-graph).
- peft/lora/moe_layout.py: new make_native_shared_lora helper returning
  unreplicated shapes plus matching shared flags;
  expand_native_shared_for_reference broadcasts back to [E, ...] for the
  eager reference.
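
Resolving the six flags with a default of false might look like the sketch below. `resolve_shared_flags` is a hypothetical stand-in for `_resolve_moe_shared_flags`, whose real behavior is not shown on this page; in particular, rejecting unknown keys is an assumption of this sketch.

```python
SHARED_FLAG_NAMES = (
    "fc1_shared_a", "fc1_shared_b",
    "fc2_shared_a", "fc2_shared_b",
    "gated_shared_a", "gated_shared_b",
)


def resolve_shared_flags(lora_params):
    """Return all six flags, defaulting any missing flag to False (sketch)."""
    raw = (lora_params or {}).get("moe_shared_flags") or {}
    unknown = set(raw) - set(SHARED_FLAG_NAMES)
    if unknown:
        # Assumption: a misspelled flag should fail loudly, not be ignored.
        raise KeyError(f"unknown shared flags: {sorted(unknown)}")
    return {name: bool(raw.get(name, False)) for name in SHARED_FLAG_NAMES}


flags = resolve_shared_flags({"moe_shared_flags": {"fc1_shared_a": True}})
```

Defaulting to false preserves the stated guarantee that behavior with no flags set is identical to the previous code.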

Tests (tests/unittest/_torch/lora/):
- test_moe_layout.py: native_shared shapes, seed reproducibility,
  expand-for-reference equivalence, and a fp32 bit-identity check
  between the native-shared reference delta and the replicated one.
- test_moe_lora_op.py:
  - test_moe_native_shared_outer_matches_replicated_bitidentical:
    runs the kernel twice (native with fc1_shared_a=True,
    fc2_shared_b=True, gated_shared_a=True vs. replicated baseline with
    all flags false) on the same logical adapter and asserts
    rtol=0, atol=0 on the output.
  - test_moe_native_shared_outer_differs_from_no_lora: sanity smoke for
    the native path alone (no replication on any side).

Docs:
- features/lora.md: user-facing section now describes both encodings
  (load-time replication and native shared-outer) and how to opt in.
- _dev_notes/moe-lora-preflight.md: Phase 6a marked done with the full
  implementation log; Phase 6b kept as a follow-up. Loader plumbing
  (lora_layout.json sidecar parser in lora_manager.py) deferred to a
  separate commit -- the kernel and op already accept native adapters
  via the boolean flags; the loader change is the missing piece for
  end-to-end model wiring.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

…ion tests

Adds an external-reference correctness suite for the fused MoE LoRA path,
complementing the existing same-kernel comparison tests in this module.
The earlier tests (device-vs-host, slot-vs-per-request, native-vs-replicated)
put the same kernel on both sides of the diff and so cannot detect a
reordering or misapplication of LoRA deltas; the new tests compare the op
output against a hand-written PyTorch fp32 reference.

`reference_swiglu_moe_lora` in `tensorrt_llm/_torch/peft/lora/moe_layout.py`
is the reference: top-k-routed SwiGLU MoE with three independent LoRA
modules (moe_h_to_4h / moe_gate / moe_4h_to_h), all computed in fp32
and then aggregated with topk-score weighting.
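
For orientation, the computation such a reference performs for a single token can be sketched in plain Python. This is not the `reference_swiglu_moe_lora` helper from the PR. It assumes that `moe_h_to_4h` is the up projection, `moe_gate` the gate projection, and `moe_4h_to_h` the down projection, that SwiGLU is `silu(gate) * up`, and that each LoRA module adds `B (A x)` to its base projection. The dictionary keys are hypothetical, and lists stand in for fp32 tensors.

```python
import math


def matvec(m, v):
    """Matrix-vector product over nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_linear(w, a, b, x):
    """W x + B (A x): base projection plus low-rank LoRA delta."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, delta)]


def silu(z):
    return z / (1.0 + math.exp(-z))


def expert_forward(e, x):
    """One expert: SwiGLU over LoRA-adjusted up/gate, then LoRA-adjusted down."""
    up = lora_linear(e["w_up"], e["a_up"], e["b_up"], x)
    gate = lora_linear(e["w_gate"], e["a_gate"], e["b_gate"], x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return lora_linear(e["w_down"], e["a_down"], e["b_down"], hidden)


def moe_forward(experts, x, topk_ids, topk_scores):
    """Sum the selected experts' outputs, weighted by their top-k scores."""
    out = [0.0] * len(experts[topk_ids[0]]["w_down"])
    for idx, score in zip(topk_ids, topk_scores):
        y = expert_forward(experts[idx], x)
        out = [o + score * v for o, v in zip(out, y)]
    return out
```

Writing the reference independently of the kernel is what lets it catch a swapped or misapplied delta, which a kernel-vs-kernel comparison would reproduce on both sides.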

Tests added in `tests/unittest/_torch/lora/test_moe_lora_op.py`:
* test_moe_lora_eager_matches_pytorch_reference: reference-parity baseline
  with all three LoRA modules active and distinct.
* test_moe_no_lora_eager_matches_pytorch_reference: probe with no LoRA at
  all, confirming the base FC1+SwiGLU+FC2+finalize pipeline is healthy.
* test_moe_zero_lora_eager_matches_pytorch_reference: probe with the LoRA
  pipeline fully active but adapter weights identically zero, isolating
  the pointer-expand / setupLoraWorkspace plumbing from the LoRA delta
  computation itself.
* test_moe_only_{fc1,gated,fc2}_lora_eager_matches_pytorch_reference:
  per-module probes activating exactly one LoRA module.
* test_moe_pair_{fc1_gated,fc1_fc2,gated_fc2}_distinct_lora_eager_matches_pytorch_reference:
  pairwise probes with two modules active and the third zero.
* test_moe_all_three_lora_with_gated_aliased_to_fc1_eager_matches_pytorch_reference:
  alias probe with all three modules active and gated aliased to fc1.

LoRA adapters are scaled by `_LORA_REFERENCE_SCALE = 0.25` (applied to
both A and B, so dW = B@A scales by 0.25^2 = 0.0625) so legitimate output
magnitudes stay in the O(1)-O(10) range at the test shape. Without this
scaling, N(0,1) adapter draws produce LoRA deltas many orders of magnitude
larger than the base weights and the SwiGLU+FC2 path amplifies the
intermediates to magnitudes ~1e5; at that scale bf16 reduction-order noise
on a small fraction of lanes routinely exceeds an `atol=1.0` budget without
there being any kernel correctness bug.
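
The quadratic effect of scaling both factors can be verified on a tiny example. The matrices below are made up for illustration, and plain lists stand in for the adapter tensors.

```python
def matmul(b, a):
    """Dense product of nested-list matrices: b [m, k] times a [k, n]."""
    return [[sum(b[i][k] * a[k][j] for k in range(len(a)))
             for j in range(len(a[0]))]
            for i in range(len(b))]


def scale(m, s):
    return [[s * v for v in row] for row in m]


s = 0.25                       # same value as the scale named above
a = [[1.0, 2.0, 3.0]]          # A: [rank=1, in_dim=3]
b = [[4.0], [-2.0]]            # B: [out_dim=2, rank=1]

dw = matmul(b, a)
dw_scaled = matmul(scale(b, s), scale(a, s))
# Each entry of dw_scaled is s * s = 0.0625 times the matching entry of dw.
```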

Each test prints op_max_mag / ref_max_mag / max_abs_diff / rel_err on
stdout (captured by pytest on failure), and uses
`torch.testing.assert_close(..., rtol=5e-2, atol=1.0)` as the PASS/FAIL
gate.
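
The pass/fail rule behind `torch.testing.assert_close` is elementwise: a value passes when `|actual - expected| <= atol + rtol * |expected|`. The sketch below restates that rule and the four printed diagnostics in plain Python. `closeness_report` is a hypothetical helper, and defining `rel_err` as the maximum absolute difference over the reference's maximum magnitude is an assumption, since the commit message does not define it.

```python
def closeness_report(op, ref, rtol=5e-2, atol=1.0):
    """Compute diagnostics and the elementwise tolerance gate (sketch)."""
    op_max_mag = max(abs(v) for v in op)
    ref_max_mag = max(abs(v) for v in ref)
    max_abs_diff = max(abs(o - r) for o, r in zip(op, ref))
    # Assumed definition; guarded against an all-zero reference.
    rel_err = max_abs_diff / max(ref_max_mag, 1e-12)
    passed = all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(op, ref))
    return {
        "op_max_mag": op_max_mag,
        "ref_max_mag": ref_max_mag,
        "max_abs_diff": max_abs_diff,
        "rel_err": rel_err,
        "passed": passed,
    }
```

Because `atol=1.0` dominates for small reference values, the gate is only meaningful when output magnitudes are kept in a modest range, which is the motivation given for the adapter scaling.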

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>