
Add AI written qwen3_moe example#2887

Open
skyw wants to merge 8 commits into NVIDIA:main from skyw:vibe_qwen3

Conversation


@skyw skyw commented Apr 15, 2026

Description

An almost pure TE-module implementation of the Qwen3 MoE model.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Add a Qwen3 MoE model implemented using TE modules only
  • Add a simple test that matches outputs against the HF counterpart

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

skyw added 4 commits April 15, 2026 11:16
Signed-off-by: Hao Wu <skyw@nvidia.com>
@ksivaman ksivaman self-requested a review April 15, 2026 18:39
@greptile-apps
Contributor

greptile-apps bot commented Apr 15, 2026

Greptile Summary

This PR adds a new examples/pytorch/qwen3_moe/ directory with a single-GPU Qwen3 MoE implementation built entirely from TransformerEngine modules (te.MultiheadAttention, te.RMSNorm, te_ops.GroupedLinear/SwiGLU, te.moe_permute_with_probs/moe_unpermute) and a forward+backward numerical comparison test against HuggingFace. The architecture mapping is faithful to the HF reference and the TE API is used correctly throughout.
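For intuition, the router step summarized above (softmax over expert logits, then top-k selection) can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the PR: the function name is invented, and whether the selected probabilities are renormalized after top-k is a config detail omitted here.

```python
import math

def topk_route(logits, k):
    """Softmax over one token's expert logits, then pick the top-k experts.

    Returns (expert_ids, probs), where probs are the softmax weights of the
    chosen experts. Renormalization of the top-k probs is left out for brevity.
    """
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    softmax = [e / total for e in exps]
    # Rank experts by routing probability, keep the k best.
    ranked = sorted(range(len(logits)), key=lambda i: softmax[i], reverse=True)
    chosen = ranked[:k]
    return chosen, [softmax[i] for i in chosen]

# One token's router logits over 4 experts, top-2 routing.
experts, probs = topk_route([2.0, 0.5, 1.0, -1.0], k=2)
```

In the actual example these per-token choices are produced in batch form (merging probs, a routing map, and per-expert token counts) so they can feed `te.moe_permute_with_probs`.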

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/documentation issues that do not affect runtime behavior.

The model implementation and weight-copy logic are correct for the default configuration (fuse_qkv_params=False). The two P2 findings are a misleading code comment and a dead code branch that is never reached with the current model setup. No P0 or P1 issues remain unaddressed.

examples/pytorch/qwen3_moe/test_vs_hf.py — dead "qkv" weight-copy branch (lines 106-109) should be removed or corrected before fuse_qkv_params=True is ever used.

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/pytorch/qwen3_moe/config.py | Frozen dataclass with HF-compatible Qwen3MoeConfig defaults; clean and straightforward. |
| examples/pytorch/qwen3_moe/model.py | Complete TE module implementation mapping HF Qwen3 MoE to TE equivalents; one misleading comment about CPU sync (P2). |
| examples/pytorch/qwen3_moe/test_vs_hf.py | Forward/backward weight-mapping test; contains a dead "qkv" weight-copy branch with incorrect GQA interleaved layout (P2), plus the already-flagged no-op data.copy_() on backward logits. |
| examples/pytorch/qwen3_moe/README.md | Clear module-mapping table and usage instructions; no issues. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["input_ids (B, S)"] --> B["embed_tokens → hidden_states (B, S, H)"]
    B --> C["RotaryPositionEmbedding → freqs"]
    C --> D{"For each DecoderLayer"}
    D --> E["residual = hidden_states"]
    E --> F["te.MultiheadAttention\n(fused LN + QKV + QK-norm + RoPE + attn + O-proj)"]
    F --> G["hidden_states = residual + attn_out"]
    G --> H["residual = hidden_states"]
    H --> I["te.RMSNorm (post_attention_layernorm)"]
    I --> J["Qwen3MoeBlock"]
    subgraph MoE ["Qwen3MoeBlock"]
        J1["hidden_flat (T, H)"] --> J2["Qwen3MoeRouter\n(softmax + top-k)"]
        J2 --> J3["merging_probs, routing_map,\ntokens_per_expert, router_logits"]
        J3 --> J4["te.moe_permute_with_probs\n→ permuted_input (T*k, H)"]
        J4 --> J5["te_ops.Sequential\nGroupedLinear → SwiGLU → GroupedLinear"]
        J5 --> J6["te.moe_unpermute\n→ output (T, H)"]
    end
    J --> J1
    J6 --> K["hidden_states = residual + moe_out"]
    K --> D
    D --> L["te.RMSNorm (final norm)"]
    L --> M["te.Linear (lm_head) → logits (B, S, V)"]
```
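The permute → grouped MLP → unpermute path in the flowchart can be illustrated with a tiny framework-free sketch. Names are illustrative (the real example uses `te.moe_permute_with_probs` and `te.moe_unpermute`), and the "experts" here are the identity function so the round trip is easy to follow; hidden size is 1 for brevity.

```python
def permute_by_expert(tokens, routing):
    """routing[t] = list of (expert_id, prob) pairs for token t.

    Returns token rows grouped by expert id (so each expert sees a contiguous
    block, as a grouped GEMM requires), plus bookkeeping to undo it.
    """
    entries = []  # (expert_id, token_idx, prob)
    for t, pairs in enumerate(routing):
        for e, p in pairs:
            entries.append((e, t, p))
    entries.sort(key=lambda x: x[0])  # stable sort: group rows by expert
    permuted = [tokens[t] for _, t, _ in entries]
    return permuted, entries

def unpermute(expert_out, entries, num_tokens):
    """Scatter expert outputs back to token order, weighted by router prob."""
    out = [0.0] * num_tokens
    for row, (_, t, p) in zip(expert_out, entries):
        out[t] += p * row
    return out

tokens = [1.0, 10.0]                      # two tokens, hidden size 1
routing = [[(0, 0.75), (1, 0.25)],        # token 0 → experts 0 and 1
           [(1, 1.0)]]                    # token 1 → expert 1 only
permuted, entries = permute_by_expert(tokens, routing)
# Identity "experts": output = input, so unpermute yields the prob-weighted sum.
merged = unpermute(permuted, entries, num_tokens=2)
```

With identity experts and probabilities summing to 1 per token, each token comes back unchanged, which is exactly the invariant the permute/unpermute pair must preserve around the real GroupedLinear → SwiGLU → GroupedLinear stack.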

Reviews (4). Last reviewed commit: "Merge branch 'main' into vibe_qwen3"

Comment thread examples/pytorch/qwen3_moe/test_vs_hf.py Outdated
Comment thread examples/pytorch/qwen3_moe/test_vs_hf.py
skyw and others added 4 commits April 15, 2026 12:30
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@nvidia.com>
