
Fix Double Application of Softmax for Router Logits in MoE models#45346

Open
ionut-anghelina wants to merge 2 commits into huggingface:main from ionut-anghelina:dev/ionut/FixDoubleSoftmax

Conversation

@ionut-anghelina

No description provided.

ionut-anghelina and others added 2 commits March 30, 2026 08:18
Several MoE routers applied softmax to raw logits inside forward() but
returned the result as `router_logits`. The load_balancing_loss_func then
applied softmax again, computing the aux loss on softmax(softmax(logits))
which flattens the distribution toward uniform, rendering the load-balancing
loss ineffective.

Fix: use a separate `router_probs` variable for the softmaxed values used
in top-k routing, keeping `router_logits` as raw logits so the loss
function's single softmax is correct.
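The flattening effect described above can be illustrated with a small pure-Python sketch (hypothetical logit values, not the actual Transformers code):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw router logits for 4 experts.
logits = [4.0, 1.0, 0.5, -2.0]

once = softmax(logits)   # what load_balancing_loss_func should see
twice = softmax(once)    # what the buggy path effectively computed

print(once)   # sharply peaked on expert 0
print(twice)  # squashed toward uniform (all values near 1/4)
```

Because softmax outputs lie in [0, 1], applying it a second time compresses the input range to at most 1, so the resulting distribution is much closer to uniform and the aux loss barely penalizes imbalanced routing.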

Source modular files fixed:
- mixtral/modular_mixtral.py (MixtralTopKRouter)
- qwen2_moe/modular_qwen2_moe.py (Qwen2MoeTopKRouter)
- qwen3_vl_moe/modular_qwen3_vl_moe.py (Qwen3VLMoeTextTopKRouter)

Downstream models regenerated by make fix-repo:
mixtral, minimax, qwen2_moe, olmoe, flex_olmo, qwen3_moe, qwen3_next,
qwen3_omni_moe, qwen3_vl_moe, qwen3_5_moe

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add regression tests in mixtral and qwen2_moe to verify router_logits
  are raw logits (not softmax probabilities)
- Fix .to() dtype cast to use router_logits.dtype (model dtype) instead
  of router_probs.dtype (float32)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
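The regression check described in the commit boils down to a simple property: softmax outputs are non-negative and sum to 1 along the expert dimension, while raw logits almost never satisfy both. A minimal pure-Python sketch of such a check (names and values are hypothetical; the actual tests operate on torch tensors from the model output):

```python
import math

def looks_like_probabilities(row, tol=1e-4):
    # Heuristic behind the regression test: a softmaxed row is
    # non-negative and sums to ~1; raw logits generally are not.
    return all(v >= 0.0 for v in row) and abs(sum(row) - 1.0) < tol

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

raw_logits = [2.3, -0.7, 0.1, -1.9]  # hypothetical raw router output

# After the fix, returned router_logits should fail this check,
# while their softmax should pass it.
assert not looks_like_probabilities(raw_logits)
assert looks_like_probabilities(softmax(raw_logits))
```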
@github-actions

github-actions bot commented Apr 9, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: flex_olmo, minimax, mixtral, olmoe, qwen2_moe, qwen3_5_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl_moe

@@ -89,6 +89,14 @@ def test_load_balancing_loss(self):
self.assertEqual(result.router_logits[0].shape, (91, config.num_local_experts))
torch.testing.assert_close(result.aux_loss.cpu(), torch.tensor(2, dtype=torch.float32), rtol=1e-2, atol=1e-2)

# Verify router_logits are raw logits, not softmax probabilities (regression test for double-softmax bug)

IIRC, we have more appearances of that test in other models. It doesn't hurt to add it to all of them, and maybe make it a generalized one in the causal LM tester (because we now have ways to properly detect MoEs with the interface).

@vasqu

vasqu commented Apr 9, 2026

@Rocketknight1 I'm not sure about the current state here, so I just left a comment since this seemed like the most recent state of things. Let me know if not, or where I should properly look.

@vasqu

vasqu commented Apr 9, 2026

Let's also add the `fixes` and `closes` statements for the issue and the other PR, please.

