fix(pt): HybridMuon ZeRO checkpoint slowdown#5525

Merged: njzjz merged 1 commit into deepmodeling:master from OutisLi:pr/muon on Jun 16, 2026

@OutisLi OutisLi commented Jun 13, 2026

Collaborator

Summary

  • Keep HybridMuon name-based routing metadata out of optimizer param_groups.
  • Inject runtime parameter names after optimizer construction, including the ZeRO-1 inner optimizer path.
  • Prevent ZeroRedundancyOptimizer.consolidate_state_dict() from gathering a duplicate model-sized named_parameters object graph.
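
The effect described above can be illustrated with a minimal, torch-free sketch. `ToyOptimizer` is a hypothetical stand-in, not the real HybridMuon class; it only mimics the convention that everything stored in `param_groups` ends up in `state_dict()`, while plain instance attributes do not. The attribute names `_param_names` and `_routing_built` follow the walkthrough in this PR; everything else is an assumption made for illustration.

```python
class ToyOptimizer:
    """Torch-free stand-in for an optimizer; not the real HybridMuon class."""

    def __init__(self, params, lr=1e-3, **extra_defaults):
        # As with torch.optim.Optimizer, every default lands in each param group.
        self.param_groups = [{"params": list(params), "lr": lr, **extra_defaults}]
        # Routing metadata lives on the instance, outside param_groups.
        self._param_names = {}
        self._routing_built = False

    def set_param_names(self, named_parameters):
        # Map id(param) -> name and mark routing as stale so it is rebuilt later.
        self._param_names = {id(p): name for name, p in named_parameters}
        self._routing_built = False

    def state_dict(self):
        # Only param_groups are serialized; parameters are replaced by indices.
        groups = []
        for group in self.param_groups:
            packed = {k: v for k, v in group.items() if k != "params"}
            packed["params"] = list(range(len(group["params"])))
            groups.append(packed)
        return {"state": {}, "param_groups": groups}


# Lists stand in for parameter tensors in this sketch.
named = [("layer.weight", [0.0] * 4), ("layer.bias", [0.0])]
params = [p for _, p in named]

# Before: names passed as a constructor kwarg leak into the serialized groups.
leaky = ToyOptimizer(params, named_parameters=named)

# After: names injected post-construction stay out of the serialized groups.
clean = ToyOptimizer(params)
clean.set_param_names(named)

leaky_keys = set(leaky.state_dict()["param_groups"][0])
clean_keys = set(clean.state_dict()["param_groups"][0])
```

In this sketch `leaky_keys` contains `"named_parameters"` and `clean_keys` does not, which is the property the patch relies on to keep the extra object graph out of the consolidated ZeRO state.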

Test plan

  • Ran 2-GPU ZeRO-1 short DPA4/SeZM training with HybridMuon.
  • Verified checkpoint save time dropped from ~8.5s to ~3.0s in the short test.
  • Verified generated checkpoints no longer contain optimizer.param_groups[0]["named_parameters"].
  • Verified restart from the patched checkpoint continues training and saves a valid checkpoint.
  • Checked linter diagnostics for modified files.
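
The checkpoint check in the test plan could be automated along these lines. This is a sketch under assumptions: `leaked_group_keys` is a hypothetical helper, and it assumes the optimizer state follows the usual `{"state": ..., "param_groups": [...]}` layout; how that state is stored inside the checkpoint file is not shown in this PR.

```python
def leaked_group_keys(optimizer_state, forbidden=("named_parameters",)):
    """Return (group_index, key) pairs for forbidden keys found in param_groups."""
    leaks = []
    for index, group in enumerate(optimizer_state.get("param_groups", [])):
        for key in forbidden:
            if key in group:
                leaks.append((index, key))
    return leaks


# Hand-written stand-ins for optimizer state loaded from a checkpoint.
patched_state = {
    "state": {},
    "param_groups": [{"lr": 1e-3, "params": [0, 1]}],
}
unpatched_state = {
    "state": {},
    "param_groups": [
        {"lr": 1e-3, "params": [0, 1], "named_parameters": [("w", None)]}
    ],
}

patched_leaks = leaked_group_keys(patched_state)
unpatched_leaks = leaked_group_keys(unpatched_state)
```

With a real checkpoint, the state dictionary would first be loaded (for example with `torch.load(path, map_location="cpu")`) and the optimizer entry passed to the helper; an empty result means no routing metadata leaked into `param_groups`.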

Summary by CodeRabbit

  • Refactor
    • Restructured parameter name registration in the HybridMuon optimizer for improved flexibility during training setup.

@OutisLi OutisLi requested a review from njzjz June 13, 2026 03:57
@dosubot dosubot Bot added the bug label Jun 13, 2026

coderabbitai Bot commented Jun 13, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 15bb779a-3a92-41f6-af16-edc76b6727bb

📥 Commits

Reviewing files that changed from the base of the PR and between 5d94bd6 and 7fc5ea5.

📒 Files selected for processing (2)
  • deepmd/pt/optimizer/hybrid_muon.py
  • deepmd/pt/train/training.py

Walkthrough

The PR refactors parameter-name registration in HybridMuonOptimizer by extracting it into a set_param_names() method that can be called after construction. Training code now captures the named parameters once, no longer passes them to the optimizer constructor, and applies them after instantiation through the new method.

Changes

HybridMuon Parameter Naming Refactor

| Layer / File(s) | Summary |
| --- | --- |
| HybridMuonOptimizer set_param_names method (deepmd/pt/optimizer/hybrid_muon.py) | New set_param_names() method builds id(param) -> name mapping and resets _routing_built; __init__ delegates to this method instead of inline logic. |
| Training parameter registration flow (deepmd/pt/train/training.py) | Captures named parameters from wrapper, removes the named_parameters kwarg passed to optimizer construction, and calls set_param_names() post-instantiation on the correct optimizer based on zero_stage. |
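
The registration flow summarized above might look roughly like the following sketch. All classes here are hypothetical stand-ins: `ZeroWrapper` imitates a ZeRO-1 wrapper that exposes its local optimizer as `.optim` (as `ZeroRedundancyOptimizer` does), and the `zero_stage == 1` condition is an assumption, since the exact check in training.py is not shown in this PR.

```python
class InnerOptimizer:
    """Stand-in for the optimizer that owns the name-based routing."""

    def __init__(self):
        self._param_names = {}
        self._routing_built = True  # pretend routing was already built once

    def set_param_names(self, named_parameters):
        # Rebuild the id(param) -> name map and invalidate cached routing.
        self._param_names = {id(p): name for name, p in named_parameters}
        self._routing_built = False


class ZeroWrapper:
    """Stand-in for a ZeRO-1 wrapper holding the local optimizer in .optim."""

    def __init__(self, inner):
        self.optim = inner


def register_param_names(optimizer, named_parameters, zero_stage):
    # Assumption: under ZeRO-1 the names must reach the wrapped optimizer,
    # otherwise the optimizer itself is the target.
    target = optimizer.optim if zero_stage == 1 else optimizer
    target.set_param_names(named_parameters)
    return target


# Capture the (name, parameter) pairs once; objects stand in for tensors.
named = [("w", object()), ("b", object())]

inner = InnerOptimizer()
wrapped = ZeroWrapper(inner)
zero_target = register_param_names(wrapped, named, zero_stage=1)

plain = InnerOptimizer()
plain_target = register_param_names(plain, named, zero_stage=0)
```

Materializing the named parameters into a list before the call matters in practice because `named_parameters()` on a module returns a generator, which can only be consumed once.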

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main change: fixing a HybridMuon ZeRO checkpoint slowdown by reorganizing how parameter names are handled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov Bot commented Jun 13, 2026


Codecov Report

❌ Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.19%. Comparing base (5d94bd6) to head (7fc5ea5).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/pt/train/training.py | 50.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5525      +/-   ##
==========================================
- Coverage   82.19%   82.19%   -0.01%     
==========================================
  Files         891      891              
  Lines      101599   101605       +6     
  Branches     4242     4240       -2     
==========================================
+ Hits        83507    83510       +3     
- Misses      16789    16791       +2     
- Partials     1303     1304       +1     

@OutisLi OutisLi added this pull request to the merge queue Jun 16, 2026
@njzjz njzjz removed this pull request from the merge queue due to the queue being cleared Jun 16, 2026
@njzjz njzjz added this pull request to the merge queue Jun 16, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 16, 2026
@njzjz njzjz added this pull request to the merge queue Jun 16, 2026
Merged via the queue into deepmodeling:master with commit 0dbb2a7 Jun 16, 2026
73 checks passed
@OutisLi OutisLi deleted the pr/muon branch June 18, 2026 05:57
2 participants