fix(pt): HybridMuon ZeRO checkpoint slowdown by OutisLi · Pull Request #5525 · deepmodeling/deepmd-kit

OutisLi · 2026-06-13T03:57:18Z

Summary

- Keep HybridMuon name-based routing metadata out of optimizer `param_groups`.
- Inject runtime parameter names after optimizer construction, including the ZeRO-1 inner optimizer path.
- Prevent `ZeroRedundancyOptimizer.consolidate_state_dict()` from gathering a duplicate model-sized `named_parameters` object graph.
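
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the flow described above, written without PyTorch. `Param`, `SketchOptimizer`, `ShardedWrapper`, and `build_optimizer` are hypothetical stand-ins, not the deepmd-kit or `torch.distributed` classes, and parameter sharding is left out. The only behaviour taken from this PR is that names are injected after construction and, in the ZeRO-1 path, go to the inner optimizer.

```python
class Param:
    """Stand-in for a tensor parameter; only object identity matters here."""


class SketchOptimizer:
    """Hypothetical optimizer that keeps routing names outside param_groups."""

    def __init__(self, params):
        # param_groups holds only what should be serialized with the optimizer.
        self.param_groups = [{"params": list(params), "lr": 1e-3}]
        self._param_names = {}

    def set_param_names(self, named_parameters):
        # Runtime-only metadata, never written into param_groups.
        self._param_names = {id(p): name for name, p in named_parameters}


class ShardedWrapper:
    """Hypothetical ZeRO-1 style wrapper exposing its inner optimizer as .optim."""

    def __init__(self, params, optimizer_class):
        self.optim = optimizer_class(params)
        self.param_groups = self.optim.param_groups


def build_optimizer(named_parameters, zero_stage):
    named_parameters = list(named_parameters)  # capture once, reuse below
    params = [p for _, p in named_parameters]
    if zero_stage == 1:
        optimizer = ShardedWrapper(params, SketchOptimizer)
        target = optimizer.optim  # names go to the inner optimizer
    else:
        optimizer = SketchOptimizer(params)
        target = optimizer
    target.set_param_names(named_parameters)
    return optimizer, target


named = [("layer.weight", Param()), ("layer.bias", Param())]
optimizer, target = build_optimizer(named, zero_stage=1)
```

Because the names live in a private attribute rather than in `param_groups`, anything that walks `param_groups` to serialize or gather optimizer state never sees them.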

Test plan

- Ran 2-GPU ZeRO-1 short DPA4/SeZM training with HybridMuon.
- Verified checkpoint save time dropped from ~8.5s to ~3.0s in the short test.
- Verified generated checkpoints no longer contain `optimizer.param_groups[0]["named_parameters"]`.
- Verified restart from the patched checkpoint continues training and saves a valid checkpoint.
- Checked linter diagnostics for modified files.
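
The checkpoint check in the test plan could be scripted roughly as follows. This is a sketch over plain dictionaries: `groups_with_key` is a hypothetical helper, and the two sample states are hand-written stand-ins rather than real checkpoints. With a real file, the optimizer state would first have to be loaded and located inside the checkpoint, and that layout is an assumption not confirmed here.

```python
def groups_with_key(optimizer_state, key="named_parameters"):
    """Return the indices of param_groups that still carry `key`."""
    groups = optimizer_state.get("param_groups", [])
    return [i for i, group in enumerate(groups) if key in group]


# Hand-written stand-ins for the optimizer entry of a checkpoint.
old_style = {
    "state": {},
    "param_groups": [
        {"lr": 1e-3, "params": [0, 1], "named_parameters": [("w", 0), ("b", 1)]}
    ],
}
new_style = {
    "state": {},
    "param_groups": [{"lr": 1e-3, "params": [0, 1]}],
}

stale_groups_old = groups_with_key(old_style)
stale_groups_new = groups_with_key(new_style)
```

An empty result for a patched checkpoint corresponds to the third item in the test plan.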

Summary by CodeRabbit

Refactor
- Restructured parameter name registration in the HybridMuon optimizer for improved flexibility during training setup.

coderabbitai · 2026-06-13T04:02:06Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between 5d94bd6 and 7fc5ea5.

📒 Files selected for processing (2)

deepmd/pt/optimizer/hybrid_muon.py
deepmd/pt/train/training.py

📝 Walkthrough

The PR refactors parameter naming registration in HybridMuonOptimizer by extracting it into a deferred set_param_names() method. Training code now captures named parameters once, removes the inline passing to the optimizer constructor, and applies names after instantiation using the new method.

Changes

HybridMuon Parameter Naming Refactor

Layer / File(s)	Summary
HybridMuonOptimizer set_param_names method `deepmd/pt/optimizer/hybrid_muon.py`	New `set_param_names()` method builds `id(param) -> name` mapping and resets `_routing_built`; `__init__` delegates to this method instead of inline logic.
Training parameter registration flow `deepmd/pt/train/training.py`	Captures named parameters from wrapper, removes the `named_parameters` kwarg passed to optimizer construction, and calls `set_param_names()` post-instantiation on the correct optimizer based on `zero_stage`.
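
A minimal sketch of the deferred-registration pattern summarized in the table, assuming nothing about the real class beyond what the walkthrough states (an `id(param) -> name` mapping, a `_routing_built` flag that is reset, and `__init__` delegating to `set_param_names()`). `RoutingSketch`, `route_of`, and the routing rule itself are invented for illustration and are not HybridMuon's actual routing logic.

```python
class Param:
    """Stand-in for a tensor parameter; only object identity matters here."""


class RoutingSketch:
    def __init__(self, params, named_parameters=None):
        self.params = list(params)
        self._routes = {}
        # __init__ delegates to the same method used for deferred registration.
        self.set_param_names(named_parameters)

    def set_param_names(self, named_parameters):
        pairs = list(named_parameters or ())
        self._param_names = {id(p): name for name, p in pairs}
        self._routing_built = False  # force a rebuild on next use

    def _build_routing(self):
        # Illustrative rule only: route by whether a name is known and
        # whether it ends in "bias".
        self._routes = {}
        for p in self.params:
            name = self._param_names.get(id(p))
            if name is None:
                self._routes[id(p)] = "unnamed"
            elif name.endswith("bias"):
                self._routes[id(p)] = "bias_rule"
            else:
                self._routes[id(p)] = "default_rule"
        self._routing_built = True

    def route_of(self, param):
        if not self._routing_built:
            self._build_routing()
        return self._routes[id(param)]


w, b = Param(), Param()
opt = RoutingSketch([w, b])  # constructed without names
before = opt.route_of(w)
opt.set_param_names([("net.weight", w), ("net.bias", b)])
after = (opt.route_of(w), opt.route_of(b))
```

Resetting the flag is what makes late registration safe: any routing computed before the names arrived is discarded and rebuilt on the next lookup.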

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly addresses the main change: fixing a HybridMuon ZeRO checkpoint slowdown by reorganizing how parameter names are handled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

codecov · 2026-06-13T04:46:55Z

Codecov Report

❌ Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.19%. Comparing base (5d94bd6) to head (7fc5ea5).

Files with missing lines	Patch %	Lines
deepmd/pt/train/training.py	50.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5525      +/-   ##
==========================================
- Coverage   82.19%   82.19%   -0.01%     
==========================================
  Files         891      891              
  Lines      101599   101605       +6     
  Branches     4242     4240       -2     
==========================================
+ Hits        83507    83510       +3     
- Misses      16789    16791       +2     
- Partials     1303     1304       +1

fix(pt): HybridMuon ZeRO checkpoint slowdown

7fc5ea5

OutisLi requested a review from njzjz June 13, 2026 03:57

dosubot Bot added the bug label Jun 13, 2026

github-actions Bot added the Python label Jun 13, 2026

njzjz approved these changes Jun 13, 2026

OutisLi added this pull request to the merge queue Jun 16, 2026

njzjz removed this pull request from the merge queue due to the queue being cleared Jun 16, 2026

njzjz added this pull request to the merge queue Jun 16, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 16, 2026

njzjz added this pull request to the merge queue Jun 16, 2026

Merged via the queue into deepmodeling:master with commit 0dbb2a7 Jun 16, 2026
73 checks passed

OutisLi deleted the pr/muon branch June 18, 2026 05:57

njzjz merged 1 commit into deepmodeling:master from OutisLi:pr/muon
