fix(pt): stop plain pt dp test from eager-loading pt_expt custom-op fakes#5542
Conversation
…akes deepmd.pt.infer.deep_eval imports the vesin neighbor list from deepmd.pt_expt.utils (added in deepmodeling#5491). That package __init__ eagerly imported tabulate_ops, which registers fake tensor impls for the compressed tabulate custom ops at import time. On the plain pt (torch.jit) backend without the C++ op library, the pt descriptor fallbacks monkeypatch a plain Python function onto torch.ops.deepmd.<op>, so the bare hasattr guard passes but register_fake raises "operator deepmd::tabulate_fusion_se_a does not exist", crashing `dp test`. Fix: - Drop the eager tabulate_ops import from pt_expt/utils/__init__.py. The only consumer that needs the fakes (the compression entry point) already calls ensure_fake_registered() lazily, so plain pt inference no longer triggers any custom-op registration. - Harden ensure_fake_registered(): guard each op with a real OpOverloadPacket check (_op_exists) instead of bare hasattr, so a monkeypatched plain-function fallback is skipped rather than crashing. Remove the import-time auto-call. Tests (source/tests/pt_expt/utils/test_tabulate_ops_lazy.py): - subprocess import of deepmd.pt.infer.deep_eval asserts tabulate_ops/ comm are not eagerly imported. - ensure_fake_registered() with a monkeypatched plain-function op present must skip it without raising (the exact dp test crash).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthrough
ChangesLazy tabulate_ops fake-op registration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
source/tests/pt_expt/utils/test_tabulate_ops_lazy.py (1)
56-60: ⚡ Quick winAdd a timeout to the subprocess invocation.
This regression test can hang indefinitely if the child import blocks; adding a bounded timeout makes CI failure deterministic.
Suggested patch
result = subprocess.run( [sys.executable, "-c", code], capture_output=True, text=True, + timeout=30, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/tests/pt_expt/utils/test_tabulate_ops_lazy.py` around lines 56 - 60, The subprocess.run call in the regression test does not have a timeout parameter, which can cause the test to hang indefinitely if the child import blocks. Add a timeout parameter to the subprocess.run invocation with an appropriate value in seconds to ensure that if the subprocess hangs, the test fails deterministically rather than blocking CI indefinitely. This will make debugging hanging processes much easier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/tests/pt_expt/utils/test_tabulate_ops_lazy.py`:
- Around line 81-108: The test only saves and restores the registration state of
a single operation `qualname`, but ensure_fake_registered() may modify multiple
operations in the tabulate_ops._registered set, causing state leakage between
tests. Replace the current approach with a snapshot-and-restore pattern: save a
deep copy of the entire tabulate_ops._registered set before the try block
(instead of just tracking was_registered for qualname), then restore the
complete snapshot in the finally block by assigning it back to
tabulate_ops._registered. This ensures all operations touched by
ensure_fake_registered() are properly cleaned up regardless of how many
operations the function modifies.
---
Nitpick comments:
In `@source/tests/pt_expt/utils/test_tabulate_ops_lazy.py`:
- Around line 56-60: The subprocess.run call in the regression test does not
have a timeout parameter, which can cause the test to hang indefinitely if the
child import blocks. Add a timeout parameter to the subprocess.run invocation
with an appropriate value in seconds to ensure that if the subprocess hangs, the
test fails deterministically rather than blocking CI indefinitely. This will
make debugging hanging processes much easier.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 54dfdbcb-bf91-4430-8dac-58ba2f2f3273
📒 Files selected for processing (3)
deepmd/pt_expt/utils/__init__.pydeepmd/pt_expt/utils/tabulate_ops.pysource/tests/pt_expt/utils/test_tabulate_ops_lazy.py
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90faa0a494
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Restore the entire tabulate_ops._registered set in the finally block rather than just the single op under test: ensure_fake_registered() may touch multiple op names, so per-op restore could leak module-global state across tests. Addresses CodeRabbit review on deepmodeling#5542.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #5542 +/- ##
=======================================
Coverage 82.21% 82.21%
=======================================
Files 892 892
Lines 101531 101530 -1
Branches 4240 4240
=======================================
+ Hits 83475 83476 +1
+ Misses 16753 16751 -2
Partials 1303 1303 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Restore the entire tabulate_ops._registered set in the finally block rather than just the single op under test: ensure_fake_registered() may touch multiple op names, so per-op restore could leak module-global state across tests. Addresses CodeRabbit review on deepmodeling#5542.
Problem
dp teston the plain pt (torch.jit) backend crashes at import time in environments without the C++ custom op library (libdeepmd_op_pt.so):Root cause
deepmd.pt.infer.deep_evalimports the vesin neighbor list fromdeepmd.pt_expt.utils(added in #5491). That package__init__eagerly importedtabulate_ops, which registers fake tensor impls for the compressed tabulate custom ops at import time.When the C++ op library is absent, the pt descriptor fallbacks (e.g.
deepmd/pt/model/descriptor/se_a.py) monkeypatch a plain Python function ontotorch.ops.deepmd.<op>. That makes the barehasattr(torch.ops.deepmd, "tabulate_fusion_se_a")guard returnTrue, but the op is not a real dispatcher op, soregister_fakeraisesoperator ... does not exist— and_try_register_fakeonly caught the "already registered" RuntimeError, so it propagated and crashed the import.Before #5491 the plain pt path never imported
tabulate_ops, so this never ran.Fix
deepmd/pt_expt/utils/__init__.py: drop the eagertabulate_opsimport. The only consumer that genuinely needs the fakes — the compression entry point (deepmd/pt_expt/entrypoints/compress.py) — already callsensure_fake_registered()lazily, so plain pt inference no longer triggers any custom-op registration.deepmd/pt_expt/utils/tabulate_ops.py: guard each op with a realOpOverloadPacketcheck (_op_exists) instead of barehasattr, so a monkeypatched plain-function fallback is skipped rather than crashing. Remove the import-time auto-call.Tests
source/tests/pt_expt/utils/test_tabulate_ops_lazy.py:deepmd.pt.infer.deep_evalassertstabulate_ops/commare not eagerly imported (guards the regression).ensure_fake_registered()with a monkeypatched plain-function op present must skip it without raising (the exactdp testcrash).Both verified to fail against the pre-fix code and pass with the fix. Full
source/tests/pt_expt/utils/suite (90 tests) passes; compression path (_op_exists→ real op → fakes registered) confirmed unchanged.Summary by CodeRabbit
Release Notes
Bug Fixes
Tests