fix(pt): stop plain pt dp test from eager-loading pt_expt custom-op fakes #5542

Merged
iProzd merged 2 commits into deepmodeling:master from wanghan-iapcm:fix-pt-eager-tabulate-import
Jun 17, 2026
@wanghan-iapcm wanghan-iapcm commented Jun 16, 2026

Collaborator

Problem

dp test on the plain pt (torch.jit) backend crashes at import time in environments without the C++ custom op library (libdeepmd_op_pt.so):

File ".../deepmd/pt/infer/deep_eval.py", line 77, in <module>
    from deepmd.pt_expt.utils.vesin_neighbor_list import (
File ".../deepmd/pt_expt/utils/__init__.py", line 30, in <module>
    from deepmd.pt_expt.utils import tabulate_ops  # noqa: F401
File ".../deepmd/pt_expt/utils/tabulate_ops.py", line 136, in <module>
    ensure_fake_registered()
...
RuntimeError: operator deepmd::tabulate_fusion_se_a does not exist

Root cause

deepmd.pt.infer.deep_eval imports the vesin neighbor list from deepmd.pt_expt.utils (added in #5491). That package __init__ eagerly imported tabulate_ops, which registers fake tensor impls for the compressed tabulate custom ops at import time.

When the C++ op library is absent, the pt descriptor fallbacks (e.g. deepmd/pt/model/descriptor/se_a.py) monkeypatch a plain Python function onto torch.ops.deepmd.<op>. The bare hasattr(torch.ops.deepmd, "tabulate_fusion_se_a") guard therefore returns True even though the attribute is not a real dispatcher op, and register_fake raises "operator ... does not exist". Because _try_register_fake only caught the "already registered" RuntimeError, this error propagated and crashed the import.

Before #5491 the plain pt path never imported tabulate_ops, so this never ran.
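
The guard failure described above can be illustrated without PyTorch. This is a minimal sketch under stated assumptions: `StandInOpPacket` is a hypothetical stand-in for torch._ops.OpOverloadPacket, and the `SimpleNamespace` objects stand in for torch.ops.deepmd. None of these names come from the deepmd code base.

```python
from types import SimpleNamespace


class StandInOpPacket:
    """Hypothetical stand-in for a real dispatcher op object."""

    def __init__(self, qualname):
        self.qualname = qualname


def python_fallback(*args):
    """Plain function, like the fallback monkeypatched in when the C++ library is absent."""
    raise NotImplementedError("C++ op library is not loaded")


def bare_guard(namespace, name):
    # Pre-fix style check: only asks whether the attribute exists.
    return hasattr(namespace, name)


def strict_guard(namespace, name):
    # Post-fix style check: the attribute must also be a real op object.
    return isinstance(getattr(namespace, name, None), StandInOpPacket)


# Stand-in for torch.ops.deepmd without the C++ library (fallback patched in).
ops_without_lib = SimpleNamespace(tabulate_fusion_se_a=python_fallback)
# Stand-in for torch.ops.deepmd with the C++ library loaded.
ops_with_lib = SimpleNamespace(
    tabulate_fusion_se_a=StandInOpPacket("deepmd::tabulate_fusion_se_a")
)

results = {
    "bare_without_lib": bare_guard(ops_without_lib, "tabulate_fusion_se_a"),
    "strict_without_lib": strict_guard(ops_without_lib, "tabulate_fusion_se_a"),
    "strict_with_lib": strict_guard(ops_with_lib, "tabulate_fusion_se_a"),
}
print(results)
```

The bare check accepts the fallback, which is the situation that let register_fake run against a non-existent operator. The type-based check rejects it while still accepting a real op.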

Fix

  1. deepmd/pt_expt/utils/__init__.py: drop the eager tabulate_ops import. The only consumer that genuinely needs the fakes — the compression entry point (deepmd/pt_expt/entrypoints/compress.py) — already calls ensure_fake_registered() lazily, so plain pt inference no longer triggers any custom-op registration.
  2. deepmd/pt_expt/utils/tabulate_ops.py: guard each op with a real OpOverloadPacket check (_op_exists) instead of bare hasattr, so a monkeypatched plain-function fallback is skipped rather than crashing. Remove the import-time auto-call.

Tests

source/tests/pt_expt/utils/test_tabulate_ops_lazy.py:

  • subprocess import of deepmd.pt.infer.deep_eval asserts tabulate_ops / comm are not eagerly imported (guards the regression).
  • ensure_fake_registered() with a monkeypatched plain-function op present must skip it without raising (the exact dp test crash).
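
The subprocess-based check in the first bullet can be sketched as follows. This is not the test from the PR: it builds a throwaway package named `demo_pkg` (hypothetical) in a temporary directory so that it runs without deepmd installed, and `heavy_imported_by_light` is an illustrative helper name. A fresh interpreter is used so that modules already imported by the test process cannot mask an eager import.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def heavy_imported_by_light(light_body):
    """Return True if importing demo_pkg.light also imports demo_pkg.heavy."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "heavy.py").write_text("SIDE_EFFECT = True\n")
        (pkg / "light.py").write_text(light_body)
        # The child process inspects its own sys.modules after the import.
        code = (
            f"import sys; sys.path.insert(0, {tmp!r}); "
            "import demo_pkg.light; "
            "print('demo_pkg.heavy' in sys.modules)"
        )
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=30,  # bounded, so a blocked import fails instead of hanging
        )
    assert result.returncode == 0, result.stderr
    return result.stdout.strip() == "True"


# Lazy variant: the heavy module is imported only inside a function body.
lazy = heavy_imported_by_light(
    "def use():\n    from demo_pkg import heavy\n    return heavy\n"
)
# Eager variant: the heavy module is imported at module import time.
eager = heavy_imported_by_light("from demo_pkg import heavy\n")
print(lazy, eager)
```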

Both verified to fail against the pre-fix code and pass with the fix. Full source/tests/pt_expt/utils/ suite (90 tests) passes; compression path (_op_exists → real op → fakes registered) confirmed unchanged.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved lazy custom operation registration when the C++ custom-op library isn’t available.
    • Improved detection logic to avoid treating Python fallbacks as real registered operations.
    • Reduced import-time side effects by deferring registration until execution paths require it.
  • Tests

    • Added regression coverage to ensure no eager imports occur and that fake-op registration safely skips when real ops are missing.

…akes

deepmd.pt.infer.deep_eval imports the vesin neighbor list from
deepmd.pt_expt.utils (added in deepmodeling#5491). That package __init__ eagerly
imported tabulate_ops, which registers fake tensor impls for the
compressed tabulate custom ops at import time. On the plain pt
(torch.jit) backend without the C++ op library, the pt descriptor
fallbacks monkeypatch a plain Python function onto torch.ops.deepmd.<op>,
so the bare hasattr guard passes but register_fake raises
"operator deepmd::tabulate_fusion_se_a does not exist", crashing
`dp test`.

Fix:
- Drop the eager tabulate_ops import from pt_expt/utils/__init__.py.
  The only consumer that needs the fakes (the compression entry point)
  already calls ensure_fake_registered() lazily, so plain pt inference
  no longer triggers any custom-op registration.
- Harden ensure_fake_registered(): guard each op with a real
  OpOverloadPacket check (_op_exists) instead of bare hasattr, so a
  monkeypatched plain-function fallback is skipped rather than crashing.
  Remove the import-time auto-call.

Tests (source/tests/pt_expt/utils/test_tabulate_ops_lazy.py):
- subprocess import of deepmd.pt.infer.deep_eval asserts tabulate_ops/
  comm are not eagerly imported.
- ensure_fake_registered() with a monkeypatched plain-function op present
  must skip it without raising (the exact dp test crash).
@dosubot dosubot Bot added the bug label Jun 16, 2026
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
📥 Commits

Reviewing files that changed from the base of the PR and between 90faa0a and 3eb005b.

📒 Files selected for processing (1)
  • source/tests/pt_expt/utils/test_tabulate_ops_lazy.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • source/tests/pt_expt/utils/test_tabulate_ops_lazy.py

📝 Walkthrough

deepmd/pt_expt/utils/__init__.py removes the eager import of tabulate_ops, deferring fake-op registration to explicit call sites. tabulate_ops.py adds a _op_exists() helper to distinguish real C++ dispatcher ops from Python fallbacks, updates ensure_fake_registered() to gate on it, and removes the module-level self-call. Two regression tests validate both lazy-import and fallback-skipping behaviors.

Changes

Lazy tabulate_ops fake-op registration

| Layer / File(s) | Summary |
| --- | --- |
| _op_exists helper and ensure_fake_registered gating (deepmd/pt_expt/utils/tabulate_ops.py) | Adds _op_exists(name) to detect real torch._ops.OpOverloadPacket dispatcher ops, updates ensure_fake_registered() to use it instead of hasattr() for each fusion op, removes the module-level call to ensure_fake_registered(), and updates the module docstring to document explicit-call semantics. |
| Remove eager tabulate_ops import from package init (deepmd/pt_expt/utils/__init__.py) | Removes the eager tabulate_ops import that triggered fake-op registration at package load time, replacing it with a comment documenting the lazy registration contract. |
| Regression tests for lazy import and fallback skipping (source/tests/pt_expt/utils/test_tabulate_ops_lazy.py) | Adds a subprocess-based test asserting tabulate_ops is not in sys.modules after importing deepmd.pt.infer.deep_eval, and a monkeypatch-based test asserting ensure_fake_registered() skips Python-fallback ops and does not record them in _registered. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • deepmodeling/deepmd-kit#5451: Applies the same lazy ensure_*_registered() pattern to comm.py's border-op registration, making it a direct predecessor to this PR's equivalent refactor for tabulate_ops.

Suggested reviewers

  • njzjz-bot
  • njzjz
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly reflects the main change: preventing eager loading of custom-op fakes in plain PyTorch DP tests, which is the core issue being fixed. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
source/tests/pt_expt/utils/test_tabulate_ops_lazy.py (1)

56-60: ⚡ Quick win

Add a timeout to the subprocess invocation.

This regression test can hang indefinitely if the child import blocks; adding a bounded timeout makes CI failure deterministic.

Suggested patch
     result = subprocess.run(
         [sys.executable, "-c", code],
         capture_output=True,
         text=True,
+        timeout=30,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/tests/pt_expt/utils/test_tabulate_ops_lazy.py` around lines 56 - 60,
The subprocess.run call in the regression test does not have a timeout
parameter, which can cause the test to hang indefinitely if the child import
blocks. Add a timeout parameter to the subprocess.run invocation with an
appropriate value in seconds to ensure that if the subprocess hangs, the test
fails deterministically rather than blocking CI indefinitely. This will make
debugging hanging processes much easier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/tests/pt_expt/utils/test_tabulate_ops_lazy.py`:
- Around line 81-108: The test only saves and restores the registration state of
a single operation `qualname`, but ensure_fake_registered() may modify multiple
operations in the tabulate_ops._registered set, causing state leakage between
tests. Replace the current approach with a snapshot-and-restore pattern: save a
deep copy of the entire tabulate_ops._registered set before the try block
(instead of just tracking was_registered for qualname), then restore the
complete snapshot in the finally block by assigning it back to
tabulate_ops._registered. This ensures all operations touched by
ensure_fake_registered() are properly cleaned up regardless of how many
operations the function modifies.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between d3834e2 and 90faa0a.

📒 Files selected for processing (3)
  • deepmd/pt_expt/utils/__init__.py
  • deepmd/pt_expt/utils/tabulate_ops.py
  • source/tests/pt_expt/utils/test_tabulate_ops_lazy.py

Comment thread source/tests/pt_expt/utils/test_tabulate_ops_lazy.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90faa0a494

Comment thread deepmd/pt_expt/utils/__init__.py
@wanghan-iapcm wanghan-iapcm requested review from iProzd and njzjz June 16, 2026 07:34
Restore the entire tabulate_ops._registered set in the finally block rather
than just the single op under test: ensure_fake_registered() may touch
multiple op names, so per-op restore could leak module-global state across
tests. Addresses CodeRabbit review on deepmodeling#5542.
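
The snapshot-and-restore approach described in this commit message can be sketched as below. It is an illustration only: `state_module` stands in for the tabulate_ops module, and `touch_many_ops` and `run_isolated` are hypothetical helpers, not code from the test file.

```python
from types import SimpleNamespace

# Stand-in for a module that keeps a module-level registry set.
state_module = SimpleNamespace(_registered={"preexisting_op"})


def touch_many_ops():
    """Simulate a call that may add several op names to the registry."""
    state_module._registered.update({"op_a", "op_b"})


def run_isolated(func):
    """Run func, then restore the whole registry regardless of what it touched."""
    snapshot = set(state_module._registered)  # copy the entire set, not one entry
    try:
        func()
        return set(state_module._registered)  # state as seen during the call
    finally:
        state_module._registered = snapshot  # restore everything in one step


seen_inside = run_isolated(touch_many_ops)
after = state_module._registered
print(sorted(seen_inside), sorted(after))
```

Restoring the full snapshot avoids having to know in advance which entries the call under test will add, which is the leak that a per-op restore leaves open.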
@codecov

codecov Bot commented Jun 16, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.21%. Comparing base (d3834e2) to head (3eb005b).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5542   +/-   ##
=======================================
  Coverage   82.21%   82.21%           
=======================================
  Files         892      892           
  Lines      101531   101530    -1     
  Branches     4240     4240           
=======================================
+ Hits        83475    83476    +1     
+ Misses      16753    16751    -2     
  Partials     1303     1303           

zhaiwenxi pushed a commit to zhaiwenxi/deepmd-kit that referenced this pull request Jun 17, 2026
Restore the entire tabulate_ops._registered set in the finally block rather
than just the single op under test: ensure_fake_registered() may touch
multiple op names, so per-op restore could leak module-global state across
tests. Addresses CodeRabbit review on deepmodeling#5542.
@iProzd iProzd added this pull request to the merge queue Jun 17, 2026
Merged via the queue into deepmodeling:master with commit a4c5592 Jun 17, 2026
70 checks passed