fix(pt): sort nlist for compressed se_e2_a in forward_lower #5524

Merged
wanghan-iapcm merged 4 commits into deepmodeling:master from wanghan-iapcm:fix/compressed-se-a-unsorted-nlist on Jun 15, 2026

Conversation

@wanghan-iapcm wanghan-iapcm commented Jun 13, 2026

Collaborator

Summary

Fixes wrong energy/forces (and unstable LAMMPS MD) for compressed se_e2_a models evaluated through forward_lower — the C++/LAMMPS inference path. Reported in discussion #5438.

Root cause

The compressed tabulate_fusion_se_a op (source/lib/src/tabulate.cc, forward and grad kernels) has an is_sorted-gated early-termination:

ago = em_x[ii * nnei + nnei - 1];           // last neighbor's em_x
if (ago == xx && ll[1]==0 && ll[2]==0 && ll[3]==0 && is_sorted) break;

It stops accumulating at the first neighbor whose env-mat direction is zero. Both -1 padding and out-of-rcut neighbors (sw==0) have zero direction and the same em_x == -davg/dstd (== ago), so the op assumes all such neighbors are trailing. is_sorted defaults to true and the PT op never overrides it.

The C++/LAMMPS forward_lower neighbor list uses rcut + skin and is not distance-sorted. _format_nlist only filters out-of-rcut neighbors in its sort branch (n_nnei > nnei); when the LAMMPS list is no wider than sum(sel), that branch is skipped and the pad-only branch runs instead. The zero-direction neighbors can then land before real ones, so the op breaks early and silently drops real neighbors → wrong descriptor → wrong energy/forces → unstable MD.
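
The failure mode can be reproduced with a toy model. The sketch below is not the kernel in source/lib/src/tabulate.cc: accumulate and its arguments are hypothetical names, and each neighbor's contribution is reduced to a single float, with fillers (padding or out-of-rcut neighbors) contributing exactly zero. It only illustrates why an is_sorted early break is harmless when fillers trail and lossy when they are interleaved.

```python
def accumulate(contribs, is_filler, is_sorted=True):
    """Toy stand-in for the per-atom neighbor loop (hypothetical, simplified)."""
    total = 0.0
    for value, filler in zip(contribs, is_filler):
        # Early termination: a filler is taken to mean "only fillers remain".
        if filler and is_sorted:
            break
        total += value  # fillers carry 0.0, so adding them changes nothing
    return total


# Fillers trailing: the early break skips only zeros.
sorted_sum = accumulate([1.0, 2.0, 0.0, 0.0], [False, False, True, True])

# Fillers interleaved, as an unsorted rcut + skin list can produce:
# the loop stops at index 0 and never reaches the real neighbors.
unsorted_sum = accumulate([0.0, 1.0, 0.0, 2.0], [True, False, True, False])

# Without the early break the sum is order-invariant again.
full_sum = accumulate(
    [0.0, 1.0, 0.0, 2.0], [True, False, True, False], is_sorted=False
)

print(sorted_sum, unsorted_sum, full_sum)  # 3.0 0.0 3.0
```

The third call mirrors the uncompressed path described in this PR: a plain sum over all neighbors does not care where the zero-valued fillers sit.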

Only the compressed path is affected:

  • The uncompressed embedding-net path sums over neighbors and treats zero-direction fillers identically regardless of position, so it is order-invariant.
  • Only tabulate_fusion_se_a (forward + grad) has this early-termination; se_t/se_r forward kernels do not.

It is device-independent (reproduces identically on CPU and GPU).

Fix

The wiring already exists — the model calls format_nlist(..., extra_nlist_sort=self.need_sorted_nlist_for_lower()) — but DescrptBlockSeA.need_sorted_nlist_for_lower() always returned False. Make it return self.compress:

def need_sorted_nlist_for_lower(self) -> bool:
    return self.compress

When compression is enabled this forces the sort + rcut-filter branch (in-rcut neighbors first, all padding last), restoring the op's invariant. The standard (uncompressed) route is unchanged, so there is no added cost on the common path.
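
What the sort + rcut-filter branch achieves for a single atom can be sketched as follows. This is a minimal illustration, not DeePMD-kit's _format_nlist: sort_and_filter_row is a hypothetical helper operating on plain Python lists, and the neighbor distances are assumed to be precomputed.

```python
def sort_and_filter_row(nlist_row, dist_row, rcut, nnei):
    """Return in-rcut neighbors sorted by distance, then -1 padding up to nnei."""
    # Keep real neighbors (index >= 0) that lie within the cutoff.
    kept = [(d, j) for j, d in zip(nlist_row, dist_row) if j >= 0 and d <= rcut]
    kept.sort()  # nearest first; ties are broken by neighbor index
    indices = [j for _, j in kept][:nnei]  # truncate if more than nnei remain
    return indices + [-1] * (nnei - len(indices))


# Unsorted row with padding (-1) and one neighbor beyond rcut = 6.0.
row = sort_and_filter_row(
    nlist_row=[-1, 7, 3, -1, 5],
    dist_row=[0.0, 6.5, 1.2, 0.0, 2.0],
    rcut=6.0,
    nnei=4,
)
print(row)  # [3, 5, -1, -1]
```

After this step every real, in-rcut neighbor precedes every filler, which is the ordering the is_sorted early termination relies on.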

Verification

  • In LAMMPS (CPU and GPU) the compressed model now matches the uncompressed model to ~2.5e-14 (was ~0.5 eV off with scrambled forces).
  • New regression test source/tests/pt/model/test_compressed_se_a_forward_lower.py runs compressed forward_lower with an unsorted, over-rcut neighbor list and compares energy + force to the uncompressed reference, parameterized over type_one_side ∈ {True, False}. It fails without this fix and passes with it; existing test_compressed_descriptor_se_a.py and test_forward_lower.py still pass.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed compressed descriptor behavior so enabling compression preserves neighbor ordering invariants and no longer causes valid neighbors to be dropped; energies and forces remain correct with unsorted/padded neighbor lists.
  • Tests

    • Added regression tests for multiple descriptor variants that validate compressed mode against uncompressed baselines using unsorted/over-cut neighbor lists, asserting energy and force fidelity.

The compressed `tabulate_fusion_se_a` op uses an `is_sorted` early-termination
that stops accumulating at the first neighbor whose env-mat direction is zero
(padding, or an out-of-rcut neighbor with sw==0), assuming such neighbors are
trailing. The C++/LAMMPS `forward_lower` neighbor list (rcut+skin, not
distance-sorted) can interleave these zero-direction neighbors before real ones,
so the op silently drops real neighbors, producing wrong energy/forces and
unstable MD. Only the compressed path is affected: the uncompressed
embedding-net path sums over neighbors and treats zero-direction fillers identically
regardless of position, and only `tabulate_fusion_se_a` (forward and grad) has
this early-termination (se_t/se_r do not).

Make `DescrptBlockSeA.need_sorted_nlist_for_lower()` return `self.compress`. The
model already wires `format_nlist(..., extra_nlist_sort=need_sorted_nlist_for_lower())`,
so when compression is enabled this forces the sort + rcut-filter branch, which
puts in-rcut neighbors first and all padding last, restoring the op's invariant.
The standard (uncompressed) route is unchanged.

Add a regression test that runs compressed `forward_lower` with an unsorted,
over-rcut neighbor list and compares energy and force to the uncompressed
reference, parameterized over type_one_side. It fails without this fix and
passes with it.

Reported in discussion deepmodeling#5438.
@dosubot dosubot Bot added the bug label Jun 13, 2026
@wanghan-iapcm wanghan-iapcm requested a review from njzjz June 13, 2026 01:19
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 34f1d1e1-60fc-4437-9de9-782c0bb3bfcc

📥 Commits

Reviewing files that changed from the base of the PR and between e732db7 and d50ff94.

📒 Files selected for processing (1)
  • source/tests/pt/model/test_compressed_se_r_forward_lower.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • source/tests/pt/model/test_compressed_se_r_forward_lower.py

Walkthrough

Require sorted neighbor lists for compressed SE_A by returning the compression flag from need_sorted_nlist_for_lower(). Add regression tests (SE_A and SE_R) that verify compressed forward_lower matches uncompressed references when given intentionally unsorted, over-rcut neighbor lists.

Changes

Compressed SE_A sorted neighbor list fix

| Layer / File(s) | Summary |
| --- | --- |
| Descriptor method behavior fix (deepmd/pt/model/descriptor/se_a.py) | need_sorted_nlist_for_lower() now returns self.compress instead of constant False. Documentation explains the compressed tabulation op requires sorted neighbor lists to preserve its is_sorted early-termination invariant. |
| Regression tests for compressed SE_A (source/tests/pt/model/test_compressed_se_a_forward_lower.py) | Add a test module that builds rcut-bounded reference outputs, enables compression using a min-neighbor-distance lower bound, constructs reversed over-rcut neighbor lists (padding/out-of-rcut neighbors first), runs forward_lower, and asserts energies and reduced/assembled forces match the uncompressed reference. |
| Regression tests for compressed SE_R (source/tests/pt/model/test_compressed_se_r_forward_lower.py) | Add a test module that runs compressed vs uncompressed forward_lower on the same reversed over-rcut FLAT neighbor list for se_e2_r and asserts equality for total energy and reduced extended forces. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • njzjz
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and specifically summarizes the main fix: enabling sorted neighbor lists for compressed se_e2_a models in the forward_lower path to prevent incorrect energy/force calculations. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@codecov

codecov Bot commented Jun 13, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.19%. Comparing base (5d94bd6) to head (d50ff94).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5524   +/-   ##
=======================================
  Coverage   82.19%   82.19%           
=======================================
  Files         891      891           
  Lines      101599   101600    +1     
  Branches     4242     4242           
=======================================
+ Hits        83507    83509    +2     
+ Misses      16789    16787    -2     
- Partials     1303     1304    +1     

@njzjz njzjz left a comment

Member

Does the same problem exist in se_e2_r?

Companion to test_compressed_se_a_forward_lower.py. Confirms se_e2_r is
immune to the unsorted/over-rcut forward_lower nlist that broke compressed
se_a (discussion deepmodeling#5438): tabulate_fusion_se_r has no is_sorted
early-termination and reduces over neighbors order-independently, so
need_sorted_nlist_for_lower() correctly stays False for se_r. Expected to
pass with no production code change.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/tests/pt/model/test_compressed_se_r_forward_lower.py`:
- Around line 123-134: The test must explicitly assert that the reversed
neighbor list actually contains both in-rcut and out-of-rcut neighbors before
calling forward_lower: after you get nlist2 from
extend_input_and_build_neighbor_list (keep a copy before flipping), compute
neighbor distances by gathering neighbor coordinates from coord (use the
returned nlist copy and coord.unsqueeze(0)), compare those distances to rcut to
form boolean masks, and assert at least one True (distance <= rcut) and at least
one False (distance > rcut); only then proceed to flip nlist2 and call
self.model.forward_lower so the regression precondition is enforced.

📥 Commits

Reviewing files that changed from the base of the PR and between fe877eb and afa9d01.

📒 Files selected for processing (1)
  • source/tests/pt/model/test_compressed_se_r_forward_lower.py

Comment thread source/tests/pt/model/test_compressed_se_r_forward_lower.py Outdated
The first version compared compressed over-cut vs uncompressed CLEAN, which
conflated compression accuracy with se_r's intrinsic nlist-representation
sensitivity (se_r's mean reduction, unlike se_a, is not invariant to clean
vs over-cut nlist layout -> a ~1e-4 uncompressed-only gap). Compare
compressed vs uncompressed on the IDENTICAL unsorted over-cut nlist instead,
which isolates the op: it matches to ~1e-16, confirming tabulate_fusion_se_r
has no order/is_sorted bug, while still catching an se_a-style divergence.

@coderabbitai coderabbitai Bot left a comment

Contributor

♻️ Duplicate comments (1)
source/tests/pt/model/test_compressed_se_r_forward_lower.py (1)

125-128: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the regression precondition explicitly before the reference run.

The test assumes the generated system always yields both in-rcut and out-of-rcut neighbors after reversal, but never asserts that condition. If seed or device behavior changes, the test can become a false-positive guard.

🛡️ Proposed precondition check
         nlist = torch.flip(nlist, dims=[-1])
+        # Guard the intended scenario: reversed nlist must include both
+        # in-rcut and out-of-rcut neighbors for this configuration.
+        coord0 = ec[:, : coord.shape[0], :]
+        safe_nlist = torch.where(nlist >= 0, nlist, torch.zeros_like(nlist))
+        gather_idx = safe_nlist.view(1, -1, 1).expand(-1, -1, 3)
+        nei_coord = torch.gather(ec, 1, gather_idx).view(1, coord.shape[0], -1, 3)
+        rr = torch.linalg.norm(nei_coord - coord0[:, :, None, :], dim=-1)
+        real = nlist >= 0
+        self.assertTrue(torch.any(real & (rr <= rcut)), "No in-rcut neighbors found")
+        self.assertTrue(torch.any(real & (rr > rcut)), "No out-of-rcut neighbors found")
 
         # reference: uncompressed forward_lower on this exact nlist
         ref = self.model.forward_lower(ec, ea, nlist, mp, do_atomic_virial=False)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/tests/pt/model/test_compressed_se_r_forward_lower.py` around lines 125
- 128, The test must explicitly assert the regression precondition that the
flipped neighbor list produces both in-rcut and out-of-rcut neighbors before
calling forward_lower: compute distances for the pairs in nlist (using ec and
the edge attributes ea or the model's distance utility) and assert there is at
least one distance <= self.model.cutoff and at least one distance >
self.model.cutoff (or use mp.rcut if available), then only call ref =
self.model.forward_lower(ec, ea, nlist, mp, do_atomic_virial=False); this
ensures nlist (after torch.flip) actually contains both in- and out-of-range
neighbors required by the test.

📥 Commits

Reviewing files that changed from the base of the PR and between afa9d01 and e732db7.

📒 Files selected for processing (1)
  • source/tests/pt/model/test_compressed_se_r_forward_lower.py

…anch)

The ~1e-4 clean-vs-over-cut gap is NOT se_r-specific reduction sensitivity
(my earlier note was wrong). It is format_nlist's pad branch (width == nnei,
over-rcut neighbors, no re-sort/rcut-filter) being order-dependent -- the same
root condition as deepmodeling#5438 -- and it affects se_a (~4e-6) and se_r (~1e-4)
identically in the uncompressed path. mixed_types=True vs False is bit-identical
and not involved. The test still compares compressed vs uncompressed on the same
nlist to cancel that shared effect and isolate the (bug-free) se_r op.
@wanghan-iapcm wanghan-iapcm enabled auto-merge June 14, 2026 03:33
@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue Jun 14, 2026
@njzjz njzjz linked an issue Jun 14, 2026 that may be closed by this pull request
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Jun 14, 2026
@wanghan-iapcm wanghan-iapcm added this pull request to the merge queue Jun 15, 2026
Merged via the queue into deepmodeling:master with commit beb50da Jun 15, 2026
70 checks passed
@wanghan-iapcm wanghan-iapcm deleted the fix/compressed-se-a-unsorted-nlist branch June 15, 2026 06:10
pull Bot pushed a commit to ishandutta2007/deepmd-kit that referenced this pull request Jun 16, 2026
…ch (deepmodeling#5529)

## Summary

`forward_lower`'s `_format_nlist` only filtered out-of-`rcut` neighbors
in its **sort** branch (`n_nnei > nnei or extra_nlist_sort`). When the
input neighbor-list width is `<= nnei` (`sum(sel)`) and
`extra_nlist_sort` is `False`, it took the **pad** branch and returned
the nlist unchanged — never dropping neighbors beyond `rcut`.

The C++/LAMMPS path (`DeepPotPT.cc` → `copy_from_nlist` → `padding`)
builds the neighbor list with `rcut + skin` and does **not** rcut-filter
before `forward_lower` (the in-code comment in `commonPT.h` is explicit:
*"No truncation or distance sorting is done — the model's format_nlist
handles that"*). Its width is the per-atom neighbor count, which on
sparse systems is `<= nnei` — exactly the case in discussion deepmodeling#5438
(width 39 < 100). In that regime, out-of-`rcut` neighbors leak into the
descriptor. Because the pad branch does not re-sort, the leaked
contribution is **order-dependent**: reversing the nlist changes the
energy by ~`1e-4` (se_r) / ~`4e-6` (se_a).

This is the same root condition as deepmodeling#5438; that PR (deepmodeling#5524) closed it only
for *compressed* se_a (by forcing `extra_nlist_sort`). The uncompressed
paths and se_r/se_t remained exposed.

## Fix

The pad branch now also drops neighbors with `rr > rcut` (no re-sort —
these descriptors reduce over neighbors order-independently). Applied to
**both** the `pt` and `dpmodel` backends (`dpmodel` is shared by
`pt_expt`).

The exported `pt_expt` graph is unaffected: export forces
`extra_nlist_sort=True`, so it always takes the sort branch; the new
pad-branch code is eager-only.
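
One reading of the pad-branch change, sketched for a single atom. This is an assumption-laden illustration rather than the code in the `pt` or `dpmodel` backends: `pad_and_filter_row` is a hypothetical helper on plain Python lists, distances are assumed to be precomputed, and masking out-of-`rcut` entries to `-1` in place is just one way to drop them without re-sorting.

```python
def pad_and_filter_row(nlist_row, dist_row, rcut, nnei):
    """Mask neighbors beyond rcut to -1 without reordering, then pad to nnei."""
    if len(nlist_row) > nnei:
        # Wider lists belong to the sort branch, which is not modeled here.
        raise ValueError("pad branch expects a list no wider than nnei")
    masked = [
        j if (j >= 0 and d <= rcut) else -1
        for j, d in zip(nlist_row, dist_row)
    ]
    return masked + [-1] * (nnei - len(masked))


# Neighbor 7 sits at 6.5 > rcut = 6.0 and is dropped; order is untouched.
row = pad_and_filter_row(
    nlist_row=[7, 3, -1, 5],
    dist_row=[6.5, 1.2, 0.0, 2.0],
    rcut=6.0,
    nnei=6,
)
print(row)  # [-1, 3, -1, 5, -1, -1]
```

Because the surviving entries keep their positions, a scheme like this only suits descriptors that reduce over neighbors order-independently; compressed se_a still needs the sort branch forced by `extra_nlist_sort`.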

## Verification

- New regression test `test_format_nlist_overcut.py`: over-cut
(`rcut+2`) `forward_lower`, as-is and reversed, must match the
`rcut`-bounded reference for se_a and se_r (energy + force, `1e-10`).
The reversed cases **fail without this fix** and pass with it.
- After the fix: se_a reversed over-cut `rel = 0.0`, se_r `rel =
4.4e-16` (were `4.3e-6` / `1.4e-4`).
- Existing suites green on GPU: `test_jit` (all models),
`test_forward_lower`, `test_permutation`, `pt_expt` descriptor +
ener-model export, universal dp/pt model consistency.



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved neighbor-list handling to drop neighbor candidates whose
distances exceed the cutoff when using over-cut neighbor lists,
preserving neighbor order while ensuring out-of-cutoff entries are
excluded.

* **Tests**
* Added regression coverage for over-cut neighbor lists in the PT
backend and common DP model backend, validating energy/forces
consistency across descriptor types and forward/reversed neighbor
ordering.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

Development

Successfully merging this pull request may close these issues.

the MD running is only working when using CPU