Fix ~13x force-evaluation regression in classic ACEModel#326
Merged
Conversation
The default `ace1_model` / classic `ACEModel` force evaluation regressed
~13x in the v0.10 (EquivariantTensors) series versus 0.9.1.
Root cause: a type instability in `evaluate_ed` (src/models/ace.jl).
`EquivariantTensors.pullback` is not type-stable (returns Tuple{Any,Any}),
so `∂Rnl` / `∂Ylm` reached the inline gradient-assembly loops as `Any`,
making every element-wise product dynamically dispatched. The numerical
kernels were ~30us but the call took ~400us and allocated ~764 KB/site,
with ~80% of the time spent in the assembly loops.
Fix: move the assembly into a function barrier (`_assemble_grad_ed!`) so
Julia re-specialises on the concrete runtime types. evaluate_ed drops from
~386us to ~30us for 40 neighbours; forces are bit-identical (max|Δg| ≈ 1e-15).
Cross-version regression vs 0.9.1 collapses from ~13x to ~1.15x.
Note: the originally-suspected cause (ET autograd forces) was refuted by
benchmark — the ET path is not the regression.
Also adds performance regression tooling under benchmark/ to prevent
recurrence:
- common.jl: shared model/system setup
- bench_forces_regression.jl: classic vs ET forces/energy (time + allocs)
- bench_crossversion.jl (+ worker): release-vs-release comparison
- profile_classic_ed.jl: evaluate_ed component profile
- benchmarks.jl: PkgBenchmark suite
- .github/workflows/Benchmark.yml: non-blocking BenchmarkCI judge (PR vs base)
- README.md, FORCE_REGRESSION_FINDINGS.md
Verified: test/models/test_ace.jl and test/models/test_calculator.jl pass
(finite-difference force checks, Zygote-through-loss).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Benchmark job failed at `Pkg.add("BenchmarkCI")` — the workflow only
added the ACEregistry, so the General registry was absent and BenchmarkCI
could not be resolved.
Also, BenchmarkCI's judge-against-base cannot work on this PR: the benchmark
suite does not exist on `main` yet, so there is no baseline to compare to.
Switch to a robust run-and-report approach:
- add both the General and ACE registries
- run the PkgBenchmark suite via `benchmarkpkg(".")` (verified locally)
- post results to the job summary
- keep it non-blocking, threads pinned
Base-branch `judge` (e.g. via BenchmarkCI) can be enabled in a follow-up once
the suite is on `main`. Docs updated accordingly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous attempt resolved the registries but failed with
"Package BenchmarkTools not found": PkgBenchmark propagates the *calling*
process's active project to the benchmark subprocess, and the step ran from
the global environment instead of benchmark/Project.toml.
Instantiate and run the suite with `--project=benchmark` (matches the verified
local invocation `julia --project=benchmark -e 'benchmarkpkg(".")'`).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
@cortner OK to merge now or should we wait and fix in ET? My opinion is the function barrier does no harm and makes it work with current ET. |
Member
|
this is definitely ok for me to merge, this would be ok like this even if we fix the type instability in ET |
This was referenced Jun 29, 2026
Collaborator
Author
|
I'll plan to merge this and #327 tag a patch release to get the regression fix to users |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The default
ace1_model/ classicACEModelforce evaluation regressed ~13× in the v0.10 (EquivariantTensors) series versus the previous release (0.9.1). This PR fixes it and adds regression tooling.Root cause
A type instability in
evaluate_ed(src/models/ace.jl).EquivariantTensors.pullbackis not type-stable (returnsTuple{Any,Any}), so∂Rnl/∂Ylmreached the inline gradient-assembly loops asAny, making every element-wise product dynamically dispatched. The numerical kernels were ~30 µs, but the call took ~400 µs and allocated ~764 KB/site — ~80% of the time was in the assembly loops.The originally-suspected cause (ET autograd forces) was refuted by benchmark: the ET path is not the regression; in fact the broken classic path was slower than ET.
Fix
Move the gradient assembly into a function barrier (
_assemble_grad_ed!) so Julia re-specialises on the concrete runtime types.evaluate_ed(40 neighbours)Forces are bit-identical (max|Δg| ≈ 1e-15).
Regression tooling (prevent recurrence)
benchmark/common.jl— shared model/system setupbenchmark/bench_forces_regression.jl— classic vs ET forces/energy (time + allocations)benchmark/bench_crossversion.jl(+ worker) — release-vs-release comparison in isolated envsbenchmark/profile_classic_ed.jl—evaluate_edcomponent profilebenchmark/benchmarks.jl— PkgBenchmarkSUITE.github/workflows/Benchmark.yml— non-blocking BenchmarkCI judge (PR vs base on same runner, threads pinned)benchmark/README.md,benchmark/FORCE_REGRESSION_FINDINGS.mdVerification
test/models/test_ace.jlandtest/models/test_calculator.jlpass (finite-difference force checks, Zygote-through-loss).Follow-up (not in this PR)
Optimisations to the ET path (analytic
site_gradsto remove 12–154 MB/eval Zygote allocations; shared interaction graph across stacked calculators) — to be planned separately.🤖 Generated with Claude Code