Generic CostModel for single-term optimization (peak-memory & batched objectives, roofline tie-break)#559
evaleev wants to merge 58 commits into
…mization) Adds doc/dev/specs spec for a CostModel abstraction that unifies single-term optimization cost and runtime evaluation policy: peak-memory objective (pebbling recurrence), batchability-aware footprint (monomial in batch-index extents, peak_full/peak_slice two-mode, persistence-gated frontier), three built-in models (DenseFLOPs, DensePeakSize, DensePeakSizeBatched), and custom injection.
… term Phase 1 oracle revealed the pebbling recurrence computed a Sethi-Ullman (register-style) peak that omits resident input leaves. Adopt the realistic all-co-resident tensor peak: add per-subset L[n] (leaf-size sum) and the bystander term L[other]+peak[child] to the DP; oracle (7.0 on the 3-leaf example) and DP now agree. Same clean subset DP and optimal substructure.
…m + public dispatch)
The DensePeakSize enumerator, the peak_cost wrapper, and the public single_term_opt Metric template parameter were undocumented. Describe the peak-memory objective (all-co-resident model, order is a real lever) and its Phase-1 limitation (no subnet CSE).
Make the DensePeakSizeBatched formulation concrete under the all-co-resident model: explicit peak_full/peak_slice recurrence (full bystander terms, frontier substitution), local batchable-frontier gate (batch-index internal AND persistent), the validation strategy (slice mode reuses Phase-1 peak_cost at batched extents; full mode vs a tree x order x batch-choice oracle), and the Phase-2 OptimizeOptions plumbing (pre-CostModel).
Batchable indices slice independently (peak[n][B], B subset of the term's batchable indices) rather than as one group, which would under-count and mislead on multi-aux terms. Batch decision for Ki taken at the node where Ki is internalized; objective peak[root][empty]; m=1 collapses to two-mode and ties to Phase-1 peak_cost. Oracle and validation updated for per-index (incl. a two-distinct-aux case).
DP gains a [B] dimension over the term's distinct batchable indices; oracle threads the per-index slice context; reconstruction gets a full numeric memory-simulation check. All-sliced corner ties to Phase-1 peak_cost.
Add DensePeakSizeBatched to ObjectiveFunction and two new OptimizeOptions fields (is_batchable_index, batch_target_size). Implement in detail namespace (single_term.hpp):
- batchable_index_list: distinct batchable indices in appearance order
- sliced_footprints: 2^m tables of subset_footprints, one per sliced-set B
- leaf_volatile_mask: bitmask of volatile leaf tensors (mirrors inline mask)
Test: "per-index batchability tables" SECTION verifies that aux.size()==2 for two distinct F-space indices, tables.size()==4 (2^2 sliced-sets), the all-sliced footprint is strictly smaller than the unsliced one, and that slicing only F1 shrinks only the F1-leaf footprint.
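The per-index tables described in this commit can be sketched roughly as follows. This is a minimal illustration, not the SeQuant implementation: indices are modeled as plain string labels, and the names `batchable_index_list_sketch` and `sliced_footprints_sketch`, the `extent` callback, and the scalar `target` are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins: an index is a label, a tensor is a list of labels.
using Label = std::string;
using TensorIdx = std::vector<Label>;

// Distinct batchable indices across all tensors, in order of first appearance.
std::vector<Label> batchable_index_list_sketch(
    std::vector<TensorIdx> const& tensors,
    std::function<bool(Label const&)> const& is_batchable) {
  std::vector<Label> aux;
  for (auto const& t : tensors)
    for (auto const& ix : t)
      if (is_batchable(ix) &&
          std::find(aux.begin(), aux.end(), ix) == aux.end())
        aux.push_back(ix);
  return aux;
}

// tables[B][t] = footprint of tensor t when the indices selected by bitmask B
// (bit j <-> aux[j]) are sliced down to min(extent, target).
std::vector<std::vector<double>> sliced_footprints_sketch(
    std::vector<TensorIdx> const& tensors, std::vector<Label> const& aux,
    std::function<std::size_t(Label const&)> const& extent,
    std::size_t target) {
  assert(aux.size() < 20 && "2^m tables: m must stay small");
  std::size_t const ntab = std::size_t{1} << aux.size();
  std::vector<std::vector<double>> tables(ntab);
  for (std::size_t B = 0; B < ntab; ++B) {
    for (auto const& t : tensors) {
      double vol = 1.0;
      for (auto const& ix : t) {
        std::size_t e = extent(ix);
        auto it = std::find(aux.begin(), aux.end(), ix);
        if (it != aux.end()) {
          auto const j = static_cast<std::size_t>(it - aux.begin());
          if (B & (std::size_t{1} << j)) e = std::min(e, target);
        }
        vol *= static_cast<double>(e);
      }
      tables[B].push_back(vol);
    }
  }
  return tables;
}
```

With two distinct batchable indices the sketch yields four tables, and a table whose bitmask selects only the first index changes only the footprints of tensors carrying that index, which is the property the SECTION above checks.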
…side) One compile-time generic driver run_single_term_dp<Model> owns the subset lattice + bipartition enumeration; each objective becomes a CostModel type (State + Context + leaf/init/relax/finalize/reconstruct). Four built-ins (AdditiveModel x2, PeakModel, PeakBatchedModel) map their existing DPs; behavior-preserving (existing tests/oracles are the regression net). Evaluator face and mpqc deferred to Phase 4.
4 tasks, behavior-preserving, model-by-model (AdditiveModel, PeakModel, PeakBatchedModel) + concept/custom-model test. Old standalone DP/cost functions kept as reference oracles; per-objective equivalence tests + full existing suite green are the gates.
… cost; cover volatile path in equivalence test
…r optimizer + eval)
Bundle the batchability triple into one BatchPolicy{is_batchable_index,
batch_target_size (per-index), is_volatile_leaf} consumed by both the optimizer
(OptimizeOptions embeds it) and a thin eval-layer make_evaluator adapter (lifts
the Tensor volatile predicate to EvalNode). Generalizes batch_target_size from
scalar to per-index function. Two stages: SeQuant (policy+adapter+ripple), then
mpqc (construct once, feed both, delete dup). Behavior-preserving.
A1 batch_target_size scalar->per-index function; A2 BatchPolicy struct embedded in OptimizeOptions; A3 eval-layer make_evaluator adapter; B1 mpqc construct-once + feed-both + delete-dup (CSV-CCk energy-match validation). Behavior-preserving; A->B mpqc-compile window noted.
Replace every std::size_t batch/target_batch_size parameter on the
batched optimizer path and in make_batched_custom_evaluator with
std::function<std::size_t(Index const&)> batch_target_size.
Slicing applies min(extent(ix), batch_target_size(ix)), so a constant
lambda [](Index const&){ return N; } reproduces old scalar-N results.
Changed:
- OptimizeOptions::batch_target_size: size_t -> function<size_t(Index)>
- sliced_footprints, peak_dp_batched, peak_cost_batched,
single_term_opt_peak_batched_impl, reconstructed_batched_peak: same
- detail::single_term_opt and public single_term_opt overload: same
- PeakBatchedModel::batch member: size_t -> function<size_t(Index)>
- make_batched_custom_evaluator: target_batch_size is now a function;
call sites pass target_batch_size(*K) and target_batch_size(*Kk)
to mode_batches (which still takes a scalar)
Tests: all existing batched test calls updated to pass constant lambdas;
new SECTION("per-index batch_target_size honored") verifies that
distinct per-index sizes produce different peak costs.
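A minimal sketch of the scalar-to-function change described above, assuming a stand-in `IndexSketch` type in place of the real `Index`; the helper names `sliced_extent`, `constant_target`, and `per_index_target` are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for the real Index type: only a label is modeled.
struct IndexSketch {
  std::string label;
};

using BatchTargetSize = std::function<std::size_t(IndexSketch const&)>;

// Extent of an index once sliced: min(extent(ix), batch_target_size(ix)).
std::size_t sliced_extent(IndexSketch const& ix, std::size_t full_extent,
                          BatchTargetSize const& batch_target_size) {
  assert(batch_target_size && "an empty std::function would throw when called");
  return std::min(full_extent, batch_target_size(ix));
}

// A constant function reproduces the old scalar-N behavior.
BatchTargetSize constant_target(std::size_t n) {
  return [n](IndexSketch const&) { return n; };
}

// A per-index choice: one size for labels starting with 'K', another otherwise.
BatchTargetSize per_index_target(std::size_t k_size, std::size_t other_size) {
  return [=](IndexSketch const& ix) {
    return (!ix.label.empty() && ix.label.front() == 'K') ? k_size : other_size;
  };
}
```

Passing `constant_target(N)` everywhere gives the same sliced extents as the former scalar parameter, while `per_index_target` shows how two indices in the same term can receive different slice sizes.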
…dp, *_impl, PeakRes)
…hs' mutual agreement The make_evaluator-vs-hand-built comparison asserted Tight (exact) equality between two independent batched-summation evaluations, whose accumulation order is thread-non-deterministic; this flaked by a few ULPs. Both paths already compare Loose against the reference for the same reason; make the mutual comparison Loose too.
These files should be removed before merging.
these files contain valuable details about the design.
I feel like they would have to be rewritten significantly in order to be actually useful documentation.
probably, but no bandwidth available to write for human consumption. This is still better than nothing. This type of knowledge is not extractable from the code afterwards.
If it's not intended for humans, I vote for moving it outside of docs (into something like agent-docs) or putting it in a subdirectory like docs/for_agents. Also, the files could be renamed to something a bit more meaningful.
…ed peak model Prints per term (DensePeakSizeBatched): predicted peak vs the largest UNSLICED node footprint in the DP-chosen tree, to localize whether the model under-counts a discounted node (predicted << unsliced) or mis-ranks trees. Gated on the SEQUANT_PEAK_DEBUG env var; no effect otherwise. TEMPORARY diagnostic -- to be removed via git surgery before merge.
…odel fix) The peak objectives sized a tensor with k composites sharing a proto-pair as nocc^N * average_csv_extent^k, grossly under-counting it: the true block-sparse volume is Sum_pairs d^k, and <d^k> >> <d>^k for heavy-tailed PNO domains (the average is also diluted across the full nocc^N grid). The peak DP was therefore blind to the multi-composite mu-tilde intermediates and freely built them (water-14 OOM at 181.8 GiB; minimize-peak produced WORSE peak than flops). footprint_counter now groups tot_indices.inner by proto-set and sizes each member of a k-group by an optional k-aware inner_pow(c, k) = the k-th power mean (Sum_pairs d^k / nocc^N)^(1/k), so nocc^N * inner_pow^k = Sum_pairs d^k exactly. inner_pow is threaded through subset_footprints/sliced_footprints, the peak models, single_term_opt, and OptimizeOptions. When empty (k=1) the behavior is identical to before, so the [optimize] suite is unaffected. Tetramer (cache off): peak max hw 7.11 -> 1.59 GiB, monster eliminated, now below the flops factorization; energies match; [optimize] 309 assertions green. (Bundled with the env-gated SEQUANT_PEAK_DEBUG diagnostic; to be separated.)
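The sizing identity in this commit, nocc^N * inner_pow^k = Sum_pairs d^k, can be checked numerically with a small sketch. The function names below are hypothetical, `grid_size` stands for nocc^N, and `domain_sizes` lists the per-pair domain size d; none of this is the actual `footprint_counter` code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum over pairs of d^k: the block-sparse volume of k composites that share
// one proto-pair (d = that pair's domain size).
double exact_volume(std::vector<double> const& domain_sizes, int k) {
  double sum = 0.0;
  for (double d : domain_sizes) sum += std::pow(d, k);
  return sum;
}

// k-th power mean over the full grid: (Sum_pairs d^k / grid_size)^(1/k).
double inner_pow_sketch(std::vector<double> const& domain_sizes,
                        std::size_t grid_size, int k) {
  assert(k >= 1 && grid_size > 0);
  return std::pow(exact_volume(domain_sizes, k) / grid_size, 1.0 / k);
}

// Previous estimate: grid_size * (average extent)^k, with the average taken
// over the same grid.
double averaged_volume(std::vector<double> const& domain_sizes,
                       std::size_t grid_size, int k) {
  assert(k >= 1 && grid_size > 0);
  double const mean = exact_volume(domain_sizes, 1) / grid_size;
  return grid_size * std::pow(mean, k);
}
```

For domain sizes {1, 1, 1, 9} on a 4-point grid with k = 2, the exact volume is 84 while the averaged estimate is 36; multiplying the grid size by the squared power mean recovers 84, and for k = 1 the power mean reduces to the plain average.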
…jectives) Extend the CSV/PNO composite fidelity fix to flops_counter and memsize_counter via a shared inner_aware_volume() helper, so DenseFLOPs and DenseSize also size a k-composite group over a shared proto-pair by Sum_pairs d^k (the k-th power mean) instead of nocc^N * average^k. opt_pure_product now forwards OptimizeOptions::inner_pow to the additive arms as well. Design audit: all four objectives consume the same proto-index-dependent inner_pow input. flops/size do not OOM without this (an over-built multi-composite contraction is operation-expensive regardless), but the under-count is non-uniform across contractions and can misrank trees, so accurate counts improve flop/size optimality. Tetramer (cache off): flops factorization and peak unchanged (2.74 GiB), energy matches; [optimize] 309 assertions green.
…notes) Document the inner_pow refinement: @brief/@param/@return/@note on the shared inner_aware_volume helper, and @param inner_pow on flops_counter, memsize_counter, footprint_counter, subset_footprints, sliced_footprints, and single_term_opt (detail + public). Comment-only; no code change.
…lops The peak objectives are degenerate when the peak is set by an unavoidable leaf/intermediate (e.g. a DF integral) that dominates many factorizations: among equal-peak schedules the DP arbitrarily picked one, which could carry an OSV/PNO uncontracted or form a rank-increasing outer product instead of contracting it with its amplitude early. Add a lexicographic (peak, then volatile-weighted flops) tie-break to PeakModel and PeakBatchedModel so the least-work equal-peak schedule wins (volatile contractions weighted by volatile_weight, matching the DenseFLOPs convention). Also stop gating batching on persistence. The DensePeakSizeBatched cost model and the runtime batched evaluator declined to slice subtrees containing a volatile (amplitude) leaf; for footprint minimization that gate only raises the modelled/realized peak (slicing the batch axis shrinks any intermediate carrying it regardless of volatility, and leaves flops unchanged). Batch across the board by default and add BatchPolicy::persistent_only (default false) to recover the old persistent-only behavior, read identically by the optimizer and the runtime.
make_batched_custom_evaluator / make_evaluator now batch across the board by default; the persistence-gate tests (volatile subtree -> not batched, and the group-replay yield counts) assert the gated behavior, so opt them in with persistent_only=true (BatchPolicy::persistent_only / the new param).
The peak objectives' DP previously kept one (peak, flops) winner per subset with a lexicographic local tie-break, which cannot reach the global flop minimum among peak-optimal trees: a peak-tying subtree chosen for local flop reasons can foreclose a far cheaper completion. Replace the single-cell state with a per-subset Pareto frontier of non-dominated (peak, flops) points (PeakModel/PeakBatchedModel::State now a vector of FrontPoint/BFrontPoint; relax() crosses child frontiers and prunes via pareto_insert). Reconstruct threads the chosen frontier index through the tree. Add peak_flops_tolerance (OptimizeOptions, default 0.10; plumbed through single_term_opt and opt_pure_product): the final selection picks the fewest-flops schedule whose peak is within (1+tol) of the minimum, instead of strict peak-min. This lets the optimizer form a persistent intermediate that is slightly larger in peak but much cheaper in (volatile-weighted) flops -- e.g. the 4-PNO ladder integral W=(ac|bd)=gCC*gCC, formed once and contracted with the amplitude, rather than folding the amplitude into a half-transform and recomputing the ladder on every replay. The model default stays 0.0 so the brute-force oracle tests remain strict. Update the [osv] tests: assert the structural invariant (the OSV-deferred outer product is avoided by default, reproduced under persistent_only) rather than brittle max_imed equalities, and add a PPL form-W vs fold-t test that also documents the vw=1 (caching-off) regime where fold-t dominates W on both peak and flops.
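A rough sketch of the two pieces this commit describes: pruning a per-subset frontier of (peak, flops) points, and the final tolerance-based selection. The type `FrontPointSketch` and both function names are assumptions for illustration; the real `pareto_insert` and `FrontPoint` may differ, and the model-specific step that crosses child frontiers in `relax()` is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical frontier point: primary cost (peak) and secondary cost (flops).
struct FrontPointSketch {
  double peak;
  double flops;
};

// a dominates b if it is no worse on both axes (equal points count as
// dominated, so duplicates are not stored twice).
inline bool dominates(FrontPointSketch const& a, FrontPointSketch const& b) {
  return a.peak <= b.peak && a.flops <= b.flops;
}

// Insert p unless an existing point dominates it; drop points that p
// dominates. Returns true if p was kept.
bool pareto_insert_sketch(std::vector<FrontPointSketch>& front,
                          FrontPointSketch p) {
  for (auto const& q : front)
    if (dominates(q, p)) return false;
  front.erase(std::remove_if(front.begin(), front.end(),
                             [&](FrontPointSketch const& q) {
                               return dominates(p, q);
                             }),
              front.end());
  front.push_back(p);
  return true;
}

// Final selection: fewest flops among points with peak <= (1 + tol) * min peak.
FrontPointSketch select_within_tolerance(
    std::vector<FrontPointSketch> const& front, double tol) {
  assert(!front.empty() && tol >= 0.0);
  double min_peak = front.front().peak;
  for (auto const& q : front) min_peak = std::min(min_peak, q.peak);
  FrontPointSketch const* best = nullptr;
  for (auto const& q : front)
    if (q.peak <= (1.0 + tol) * min_peak && (!best || q.flops < best->flops))
      best = &q;
  return *best;  // non-null: the min-peak point always qualifies
}
```

With `tol = 0.0` the selection is strict peak-min, matching the model default that keeps the brute-force oracle tests strict; a positive tolerance lets a slightly higher-peak, cheaper point win.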
The peak objectives (DensePeakSize/Batched) broke ties among near-peak-equal schedules on flops alone. That under-ranks bandwidth-bound contractions (e.g. a single-PNO-index contraction, arithmetic intensity ~2d): modest flops, large wall time. The flop tie-break therefore treats folding the amplitude in early -- making a large intermediate volatile and rebuilding it every iteration -- as nearly free, and picks it below an unnecessarily high volatile_weight. Replace the per-contraction tie-break cost with a roofline wall-time proxy: cost = max(flops, beta * Q), Q = max(traffic, kappa * flops / sqrt(M / c0)) where traffic = |L|+|R|+|Z| (proto-aware footprints), beta = machine balance (FLOPs per element of traffic), M = binding fast-memory capacity in elements, and the sqrt(M) term is the Hong-Kung / Loomis-Whitney finite-cache re-read bound (c0 ~ 3 resident tiles, kappa a calibratable prefactor). The volatile replay weight still multiplies the whole op (a volatile op pays the roofline cost on every replay). With beta == 0 (default) this is exactly the previous flop-only tie-break -- no behavior change -- so the new RooflineParams default leaves all existing results bit-identical. The model is shape-aware where it must be and inert where it should be: compute-bound (dense) contractions have max() collapse to flops (the byte term never binds), so the dense case needs no tuning; bandwidth-bound (PNO/CSV) contractions are charged their true memory traffic, which is what lets the optimizer keep a reused integral persistent instead of rebuilding it per iteration. Design: doc/dev/specs/2026-06-23-roofline-tiebreak-cost.md (mpqc4). Plumbed via RooflineParams through OptimizeOptions, single_term_opt, and opt_pure_product; the DenseFLOPs/DenseSize footprint_weight is unchanged.
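The per-contraction formula above can be written out as a short function. This is a sketch under assumptions: the struct `RooflineSketch` and its field names are illustrative rather than the actual `RooflineParams` layout, and `traffic` is taken as an already-computed |L|+|R|+|Z|.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical parameter bundle; field names are illustrative only.
struct RooflineSketch {
  double beta = 0.0;   // machine balance: FLOPs per element of traffic
  double kappa = 1.0;  // calibratable prefactor of the re-read term
  double M = 1.0;      // binding fast-memory capacity, in elements
  double c0 = 3.0;     // number of resident tiles
};

// Per-contraction secondary cost:
//   Q    = max(traffic, kappa * flops / sqrt(M / c0))
//   cost = replay_weight * max(flops, beta * Q)
// traffic is |L| + |R| + |Z| in elements; replay_weight is the volatile
// replay weight (1 for a non-volatile op).
double roofline_cost(double flops, double traffic, double replay_weight,
                     RooflineSketch const& p) {
  assert(p.M > 0.0 && p.c0 > 0.0);
  if (p.beta == 0.0) return replay_weight * flops;  // flop-only tie-break
  double const reread = p.kappa * flops / std::sqrt(p.M / p.c0);
  double const Q = std::max(traffic, reread);
  return replay_weight * std::max(flops, p.beta * Q);
}
```

With `beta == 0` the function returns the weighted flop count, which is the default-neutral behavior the commit describes; a contraction with few flops and large traffic is charged `beta * traffic` instead.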
Design of record for the peak objectives' roofline secondary cost (commit "optimize: roofline secondary cost for the peak-objective tie-break"): per-op max(flops, beta*Q) with the Hong-Kung sqrt(M) finite-cache re-read term, proto-explicit traffic, and the replay weight. Lives alongside the other CostModel design docs under doc/dev/specs/. (Moved here from mpqc4 -- it describes the SeQuant cost model; the code comments already point at this path.)
Pull request overview
This PR extends SeQuant’s single-term tensor-contraction optimizer with peak-memory objectives (including a batch-aware peak model), factors the DP driver into a generic run_single_term_opt<Model> framework, and unifies batching policy between optimizer and runtime evaluation via a shared BatchPolicy.
Changes:
- Added peak-memory cost models (`DensePeakSize`, `DensePeakSizeBatched`) with Pareto-frontier tie-break and an optional roofline secondary cost.
- Refactored single-term optimization into a generic DP driver + model types (`AdditiveModel`, `PeakModel`, `PeakBatchedModel`) and removed duplicated legacy DP implementations.
- Introduced `BatchPolicy` and an eval-layer `make_evaluator` adapter; updated batched evaluator API to take a per-index batch-size function; updated eval tests accordingly.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit/test_eval_ta.cpp | Updates batched-evaluator test call sites to the new per-index batch-size function API; adds an adapter equivalence test for make_evaluator(BatchPolicy, …). |
| SeQuant/core/optimize/single_term.hpp | Adds composite-aware sizing (inner_pow), batching helpers, and routes optimization through model-based driver; introduces a circular include with cost_model.hpp. |
| SeQuant/core/optimize/options.hpp | Extends objective enum with peak objectives; adds BatchPolicy, inner_pow, roofline params, and peak_flops_tolerance. |
| SeQuant/core/optimize/optimize.cpp | Wires new optimizer options through runtime dispatch for all objective functions, including peak and batched peak. |
| SeQuant/core/optimize/cost_model.hpp | New: generic DP driver, CostModel concept, additive/peak/batched-peak model implementations, roofline secondary cost, and debug helpers. |
| SeQuant/core/eval/eval.hpp | Changes make_batched_custom_evaluator batch size to a per-index function; adds persistent_only; adds make_evaluator adapter over BatchPolicy. |
| SeQuant/core/batch_policy.hpp | New: shared batching policy container used by both optimizer and evaluator. |
| doc/dev/specs/2026-06-23-roofline-tiebreak-cost.md | New design spec describing the roofline tie-break model and validation plan. |
| doc/dev/specs/2026-06-19-cost-model-phase4-batch-policy-design.md | New design draft for shared BatchPolicy and eval adapter integration. |
| doc/dev/specs/2026-06-18-cost-model-phase3-generic-driver-design.md | New design draft for generic DP driver + CostModel types refactor. |
| doc/dev/specs/2026-06-18-cost-model-batch-aware-design.md | New design draft for batch-aware peak-memory objective formulation and validation strategy. |
| doc/dev/plans/2026-06-19-cost-model-phase4-plan.md | New implementation plan for shared BatchPolicy across optimizer/runtime. |
| doc/dev/plans/2026-06-19-cost-model-phase3-plan.md | New implementation plan for generic driver/model refactor. |
| doc/dev/plans/2026-06-19-cost-model-oldimpl-removal-plan.md | New plan for removing legacy DP implementations after model refactor. |
| doc/dev/plans/2026-06-18-cost-model-batch-aware-plan.md | New plan for peak-memory objective rollout and validation. |
| doc/dev/plans/2026-06-18-cost-model-batch-aware-phase2-plan.md | New plan for batched peak-memory objective (multi-mode DP) rollout. |
    [[nodiscard]] auto make_evaluator(BatchPolicy const& policy, F yielder,
                                      ScopeGuardFactory make_scope_guard = {}) {
      auto is_volatile_node = [p = policy.is_volatile_leaf](auto const& n) -> bool {
        if (!n.leaf() || !n->is_tensor()) return false;
        return p && p(n->as_tensor());
      };
      return make_batched_custom_evaluator(
          std::move(yielder), policy.batch_target_size, policy.is_batchable_index,
          std::move(make_scope_guard), std::move(is_volatile_node),
          policy.persistent_only);
    }
Fixed in 3e5ee5c. make_evaluator now checks both policy.is_batchable_index and policy.batch_target_size; if either is empty it substitutes predicates that decline batching (an accept-nothing predicate, so batch_axis returns nullopt and target_batch_size is never called) rather than forwarding the empty std::functions. This matches the BatchPolicy docs (empty => no batching) and avoids the std::bad_function_call at evaluation time.
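The guard described in this reply might look roughly like the following. It is a sketch only: `IndexStub`, `BatchPolicySketch`, and `with_declining_defaults` are hypothetical names, and the real `BatchPolicy` carries more members than the two shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>

// Hypothetical stand-in for the real Index type.
struct IndexStub {
  std::string label;
};

// Hypothetical subset of BatchPolicy: only the two callables discussed above.
struct BatchPolicySketch {
  std::function<bool(IndexStub const&)> is_batchable_index;
  std::function<std::size_t(IndexStub const&)> batch_target_size;
};

// If either callable is empty, substitute predicates that decline batching,
// so that later calls cannot throw std::bad_function_call.
BatchPolicySketch with_declining_defaults(BatchPolicySketch policy) {
  if (!policy.is_batchable_index || !policy.batch_target_size) {
    // Accept-nothing predicate: no index is ever chosen as a batch axis.
    policy.is_batchable_index = [](IndexStub const&) { return false; };
    // Placeholder that means "no slicing"; it is not expected to be consulted
    // because nothing is batchable.
    policy.batch_target_size = [](IndexStub const&) {
      return std::numeric_limits<std::size_t>::max();
    };
  }
  return policy;
}
```

A fully populated policy passes through unchanged, while a default-constructed one becomes safe to call and declines every index.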
    template <typename Tot, typename Ixex, typename InnerPow>
    double inner_aware_volume(Tot const& tot_idxs, Ixex const& ixex,
                              InnerPow const& inner_pow) {
      double mem = ranges::accumulate(tot_idxs.outer, 1., std::multiplies{}, ixex);
      if (inner_pow) {
        for (auto const& c : tot_idxs.inner) {
          std::size_t k = 0;
          for (auto const& o : tot_idxs.inner)
            if (o.proto_indices() == c.proto_indices()) ++k;
          mem *= inner_pow(c, k);
        }
      } else {
        mem = ranges::accumulate(tot_idxs.inner, mem, std::multiplies{}, ixex);
      }
      return mem;
    }
Fixed in 3e5ee5c. Added an option_engaged() helper: for std::function/function pointers it reports the engaged (non-empty) state via contextual conversion to bool; for any other callable (e.g. a plain lambda, which has no empty state) it returns true. inner_aware_volume now does if (option_engaged(inner_pow)), so a caller may pass a bare lambda for inner_pow as well as an optionally-empty std::function.
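One way such a helper could be written is shown below. This is an assumption about the shape of `option_engaged()`, not the committed code: the name `option_engaged_sketch`, the use of `std::is_constructible_v` to detect a bool conversion, and the simplified `volume_sketch` caller (with a hypothetical `inner_pow(k)` signature) are all illustrative.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

// Reports whether an optional callable is usable. Types that convert to bool
// (std::function, function pointers) report their own state; any other
// callable (e.g. a capturing lambda) has no empty state and counts as engaged.
template <typename F>
bool option_engaged_sketch(F const& f) {
  if constexpr (std::is_constructible_v<bool, F const&>) {
    return static_cast<bool>(f);
  } else {
    (void)f;
    return true;
  }
}

// Simplified caller: use the k-aware extent when inner_pow is engaged,
// otherwise fall back to a plain average extent.
template <typename InnerPow>
double volume_sketch(double outer, double inner_avg,
                     InnerPow const& inner_pow) {
  return option_engaged_sketch(inner_pow) ? outer * inner_pow(2)
                                          : outer * inner_avg;
}
```

Because the check is resolved per type at compile time, the same call site accepts an optionally-empty `std::function` and a bare lambda without requiring the latter to be bool-convertible.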
    // The additive arms of single_term_opt<Metric> below are routed through the
    // generic CostModel driver. cost_model.hpp includes this header (for the DP
    // helpers + EvalSequence + OptRes); the include guards make the cycle a no-op,
    // and every helper cost_model.hpp needs is defined above this point.
    #include <SeQuant/core/optimize/cost_model.hpp>
Fixed in 3e5ee5c by splitting the shared DP helpers (cost counters, OptRes/EvalSequence, init_results, the subset/bipartition scaffolding) into a new single_term_detail.hpp. The dependency graph is now a DAG: single_term_detail.hpp (no optimize deps) <- cost_model.hpp <- single_term.hpp. cost_model.hpp includes only the detail header, and single_term.hpp includes cost_model.hpp before defining the public single_term_opt<Metric> entry points. cost_model.hpp is now standalone-includable — verified by compiling a TU that does #include <SeQuant/core/optimize/cost_model.hpp> first with the real build flags. Also registered cost_model.hpp (previously unlisted) and the new header in CMakeLists.
Collapse the five positional cost-model parameters of opt::single_term_opt (is_volatile_leaf, volatile_weight, footprint_weight, peak_flops_tolerance, roofline) into a single CostParams struct, addressing the review request to pass the cost knobs as one object rather than many positional arguments. The detail overload unpacks CostParams into local aliases so the recurrence arms are unchanged; opt_pure_product builds one CostParams from OptimizeOptions and forwards it. Behavior is bit-identical (full unit suite: 6375 assertions, 56 cases).
…ow, empty BatchPolicy) Three robustness fixes from the PR #559 Copilot review:
- Break the single_term.hpp <-> cost_model.hpp include cycle. The shared DP helpers (cost counters, OptRes/EvalSequence, init_results) move into a new single_term_detail.hpp; cost_model.hpp includes that instead of single_term.hpp, and single_term.hpp includes cost_model.hpp before defining the public single_term_opt<Metric> entry points. cost_model.hpp is now standalone-includable (verified by compiling a TU that includes it first). Register cost_model.hpp (previously unlisted) and single_term_detail.hpp in CMakeLists.
- inner_aware_volume() no longer requires its InnerPow template argument to be bool-convertible: an option_engaged() helper reports the engaged state of std::function/function pointers and treats any other callable (e.g. a bare lambda) as always engaged, so callers may pass a lambda for inner_pow.
- make_evaluator() defensively declines batching when BatchPolicy::is_batchable_index or batch_target_size is empty (per the BatchPolicy docs), instead of forwarding an empty std::function that would throw std::bad_function_call at eval time.
Full unit suite unchanged: 6375 assertions, 56 cases.
Remove the SEQUANT_PEAK_DEBUG-gated peak_batched_debug() diagnostic (a predicted-vs-unsliced peak report used while diagnosing CSV/PNO composite sizing) and its call site, plus the now-unused <cstdlib>/<iostream>/<ostream> includes it required. No effect on the optimizer; full unit suite unchanged (6375 assertions, 56 cases).
The eval trace's rss= reported this rank's RSS, which is misleading on multiple ranks (per-rank RSS is uneven; the reader cannot recover total app memory). Add Logger::eval.rss_reduce, an optional std::function<size_t(size_t)> the tensor-algebra backend injects (it holds the World) to map a rank-local RSS to the value to report -- e.g. the sum over all ranks. The eval log path runs on every rank (printing() is level>0, identical across ranks), so a collective reducer here is matched. Empty (default) preserves the previous per-rank value.
For each single-term factorization chosen by opt_pure_product, log every
intermediate's result footprint AS THE COST MODEL SIZES IT (idx_to_extent +
inner_pow) and the peak (the value DensePeakSize minimizes), gated by env var
SEQUANT_FACTORIZER_DEBUG. Reveals why the factorizer accepts a given
intermediate -- e.g. an under-sized multi-composite tensor whose computed
footprint is far below its realized size. outer{...} lists free-index extents,
inner{Np:e,...} the CSV/PNO composites (proto count N, model extent e).
Reuses tot_indices + opt::detail::inner_aware_volume (same sizing as
footprint_counter). No behavior change when the env var is unset.
The batched evaluator (make_batched_custom_evaluator) evaluates the trigger plus any cross-term persistent finals (the replay group) in one pass via the recursive evaluate, which does not emit Term|Begin/End. So those ops appeared under whichever term first triggered batching, with no header of their own (a whole residual showed a single Term|Begin and no End). Emit a Term marker for each batched member around the group's evaluation so the per-op Eval lines are attributed to their expression(s). Verified: balanced Begin/End counts.
The previous fix emitted one Term|Begin per batched-group member, which read as a misleading hierarchy (e.g. 8 nested Term|Begin) -- the members are SIBLINGS co-evaluated over the aux batches, not parent/child. Emit instead a flat "BatchGroup | Begin | <M> members co-evaluated over <K> aux batches" header, one "BatchMember | <expr>" per member, and "BatchGroup | End", distinct from the top-level per-term Term|Begin/End. Verified on benzene: a group's members list correctly and enclose the interleaved per-op Eval lines.
cache_manager derived persistence purely from the volatility (NV/V) frontier, blind to batchability. Once batching was decoupled from V/NV (BatchPolicy::persistent_only defaults false -> batch across the board), a node whose result carries a batchable axis FREE -- e.g. a half-transformed DF integral with a free projected-AO index -- is sliced by the runtime evaluator and priced sliced by the single-term optimizer, yet cache_manager still classified it persistent and materialized it whole, holding it across iterations. Hand cache_manager the same is_batchable_index predicate the optimizer and runtime evaluator already share, and veto caching of any node whose result carries an accepted index (neither NP repeat nor P frontier) -- the structural counterpart of the max_footprint gate, keyed on the batch axis rather than a byte threshold. Default never_batchable leaves every existing caller unchanged. Add cache_manager_batch_axis_veto.
When make_batched_custom_evaluator declines to batch a node (no batchable contracted index, persistence gate, no axis-carrying leaf, or a single batch) whose result nonetheless carries a batchable axis FREE, that node is built whole by the regular evaluator -- the exact free-axis intermediate the optimizer priced sliced. Log only those declines, tagged with the gate that fired, to std::cout under the env var (standalone: does not require the eval trace level). This distinguishes a held CSE the cache veto already removes from a genuine transient full build by a non-batched consumer.
PeakBatchedModel charges accumulation_factor * sz(result) on each node that slices a batchable index -- the K += contribution co-residency an eager backend (TiledArray) incurs while building a batch-accumulated intermediate. Plumbed through BatchPolicy / OptimizeOptions / CostParams; default 0.0 (backend- agnostic, no factorization change). Asserts at most one batchable index when the factor is nonzero: the per-node, once-per-node charge is single-axis only (the CC residual is linear in the Hamiltonian, so each term carries one DF aux). Adds the [bubble] test: the exchange quadratic bubble g.t2.t2 at water-20 extents, measuring early-K (held-whole 4-occ/2-PNO integral) vs late-K t.(gC) peaks per batch size. They share the peak floor (premium <0.1%), so early-K is correct -- equal peak plus a persistent build-once flop win; the memory floor is set by the aux batch size, not the factorization choice.
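The accumulation charge can be pictured with a small sketch. It assumes the charge is simply added to whatever footprint the pure-slice model already counts at the node; the function name and parameters are hypothetical and the actual PeakBatchedModel recurrence is more involved.

```cpp
#include <cassert>
#include <cstddef>

// Footprint charged at one node on the peak axis (illustrative only).
//   base_footprint      : what the pure-slice model already counts here
//   result_size         : sz(result), the size of the accumulated result
//   slices_here         : whether this node slices a batchable index
//   accumulation_factor : 0.0 means backend-agnostic (no extra charge)
//   n_batchable         : number of batchable indices in the term
double node_footprint_sketch(double base_footprint, double result_size,
                             bool slices_here, double accumulation_factor,
                             std::size_t n_batchable) {
  assert(accumulation_factor >= 0.0);
  // The once-per-node charge is single-axis only.
  assert(accumulation_factor == 0.0 || n_batchable <= 1);
  double const extra = slices_here ? accumulation_factor * result_size : 0.0;
  return base_footprint + extra;
}
```

With the default factor of 0.0 the node footprint is unchanged; with a factor of 1.0 a slicing node is charged one extra copy of its result, standing for the in-flight contribution held alongside the accumulator.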
Drop the opt-in batch-decline diagnostic (the decline_log lambda and its four call sites) added during the batched-eval investigation; it has served its purpose. No behavior change (it was env-gated and quiet by default).
TL;DR. Generalizes single-term optimization into a pluggable `CostModel` framework, so new objectives can be added by writing one model type. The flagship instances delivered here are the peak-memory objectives (`DensePeakSize`/`DensePeakSizeBatched`) and a roofline-based tie-break, which together make local-correlation (PNO/CSV) residuals memory-bounded and keep reused integrals cached.

## Why this PR
In CC-type methods each residual term is a product of many tensors, and the single-term optimizer chooses the order in which to contract them. Today it can minimize FLOP count (`DenseFLOPs`) or total operand storage (`DenseSize`), both essentially order-insensitive and blind to how much memory is live at once.

For local correlation methods (PNO/OSV/CSV CCSD, F12) that blindness is the problem: peak memory, not FLOPs, is the binding resource. Density-fitted integral construction can transiently materialize an enormous intermediate that still carries a free projected-AO (PAO) index, and the contraction order is exactly what decides whether that intermediate is ever formed. A FLOP- or size-minimizing order will happily build it.
This PR teaches the single-term optimizer about peak memory. It adds peak-memory objectives, a small framework to host them (and user-defined ones), a batching model that bounds the peak by slicing the DF index, one batchability policy shared with the runtime evaluator, and — in the most recent commits — the secondary-cost (tie-break) work that makes the optimizer reliably keep reused integrals resident instead of recomputing them.
Everything new is opt-in and default-neutral: the existing `DenseFLOPs`/`DenseSize` results are bit-for-bit unchanged, and the roofline tie-break is off by default.

## What's added
### 1. A `CostModel` framework (refactor)

The subset-lattice + bipartition dynamic program that drives single-term optimization is factored into a single generic driver, `run_single_term_opt<Model>`. Each objective is now a small model type that supplies the recurrence and the reconstruction:

- `AdditiveModel`: the existing FLOPs/size objective.
- `PeakModel`: the new peak-memory objective.
- `PeakBatchedModel`: peak memory with batching.

The driver is public and the `CostModel` concept is documented, so a downstream user can define a custom objective and drive it directly. This is also what enables the DRY cleanup below.

### 2. Two peak-memory objectives

- `DensePeakSize` minimizes the all-co-resident peak: the maximum, over the evaluation schedule, of the combined size of every simultaneously-live tensor (intermediates and resident input leaves). Unlike FLOPs/size, contraction order is a real lever here. Validated against an independent brute-force oracle.
- `DensePeakSizeBatched` models batched evaluation: each index flagged batchable (e.g. the DF/RI auxiliary index) is treated as sliced to a target size, so an intermediate carrying it is materialized one slice at a time. The DP minimizes peak over the worst-case sliced configuration.

### 3. One batchability policy (`BatchPolicy`, `core/batch_policy.hpp`)

A single source of truth (`is_batchable_index`, per-index `batch_target_size`, `is_volatile_leaf`) consumed by both the optimizer (so it models what the runtime will do) and the runtime batched evaluator (via a `make_evaluator` adapter over `make_batched_custom_evaluator`). One object, so the two can't drift apart.

### 4. Tie-break quality: Pareto frontier + roofline cost
The peak objective has a primary axis (peak) and a secondary cost used to choose among schedules that tie (or nearly tie) on peak. That secondary cost turns out to matter a lot: it decides whether an expensive, reused integral is formed once and cached or recomputed every iteration. Two refinements:
- Pareto-frontier DP (`peak_flops_tolerance`). A pure peak-min DP with a single per-subset winner cannot reach the global flop-minimum among peak-optimal trees (the max-recurrence has no optimal substructure for the secondary objective). The DP now carries a Pareto frontier of non-dominated `(peak, secondary)` points per subset. `peak_flops_tolerance` (default 0.10) then selects the cheapest schedule whose peak is within `(1+tol)` of the minimum, letting the optimizer form, e.g., a persistent 4-index integral that is peak-neutral but much cheaper across the iterations that reuse it.
- Roofline secondary cost (`RooflineParams`, default off). The flop-only secondary cost under-weights bandwidth-bound contractions (modest FLOPs, large memory traffic), so it treats "fold the amplitude in early and rebuild a large intermediate every iteration" as nearly free. The secondary cost is replaced by a roofline wall-time proxy, `cost = max(flops, beta * Q)` with `Q = max(traffic, kappa * flops / sqrt(M / c0))`: `beta` is the machine balance (FLOPs per element of traffic); `Q` combines compulsory single-pass traffic with the Hong–Kung `sqrt(M)` finite-cache re-read bound; everything is computed on proto-explicit (block-sparse) extents and scaled by the volatile replay weight. Compute-bound (dense) contractions collapse to `flops` (the byte term is inert, so the dense case is unaffected by construction); bandwidth-bound (PNO single-index) contractions are charged their true traffic. `beta` (`machine_balance`) defaults to 0, which reproduces the previous flop-only tie-break exactly. Design write-up: `doc/dev/specs/2026-06-23-roofline-tiebreak-cost.md`.

### 5. DRY cleanup
With every recurrence now living in a model type, the duplicated standalone DP code is removed:
peak_cost/peak_cost_batched/reconstructed_batched_peak delegate to the models, and the five now-dead functions (peak_dp, peak_dp_batched, single_term_opt_impl, single_term_opt_peak_impl, single_term_opt_peak_batched_impl) plus struct PeakRes are deleted. Each recurrence exists in exactly one place; the independent brute_force_min_peak/batched_min_peak oracles stay as cross-checks.

6. Backend-aware batch-accumulation footprint penalty (accumulation_factor)

A batch-accumulated intermediate (K += contribution, summed over aux slices) keeps the accumulator and the in-flight contribution co-resident on an eager backend like TiledArray, a footprint the pure-slice model omits. accumulation_factor (on BatchPolicy/OptimizeOptions/CostParams) charges accumulation_factor * sz(result) on each node that slices a batchable index, on the primary peak axis. Default 0.0 (backend-agnostic, no factorization change); MPQC defaults it to 1.0 for TA. It asserts at most one batchable index when nonzero: the once-per-node charge is single-axis-only, a restriction the CC residual (linear in the Hamiltonian => one DF aux per term) always satisfies.

Also removed the throwaway SEQUANT_BATCH_DECLINE_DEBUG eval diagnostic added during the investigation.

Suggested review order
- core/optimize/single_term.hpp: the generic run_single_term_opt<Model> driver and the CostModel concept.
- core/optimize/cost_model.hpp: AdditiveModel, PeakModel, PeakBatchedModel, and roofline_op_cost.
- core/optimize/options.hpp: OptimizeOptions, RooflineParams, peak_flops_tolerance.
- core/batch_policy.hpp + the make_evaluator adapter: the shared policy.
- tests/unit/test_optimize.cpp: oracle cross-checks and the [osv] cases that motivate the tie-break.

Testing

- CostModel concept conformance (positive and negative).
- [bubble] test: the exchange quadratic bubble g.t2.t2 at water-20 extents, measuring early-K (held-whole 4-occ/2-PNO integral) vs late-K t.(gC) peaks per aux batch size. They share the peak floor (premium <0.1%), so early-K is correct: equal peak plus a persistent build-once flop win, and the memory floor is set by the aux batch size, not the factorization. Confirms accumulation_factor and the batch-size knob behave as modeled.
- DenseFLOPs/DenseSize unchanged bit-for-bit; with machine_balance = 0 the peak tie-break is identical to before.
- Loose tolerance, since batched summation order is thread-non-deterministic).
- he10, batched) reproduces the reference energy to 7e-15.

End-to-end validation of the tie-break work (PNO-CCSD, water clusters)
On real cluster runs, the roofline secondary cost makes a low, physically-correct volatile_weight (≈ the actual replay/iteration count) select the cached schedule that the old flop-only tie-break only reached with an artificially inflated volatile_weight. On the 20-mer the persistent PPL chain g·C → (a μ̃|Κ) → (a b|Κ) → W = (ab|cd) is built once and cached (per-iteration rebuilds of the 36 GB half-transform drop from ~60 to 0), giving roughly 10–17× faster iterations at the same volatile_weight, with bounded memory (aggregate logical working set ~112 GiB; measured node physical ~320 GiB on a 768 GB node).

Design docs

Specs and plans under doc/dev/{specs,plans}/ (2026-06).

Follow-up

MPQC repins MPQC_TRACKED_SEQUANT_TAG to this once merged (engine keywords sequant:optimize:machine_balance / :fast_mem_elems are already staged on the mpqc side).
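
Illustrative sketch of the section 4 tie-break

For reviewers who want the tie-break from section 4 in concrete form, here is a minimal C++ sketch of its two ingredients: pruning candidates to a Pareto frontier of (peak, secondary) points, then picking the cheapest point whose peak is within (1+tol) of the minimum. All names here (Candidate, roofline_proxy, pareto_frontier, select_within_tolerance) are hypothetical and are not the SeQuant API. The roofline formula is an assumption: the text above does not spell it out, so the common form max(flops, beta * Q) is used because it reduces to flops when beta = 0. Q is taken as a given input rather than derived from the Hong–Kung bound.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical illustration type; not part of SeQuant.
struct Candidate {
  double peak;       // primary axis: peak footprint (elements)
  double secondary;  // secondary axis: tie-break cost (flops or roofline proxy)
};

// ASSUMED roofline form: max(flops, beta * Q), with Q the memory traffic in
// elements and beta the machine balance. With beta == 0 this returns flops.
inline double roofline_proxy(double flops, double traffic_elems, double beta) {
  return std::max(flops, beta * traffic_elems);
}

// Keep only non-dominated points: a point is dropped if some other point has
// peak <= its peak and secondary <= its secondary.
std::vector<Candidate> pareto_frontier(std::vector<Candidate> pts) {
  std::sort(pts.begin(), pts.end(),
            [](const Candidate& a, const Candidate& b) {
              return a.peak != b.peak ? a.peak < b.peak
                                      : a.secondary < b.secondary;
            });
  std::vector<Candidate> front;
  double best_secondary = std::numeric_limits<double>::infinity();
  for (const Candidate& p : pts) {
    // Peaks are visited in ascending order, so p survives only if it
    // strictly improves on the best secondary cost seen so far.
    if (p.secondary < best_secondary) {
      front.push_back(p);
      best_secondary = p.secondary;
    }
  }
  return front;
}

// Cheapest candidate whose peak is within (1 + tol) of the minimum peak.
Candidate select_within_tolerance(const std::vector<Candidate>& front,
                                  double tol) {
  assert(!front.empty() && tol >= 0.0);
  const double min_peak =
      std::min_element(front.begin(), front.end(),
                       [](const Candidate& a, const Candidate& b) {
                         return a.peak < b.peak;
                       })
          ->peak;
  const Candidate* best = nullptr;
  for (const Candidate& p : front) {
    const bool admissible = p.peak <= (1.0 + tol) * min_peak;
    if (admissible && (best == nullptr || p.secondary < best->secondary)) {
      best = &p;
    }
  }
  return *best;  // non-null: the min-peak point is always admissible
}
```

With tol = 0 the selection falls back to the pure peak minimum. A nonzero tolerance lets a slightly higher-peak but much cheaper schedule win, which is the behavior described for the persistent 4-index integral.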