Generic CostModel for single-term optimization (peak-memory & batched objectives, roofline tie-break) #559

Open
evaleev wants to merge 58 commits into master from feature/cost-model-batch-aware

@evaleev evaleev commented Jun 19, 2026

Member

TL;DR. Generalizes single-term optimization into a pluggable CostModel framework, so new objectives can be added by writing one model type. The flagship instances delivered here are the peak-memory objectives (DensePeakSize/DensePeakSizeBatched) and a roofline-based tie-break, which together make local-correlation (PNO/CSV) residuals memory-bounded and keep reused integrals cached.

Why this PR

In CC-type methods each residual term is a product of many tensors, and the single-term optimizer chooses the order in which to contract them. Today it can minimize FLOP count (DenseFLOPs) or total operand storage (DenseSize) — both essentially order-insensitive and blind to how much memory is live at once.

For local correlation methods (PNO/OSV/CSV CCSD, F12) that blindness is the problem: peak memory, not FLOPs, is the binding resource. Density-fitted integral construction can transiently materialize an enormous intermediate that still carries a free projected-AO (PAO) index, and the contraction order is exactly what decides whether that intermediate is ever formed. A FLOP- or size-minimizing order will happily build it.

This PR teaches the single-term optimizer about peak memory. It adds peak-memory objectives, a small framework to host them (and user-defined ones), a batching model that bounds the peak by slicing the DF index, one batchability policy shared with the runtime evaluator, and — in the most recent commits — the secondary-cost (tie-break) work that makes the optimizer reliably keep reused integrals resident instead of recomputing them.

Everything new is opt-in and default-neutral: the existing DenseFLOPs/DenseSize results are bit-for-bit unchanged, and the roofline tie-break is off by default.

What's added

1. A CostModel framework (refactor)

The subset-lattice + bipartition dynamic program that drives single-term optimization is factored into a single generic driver, run_single_term_opt<Model>. Each objective is now a small model type that supplies the recurrence and the reconstruction:

  • AdditiveModel — the existing FLOPs/size objective.
  • PeakModel — the new peak-memory objective.
  • PeakBatchedModel — peak memory with batching.

The driver is public and the CostModel concept is documented, so a downstream user can define a custom objective and drive it directly. This is also what enables the DRY cleanup below.
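The division of labor between driver and model can be illustrated with a minimal sketch. It is not SeQuant's interface: `run_subset_dp`, `ToyAdditiveModel`, and the `leaf`/`worst`/`relax` members are hypothetical names, and the toy join cost (product of the largest leaf weights on each side) is chosen only so that the contraction order matters. The driver owns the subset enumeration and the bipartition loop; the model owns the recurrence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical additive model: cost(subset) = cost(lhs) + cost(rhs) + join cost.
struct ToyAdditiveModel {
  using State = double;
  std::vector<double> weight;  // one toy weight per input tensor

  // Toy "size" of a subset: the largest leaf weight it contains.
  double subset_weight(std::size_t mask) const {
    double w = 0.0;
    for (std::size_t i = 0; i < weight.size(); ++i)
      if (mask & (std::size_t{1} << i)) w = std::max(w, weight[i]);
    return w;
  }
  State leaf(std::size_t) const { return 0.0; }  // a leaf costs nothing
  State worst() const { return std::numeric_limits<double>::infinity(); }
  void relax(State& best, State const& l, State const& r, std::size_t lhs,
             std::size_t rhs) const {
    double const cand = l + r + subset_weight(lhs) * subset_weight(rhs);
    if (cand < best) best = cand;
  }
};

// Generic driver: visits every subset in increasing mask order and offers
// every unordered bipartition (lhs, rhs) of that subset to the model.
template <typename Model>
typename Model::State run_subset_dp(Model const& m, std::size_t n) {
  assert(n >= 1 && n < 20);
  std::size_t const full = (std::size_t{1} << n) - 1;
  std::vector<typename Model::State> best(full + 1);
  for (std::size_t s = 1; s <= full; ++s) {
    if ((s & (s - 1)) == 0) {  // exactly one bit set: a leaf
      best[s] = m.leaf(s);
      continue;
    }
    best[s] = m.worst();
    for (std::size_t lhs = (s - 1) & s; lhs != 0; lhs = (lhs - 1) & s) {
      std::size_t const rhs = s ^ lhs;
      if (lhs < rhs) continue;  // visit each unordered pair once
      m.relax(best[s], best[lhs], best[rhs], lhs, rhs);
    }
  }
  return best[full];
}
```

With weights {1, 2, 3} the three possible trees cost 8, 9, and 9, and `run_subset_dp(ToyAdditiveModel{{1.0, 2.0, 3.0}}, 3)` returns 8. A peak-style objective would, under the same assumptions, change only `relax` (a max instead of a sum), which is the point of keeping the enumeration in one place.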

2. Two peak-memory objectives

  • DensePeakSize — minimizes the all-co-resident peak: the maximum, over the evaluation schedule, of the combined size of every simultaneously-live tensor (intermediates and resident input leaves). Unlike FLOPs/size, contraction order is a real lever here. Validated against an independent brute-force oracle.
  • DensePeakSizeBatched — models batched evaluation: each index flagged batchable (e.g. the DF/RI auxiliary index) is treated as sliced to a target size, so an intermediate carrying it is materialized one slice at a time. The DP minimizes peak over the worst-case sliced configuration.
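To make the all-co-resident peak concrete, here is a small sketch for a fixed binary contraction tree. It is a toy model under stated assumptions, not the PR's recurrence: the input leaves of a subtree are assumed resident until that subtree's result is formed, the result of a finished child stays live while its sibling is evaluated, and both operands plus the result are live while a node is formed. `Node`, `Peak`, `leaf`, `join`, and `peak_of` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>

struct Node {
  double size;                     // footprint of this tensor (leaf or result)
  std::unique_ptr<Node> lhs, rhs;  // both null for a leaf
  bool is_leaf() const { return !lhs; }
};

std::unique_ptr<Node> leaf(double s) {
  return std::make_unique<Node>(Node{s, nullptr, nullptr});
}
std::unique_ptr<Node> join(double s, std::unique_ptr<Node> l,
                           std::unique_ptr<Node> r) {
  return std::make_unique<Node>(Node{s, std::move(l), std::move(r)});
}

struct Peak {
  double leaves;  // combined size of the input leaves under this node
  double peak;    // peak live memory while evaluating this node
};

Peak peak_of(Node const& n) {
  if (n.is_leaf()) return {n.size, n.size};
  Peak const a = peak_of(*n.lhs);
  Peak const b = peak_of(*n.rhs);
  // Forming n: both operands and the result are live together.
  double const form = n.lhs->size + n.rhs->size + n.size;
  // Evaluate lhs first: rhs leaves are bystanders, then the lhs result
  // stays live while rhs is evaluated.
  double const a_first =
      std::max({a.peak + b.leaves, n.lhs->size + b.peak, form});
  // Evaluate rhs first: the mirror image.
  double const b_first =
      std::max({b.peak + a.leaves, n.rhs->size + a.peak, form});
  return {a.leaves + b.leaves, std::min(a_first, b_first)};
}
```

For leaves of size 4, 2, and 1, the tree `join(1, join(2, leaf(4), leaf(2)), leaf(1))` peaks at 9, while `join(1, leaf(4), join(6, leaf(2), leaf(1)))`, which forms a size-6 intermediate first, peaks at 13. The same inputs and the same final result give different peaks, which is the sense in which contraction order is a lever for this objective.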

3. One batchability policy (BatchPolicy, core/batch_policy.hpp)

A single source of truth — is_batchable_index, per-index batch_target_size, is_volatile_leaf — consumed by both the optimizer (so it models what the runtime will do) and the runtime batched evaluator (via a make_evaluator adapter over make_batched_custom_evaluator). One object, so the two can't drift apart.
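A rough sketch of what such a shared policy object might look like follows. The three member names come from the description above; everything else is an assumption for illustration: the `Index` and `Tensor` stand-ins, the `std::function` signatures, and the helpers `sliced_extent` and `make_aux_policy` are hypothetical and not SeQuant types. The slicing rule `min(extent, batch_target_size)` and the "empty means no batching" convention follow the commit messages in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-ins for the library's index and tensor types.
struct Index {
  std::string label;
  std::size_t extent;
};
struct Tensor {
  std::string name;
};

struct BatchPolicy {
  std::function<bool(Index const&)> is_batchable_index;
  std::function<std::size_t(Index const&)> batch_target_size;
  std::function<bool(Tensor const&)> is_volatile_leaf;
};

// Extent of one slice of `ix`, as both the optimizer and the evaluator
// would see it when they consult the same policy object.
std::size_t sliced_extent(BatchPolicy const& p, Index const& ix) {
  // An empty predicate or size function means "no batching".
  if (!p.is_batchable_index || !p.batch_target_size) return ix.extent;
  if (!p.is_batchable_index(ix)) return ix.extent;
  return std::min(ix.extent, p.batch_target_size(ix));
}

// Example policy (assumption): indices labelled K* are auxiliary and are
// sliced to a constant target size.
BatchPolicy make_aux_policy(std::size_t target) {
  BatchPolicy p;
  p.is_batchable_index = [](Index const& ix) {
    return !ix.label.empty() && ix.label[0] == 'K';
  };
  p.batch_target_size = [target](Index const&) { return target; };
  return p;
}
```

Under these assumptions, `sliced_extent(make_aux_policy(64), Index{"K1", 500})` is 64, an index already smaller than the target keeps its extent, and a default-constructed policy leaves every extent untouched. Because both consumers read the same object, a change to the predicate or target size cannot reach one without reaching the other.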

4. Tie-break quality: Pareto frontier + roofline cost

The peak objective has a primary axis (peak) and a secondary cost used to choose among schedules that tie (or nearly tie) on peak. That secondary cost turns out to matter a lot: it decides whether an expensive, reused integral is formed once and cached or recomputed every iteration. Two refinements:

  • Pareto-frontier DP (peak_flops_tolerance). A pure peak-min DP with a single per-subset winner cannot reach the global flop-minimum among peak-optimal trees (the max-recurrence has no optimal substructure for the secondary objective). The DP now carries a Pareto frontier of non-dominated (peak, secondary) points per subset. peak_flops_tolerance (default 0.10) then selects the cheapest schedule whose peak is within (1+tol) of the minimum — letting the optimizer form, e.g., a persistent 4-index integral that is peak-neutral but much cheaper across the iterations that reuse it.

  • Roofline secondary cost (RooflineParams, default off). The flop-only secondary cost under-weights bandwidth-bound contractions (modest FLOPs, large memory traffic), so it treats "fold the amplitude in early and rebuild a large intermediate every iteration" as nearly free. The secondary cost is replaced by a roofline wall-time proxy:

    cost = max( flops, beta * Q ),   Q = max( |L|+|R|+|Z|, kappa * flops / sqrt(M / c0) )
    

    beta is the machine balance (FLOPs per element of traffic); Q combines compulsory single-pass traffic with the Hong–Kung sqrt(M) finite-cache re-read bound; everything is computed on proto-explicit (block-sparse) extents and scaled by the volatile replay weight. Compute-bound (dense) contractions collapse to flops — the byte term is inert, so the dense case is unaffected by construction; bandwidth-bound (PNO single-index) contractions are charged their true traffic. beta (machine_balance) defaults to 0, which reproduces the previous flop-only tie-break exactly. Design write-up: doc/dev/specs/2026-06-23-roofline-tiebreak-cost.md.
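The formula above can be written out as a short function. This is a sketch of the per-contraction cost only, assuming plain `double` element counts; the struct below reuses the name `RooflineParams`, but its field names and the default values of `kappa` and `c0` are assumptions, and the volatile replay weight and proto-explicit extents are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Field names are illustrative assumptions, not the library's definition.
struct RooflineParams {
  double machine_balance = 0.0;  // beta: FLOPs per element of traffic
  double fast_mem_elems = 0.0;   // M: fast-memory capacity in elements
  double kappa = 1.0;            // prefactor of the re-read bound
  double c0 = 3.0;               // resident tiles sharing fast memory
};

// cost = max(flops, beta * Q), Q = max(|L|+|R|+|Z|, kappa*flops/sqrt(M/c0))
double roofline_cost(double flops, double L, double R, double Z,
                     RooflineParams const& p) {
  if (p.machine_balance == 0.0) return flops;  // flop-only tie-break
  double Q = L + R + Z;                        // compulsory single-pass traffic
  if (p.fast_mem_elems > 0.0)                  // finite-cache re-read bound
    Q = std::max(Q, p.kappa * flops / std::sqrt(p.fast_mem_elems / p.c0));
  return std::max(flops, p.machine_balance * Q);
}
```

As a worked check with beta = 10 and M = 3e6 elements (so sqrt(M / c0) = 1000): a dense contraction with 2e9 flops and operands of 1e6 elements each has Q = 3e6 and beta * Q = 3e7, so the cost stays at 2e9 flops. A bandwidth-bound contraction with 2e6 flops and traffic 1e6 + 10 + 1e6 has beta * Q = 20000100, ten times its flop count, so it is charged for its traffic instead.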

5. DRY cleanup

With every recurrence now living in a model type, the duplicated standalone DP code is removed: peak_cost/peak_cost_batched/reconstructed_batched_peak delegate to the models, and the five now-dead functions (peak_dp, peak_dp_batched, single_term_opt_impl, single_term_opt_peak_impl, single_term_opt_peak_batched_impl) plus struct PeakRes are deleted. Each recurrence exists in exactly one place; the independent brute_force_min_peak/batched_min_peak oracles stay as cross-checks.

6. Backend-aware batch-accumulation footprint penalty (accumulation_factor)

A batch-accumulated intermediate (K += contribution, summed over aux slices) co-resides the accumulator and the in-flight contribution on an eager backend like TiledArray — a footprint the pure-slice model omits. accumulation_factor (on BatchPolicy / OptimizeOptions / CostParams) charges accumulation_factor * sz(result) on each node that slices a batchable index, on the primary peak axis. Default 0.0 (backend-agnostic, no factorization change); MPQC defaults it to 1.0 for TA. It asserts at most one batchable index when nonzero — the per-node, once-per-node charge is single-axis-only, which the CC residual (linear in the Hamiltonian => one DF aux per term) always satisfies.

Also removed the throwaway SEQUANT_BATCH_DECLINE_DEBUG eval diagnostic added during the investigation.

Suggested review order

  1. core/optimize/single_term.hpp: the generic run_single_term_opt<Model> driver and the CostModel concept.
  2. core/optimize/cost_model.hpp: AdditiveModel, PeakModel, PeakBatchedModel, and roofline_op_cost.
  3. core/optimize/options.hpp: OptimizeOptions, RooflineParams, peak_flops_tolerance.
  4. core/batch_policy.hpp + the make_evaluator adapter: the shared policy.
  5. Tests in tests/unit/test_optimize.cpp: oracle cross-checks and the [osv] cases that motivate the tie-break.

Testing

  • Full SeQuant unit suite green (6375 assertions); new objectives cross-checked against independent brute-force oracles plus reconstruction-simulation checks; CostModel concept conformance (positive and negative).
  • New [bubble] test: the exchange quadratic bubble g.t2.t2 at water-20 extents, measuring early-K (held-whole 4-occ/2-PNO integral) vs late-K t.(gC) peaks per aux batch size. They share the peak floor (premium <0.1%), so early-K is correct — equal peak plus a persistent build-once flop win, and the memory floor is set by the aux batch size, not the factorization. Confirms accumulation_factor and the batch-size knob behave as modeled.
  • Behavior-preserving by default: DenseFLOPs/DenseSize unchanged bit-for-bit; with machine_balance = 0 the peak tie-break is identical to before.
  • Also fixes a pre-existing flaky eval test (the two batched-eval paths agree only to Loose tolerance, since batched summation order is thread-non-deterministic).
  • Downstream: MPQC CSV-CCk (he10, batched) reproduces the reference energy to 7e-15.

End-to-end validation of the tie-break work (PNO-CCSD, water clusters)

On real cluster runs, the roofline secondary cost makes a low, physically-correct volatile_weight (≈ the actual replay/iteration count) select the cached schedule that the old flop-only tie-break only reached with an artificially inflated volatile_weight. On the 20-mer the persistent PPL chain g·C → (a μ̃|Κ) → (a b|Κ) → W = (ab|cd) is built once and cached (per-iteration rebuilds of the 36 GB half-transform drop from ~60 to 0), giving roughly 10–17× faster iterations at the same volatile_weight, with bounded memory (aggregate logical working set ~112 GiB; measured node physical ~320 GiB on a 768 GB node).

Design docs

Specs and plans under doc/dev/{specs,plans}/ (2026-06).

Follow-up

MPQC repins MPQC_TRACKED_SEQUANT_TAG to this once merged (engine keywords sequant:optimize:machine_balance / :fast_mem_elems are already staged on the mpqc side).

evaleev added 30 commits June 18, 2026 20:47
…mization)

Adds doc/dev/specs spec for a CostModel abstraction that unifies single-term
optimization cost and runtime evaluation policy: peak-memory objective (pebbling
recurrence), batchability-aware footprint (monomial in batch-index extents,
peak_full/peak_slice two-mode, persistence-gated frontier), three built-in
models (DenseFLOPs, DensePeakSize, DensePeakSizeBatched), and custom injection.
… term

Phase 1 oracle revealed the pebbling recurrence computed a Sethi-Ullman
(register-style) peak that omits resident input leaves. Adopt the realistic
all-co-resident tensor peak: add per-subset L[n] (leaf-size sum) and the
bystander term L[other]+peak[child] to the DP; oracle (7.0 on the 3-leaf
example) and DP now agree. Same clean subset DP and optimal substructure.

The DensePeakSize enumerator, the peak_cost wrapper, and the public
single_term_opt Metric template parameter were undocumented. Describe the
peak-memory objective (all-co-resident model, order is a real lever) and its
Phase-1 limitation (no subnet CSE).

Make the DensePeakSizeBatched formulation concrete under the all-co-resident
model: explicit peak_full/peak_slice recurrence (full bystander terms, frontier
substitution), local batchable-frontier gate (batch-index internal AND
persistent), the validation strategy (slice mode reuses Phase-1 peak_cost at
batched extents; full mode vs a tree x order x batch-choice oracle), and the
Phase-2 OptimizeOptions plumbing (pre-CostModel).

Batchable indices slice independently (peak[n][B], B subset of the term's
batchable indices) rather than as one group, which would under-count and
mislead on multi-aux terms. Batch decision for Ki taken at the node where Ki is
internalized; objective peak[root][empty]; m=1 collapses to two-mode and ties to
Phase-1 peak_cost. Oracle and validation updated for per-index (incl. a
two-distinct-aux case).

DP gains a [B] dimension over the term's distinct batchable indices; oracle
threads the per-index slice context; reconstruction gets a full numeric
memory-simulation check. All-sliced corner ties to Phase-1 peak_cost.

Add DensePeakSizeBatched to ObjectiveFunction and two new
OptimizeOptions fields (is_batchable_index, batch_target_size).

Implement in detail namespace (single_term.hpp):
- batchable_index_list: distinct batchable indices in appearance order
- sliced_footprints: 2^m tables of subset_footprints, one per sliced-set B
- leaf_volatile_mask: bitmask of volatile leaf tensors (mirrors inline mask)

Test: "per-index batchability tables" SECTION verifies that aux.size()==2
for two distinct F-space indices, tables.size()==4 (2^2 sliced-sets), the
all-sliced footprint is strictly smaller than the unsliced one, and that
slicing only F1 shrinks only the F1-leaf footprint.
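The "one table per sliced-set B" idea in this commit can be sketched for a single tensor. This is a simplified illustration, not the `sliced_footprints` helper itself: `Idx` and `sliced_footprint_table` are hypothetical names, a single scalar target size is assumed, and bit j of B selects the j-th batchable index in appearance order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Idx {
  std::size_t extent;
  bool batchable;
};

// Returns 2^m footprints for one tensor, where m is the number of batchable
// indices it carries. Entry B is the volume when the batchable indices
// selected by bitmask B are sliced to `target` and the rest are held whole.
std::vector<double> sliced_footprint_table(std::vector<Idx> const& idx,
                                           std::size_t target) {
  std::vector<std::size_t> aux;  // positions of batchable indices
  for (std::size_t i = 0; i < idx.size(); ++i)
    if (idx[i].batchable) aux.push_back(i);
  std::size_t const m = aux.size();
  assert(m < 20);

  std::vector<double> table(std::size_t{1} << m);
  for (std::size_t B = 0; B < table.size(); ++B) {
    double vol = 1.0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
      std::size_t e = idx[i].extent;
      for (std::size_t j = 0; j < m; ++j)
        if (aux[j] == i && (B & (std::size_t{1} << j)))
          e = std::min(e, target);  // sliced: at most one batch wide
      vol *= static_cast<double>(e);
    }
    table[B] = vol;
  }
  return table;
}
```

For extents {10, 100, 200} with the last two batchable and a target of 20, the table is {200000, 40000, 20000, 4000}: slicing only the first auxiliary index shrinks only its factor, and the all-sliced entry is the smallest, mirroring the properties the test in this commit checks.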
…side)

One compile-time generic driver run_single_term_dp<Model> owns the subset
lattice + bipartition enumeration; each objective becomes a CostModel type
(State + Context + leaf/init/relax/finalize/reconstruct). Four built-ins
(AdditiveModel x2, PeakModel, PeakBatchedModel) map their existing DPs;
behavior-preserving (existing tests/oracles are the regression net). Evaluator
face and mpqc deferred to Phase 4.

4 tasks, behavior-preserving, model-by-model (AdditiveModel, PeakModel,
PeakBatchedModel) + concept/custom-model test. Old standalone DP/cost functions
kept as reference oracles; per-objective equivalence tests + full existing suite
green are the gates.
… cost; cover volatile path in equivalence test
…r optimizer + eval)

Bundle the batchability triple into one BatchPolicy{is_batchable_index,
batch_target_size (per-index), is_volatile_leaf} consumed by both the optimizer
(OptimizeOptions embeds it) and a thin eval-layer make_evaluator adapter (lifts
the Tensor volatile predicate to EvalNode). Generalizes batch_target_size from
scalar to per-index function. Two stages: SeQuant (policy+adapter+ripple), then
mpqc (construct once, feed both, delete dup). Behavior-preserving.

A1 batch_target_size scalar->per-index function; A2 BatchPolicy struct embedded
in OptimizeOptions; A3 eval-layer make_evaluator adapter; B1 mpqc construct-once
+ feed-both + delete-dup (CSV-CCk energy-match validation). Behavior-preserving;
A->B mpqc-compile window noted.

Replace every std::size_t batch/target_batch_size parameter on the
batched optimizer path and in make_batched_custom_evaluator with
std::function<std::size_t(Index const&)> batch_target_size.

Slicing applies min(extent(ix), batch_target_size(ix)), so a constant
lambda [](Index const&){ return N; } reproduces old scalar-N results.

Changed:
- OptimizeOptions::batch_target_size: size_t -> function<size_t(Index)>
- sliced_footprints, peak_dp_batched, peak_cost_batched,
  single_term_opt_peak_batched_impl, reconstructed_batched_peak: same
- detail::single_term_opt and public single_term_opt overload: same
- PeakBatchedModel::batch member: size_t -> function<size_t(Index)>
- make_batched_custom_evaluator: target_batch_size is now a function;
  call sites pass target_batch_size(*K) and target_batch_size(*Kk)
  to mode_batches (which still takes a scalar)

Tests: all existing batched test calls updated to pass constant lambdas;
new SECTION("per-index batch_target_size honored") verifies that
distinct per-index sizes produce different peak costs.

evaleev added 3 commits June 19, 2026 14:35
…hs' mutual agreement

The make_evaluator-vs-hand-built comparison asserted Tight (exact) equality
between two independent batched-summation evaluations, whose accumulation order
is thread-non-deterministic; this flaked by a few ULPs. Both paths already
compare Loose against the reference for the same reason; make the mutual
comparison Loose too.

Comment thread SeQuant/core/optimize/optimize.cpp Outdated

Collaborator

These files should be removed before merging.

Member Author

these files contain valuable details about the design.

Collaborator

I feel like they would have to be rewritten significantly in order to be actually useful documentation.

Member Author

probably, but no bandwidth available to write for human consumption. This is still better than nothing. This type of knowledge is not extractable from the code afterwards.

@Krzmbrzl Krzmbrzl Jun 25, 2026

Collaborator

If it's not intended for humans, I vote for moving it outside of docs (into something like agent-docs) or putting it in a subdirectory like docs/for_agents. Also, the files could be renamed to something a bit more meaningful.

evaleev added 9 commits June 20, 2026 17:15
…ed peak model

Prints per term (DensePeakSizeBatched): predicted peak vs the largest UNSLICED
node footprint in the DP-chosen tree, to localize whether the model under-counts
a discounted node (predicted << unsliced) or mis-ranks trees. Gated on the
SEQUANT_PEAK_DEBUG env var; no effect otherwise. TEMPORARY diagnostic -- to be
removed via git surgery before merge.
…odel fix)

The peak objectives sized a tensor with k composites sharing a proto-pair as
nocc^N * average_csv_extent^k, grossly under-counting it: the true block-sparse
volume is Sum_pairs d^k, and <d^k> >> <d>^k for heavy-tailed PNO domains (the
average is also diluted across the full nocc^N grid). The peak DP was therefore
blind to the multi-composite mu-tilde intermediates and freely built them
(water-14 OOM at 181.8 GiB; minimize-peak produced WORSE peak than flops).

footprint_counter now groups tot_indices.inner by proto-set and sizes each
member of a k-group by an optional k-aware inner_pow(c, k) = the k-th power mean
(Sum_pairs d^k / nocc^N)^(1/k), so nocc^N * inner_pow^k = Sum_pairs d^k exactly.
inner_pow is threaded through subset_footprints/sliced_footprints, the peak
models, single_term_opt, and OptimizeOptions. When empty (k=1) the behavior is
identical to before, so the [optimize] suite is unaffected.
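The size of the under-count, and why the k-th power mean removes it, can be checked on a tiny example. The sketch below uses hypothetical helper names (`sum_of_powers`, `averaged_volume`, `power_mean`) and made-up domain sizes; `grid` stands for nocc^N, and pairs that do not survive screening simply contribute no entry to `d`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// True block-sparse volume of a k-composite group: Sum_pairs d^k.
double sum_of_powers(std::vector<double> const& d, int k) {
  double s = 0.0;
  for (double x : d) s += std::pow(x, k);
  return s;
}

// The old estimate: grid * (average extent)^k, averaged over the full grid.
double averaged_volume(std::vector<double> const& d, double grid, int k) {
  return grid * std::pow(sum_of_powers(d, 1) / grid, k);
}

// k-th power mean over the full grid: (Sum_pairs d^k / grid)^(1/k).
// By construction grid * power_mean^k == Sum_pairs d^k.
double power_mean(std::vector<double> const& d, double grid, int k) {
  return std::pow(sum_of_powers(d, k) / grid, 1.0 / k);
}
```

Take a grid of 4 pairs of which three survive with domain sizes 10, 10, and 100, and k = 2. The true volume is 100 + 100 + 10000 = 10200, while the averaged estimate is 4 * 30^2 = 3600, almost a factor of three too small because a single large domain dominates the sum of squares. The power mean is sqrt(2550), roughly 50.5, and 4 * 50.5^2 recovers 10200. For k = 1 the power mean reduces to the plain average, consistent with the unchanged default behavior.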

Tetramer (cache off): peak max hw 7.11 -> 1.59 GiB, monster eliminated, now
below the flops factorization; energies match; [optimize] 309 assertions green.

(Bundled with the env-gated SEQUANT_PEAK_DEBUG diagnostic; to be separated.)
…jectives)

Extend the CSV/PNO composite fidelity fix to flops_counter and memsize_counter
via a shared inner_aware_volume() helper, so DenseFLOPs and DenseSize also size
a k-composite group over a shared proto-pair by Sum_pairs d^k (the k-th power
mean) instead of nocc^N * average^k. opt_pure_product now forwards
OptimizeOptions::inner_pow to the additive arms as well. Design audit: all four
objectives consume the same proto-index-dependent inner_pow input.

flops/size do not OOM without this (an over-built multi-composite contraction is
operation-expensive regardless), but the under-count is non-uniform across
contractions and can misrank trees, so accurate counts improve flop/size
optimality. Tetramer (cache off): flops factorization and peak unchanged
(2.74 GiB), energy matches; [optimize] 309 assertions green.
…notes)

Document the inner_pow refinement: @brief/@param/@return/@note on the shared
inner_aware_volume helper, and @param inner_pow on flops_counter, memsize_counter,
footprint_counter, subset_footprints, sliced_footprints, and single_term_opt
(detail + public). Comment-only; no code change.
…lops

The peak objectives are degenerate when the peak is set by an unavoidable
leaf/intermediate (e.g. a DF integral) that dominates many factorizations: among
equal-peak schedules the DP arbitrarily picked one, which could carry an
OSV/PNO uncontracted or form a rank-increasing outer product instead of
contracting it with its amplitude early. Add a lexicographic (peak, then
volatile-weighted flops) tie-break to PeakModel and PeakBatchedModel so the
least-work equal-peak schedule wins (volatile contractions weighted by
volatile_weight, matching the DenseFLOPs convention).

Also stop gating batching on persistence. The DensePeakSizeBatched cost model
and the runtime batched evaluator declined to slice subtrees containing a
volatile (amplitude) leaf; for footprint minimization that gate only raises the
modelled/realized peak (slicing the batch axis shrinks any intermediate carrying
it regardless of volatility, and leaves flops unchanged). Batch across the board
by default and add BatchPolicy::persistent_only (default false) to recover the
old persistent-only behavior, read identically by the optimizer and the runtime.
make_batched_custom_evaluator / make_evaluator now batch across the board by
default; the persistence-gate tests (volatile subtree -> not batched, and the
group-replay yield counts) assert the gated behavior, so opt them in with
persistent_only=true (BatchPolicy::persistent_only / the new param).

The peak objectives' DP previously kept one (peak, flops) winner per subset
with a lexicographic local tie-break, which cannot reach the global flop
minimum among peak-optimal trees: a peak-tying subtree chosen for local flop
reasons can foreclose a far cheaper completion. Replace the single-cell state
with a per-subset Pareto frontier of non-dominated (peak, flops) points
(PeakModel/PeakBatchedModel::State now a vector of FrontPoint/BFrontPoint;
relax() crosses child frontiers and prunes via pareto_insert). Reconstruct
threads the chosen frontier index through the tree.

Add peak_flops_tolerance (OptimizeOptions, default 0.10; plumbed through
single_term_opt and opt_pure_product): the final selection picks the
fewest-flops schedule whose peak is within (1+tol) of the minimum, instead of
strict peak-min. This lets the optimizer form a persistent intermediate that
is slightly larger in peak but much cheaper in (volatile-weighted) flops --
e.g. the 4-PNO ladder integral W=(ac|bd)=gCC*gCC, formed once and contracted
with the amplitude, rather than folding the amplitude into a half-transform
and recomputing the ladder on every replay. The model default stays 0.0 so
the brute-force oracle tests remain strict.
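The frontier pruning and the tolerance-based final selection described in this commit can be sketched in a few lines. `FrontPoint` and `pareto_insert` are names taken from the commit message, but the bodies below are one plausible reading rather than the code in the PR, and `select_within_tolerance` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct FrontPoint {
  double peak;
  double flops;
};

// Keep only non-dominated points: q dominates p when it is no worse on
// both axes.
void pareto_insert(std::vector<FrontPoint>& front, FrontPoint p) {
  for (auto const& q : front)
    if (q.peak <= p.peak && q.flops <= p.flops) return;  // p adds nothing
  front.erase(std::remove_if(front.begin(), front.end(),
                             [&](FrontPoint const& q) {
                               return p.peak <= q.peak && p.flops <= q.flops;
                             }),
              front.end());
  front.push_back(p);
}

// Final selection: the fewest-flops point whose peak is within (1 + tol)
// of the minimum peak on the frontier.
FrontPoint select_within_tolerance(std::vector<FrontPoint> const& front,
                                   double tol) {
  assert(!front.empty() && tol >= 0.0);
  double min_peak = front.front().peak;
  for (auto const& q : front) min_peak = std::min(min_peak, q.peak);
  FrontPoint const* best = nullptr;
  for (auto const& q : front)
    if (q.peak <= (1.0 + tol) * min_peak && (!best || q.flops < best->flops))
      best = &q;
  return *best;  // the minimum-peak point always qualifies
}
```

With a frontier holding (100, 50), (105, 10), and (120, 5), a tolerance of 0 selects the strict peak minimum (100, 50), while a tolerance of 0.10 admits peaks up to 110 and selects (105, 10): a 5% larger peak buys a fivefold reduction in the secondary cost. A single-winner DP would have discarded (105, 10) at the subset where it first appeared.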

Update the [osv] tests: assert the structural invariant (the OSV-deferred
outer product is avoided by default, reproduced under persistent_only) rather
than brittle max_imed equalities, and add a PPL form-W vs fold-t test that
also documents the vw=1 (caching-off) regime where fold-t dominates W on both
peak and flops.

The peak objectives (DensePeakSize/Batched) broke ties among near-peak-equal
schedules on flops alone. That under-ranks bandwidth-bound contractions (e.g. a
single-PNO-index contraction, arithmetic intensity ~2d): modest flops, large
wall time. The flop tie-break therefore treats folding the amplitude in early
-- making a large intermediate volatile and rebuilding it every iteration --
as nearly free, and picks it below an unnecessarily high volatile_weight.

Replace the per-contraction tie-break cost with a roofline wall-time proxy:

  cost = max(flops, beta * Q),  Q = max(traffic, kappa * flops / sqrt(M / c0))

where traffic = |L|+|R|+|Z| (proto-aware footprints), beta = machine balance
(FLOPs per element of traffic), M = binding fast-memory capacity in elements,
and the sqrt(M) term is the Hong-Kung / Loomis-Whitney finite-cache re-read
bound (c0 ~ 3 resident tiles, kappa a calibratable prefactor). The volatile
replay weight still multiplies the whole op (a volatile op pays the roofline
cost on every replay). With beta == 0 (default) this is exactly the previous
flop-only tie-break -- no behavior change -- so the new RooflineParams default
leaves all existing results bit-identical.

The model is shape-aware where it must be and inert where it should be:
compute-bound (dense) contractions have max() collapse to flops (the byte term
never binds), so the dense case needs no tuning; bandwidth-bound (PNO/CSV)
contractions are charged their true memory traffic, which is what lets the
optimizer keep a reused integral persistent instead of rebuilding it per
iteration. Design: doc/dev/specs/2026-06-23-roofline-tiebreak-cost.md (mpqc4).

Plumbed via RooflineParams through OptimizeOptions, single_term_opt, and
opt_pure_product; the DenseFLOPs/DenseSize footprint_weight is unchanged.

Design of record for the peak objectives' roofline secondary cost (commit
"optimize: roofline secondary cost for the peak-objective tie-break"): per-op
max(flops, beta*Q) with the Hong-Kung sqrt(M) finite-cache re-read term,
proto-explicit traffic, and the replay weight. Lives alongside the other
CostModel design docs under doc/dev/specs/. (Moved here from mpqc4 -- it
describes the SeQuant cost model; the code comments already point at this path.)

@evaleev evaleev requested a review from Copilot June 23, 2026 20:11
@evaleev evaleev changed the title CostModel: batch-aware peak-memory single-term optimization + DRY cleanup Generic CostModel for single-term optimization (peak-memory & batched objectives, roofline tie-break) Jun 23, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR extends SeQuant’s single-term tensor-contraction optimizer with peak-memory objectives (including a batch-aware peak model), factors the DP driver into a generic run_single_term_opt<Model> framework, and unifies batching policy between optimizer and runtime evaluation via a shared BatchPolicy.

Changes:

  • Added peak-memory cost models (DensePeakSize, DensePeakSizeBatched) with Pareto-frontier tie-break and an optional roofline secondary cost.
  • Refactored single-term optimization into a generic DP driver + model types (AdditiveModel, PeakModel, PeakBatchedModel) and removed duplicated legacy DP implementations.
  • Introduced BatchPolicy and an eval-layer make_evaluator adapter; updated batched evaluator API to take a per-index batch-size function; updated eval tests accordingly.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/unit/test_eval_ta.cpp | Updates batched-evaluator test call sites to the new per-index batch-size function API; adds an adapter equivalence test for make_evaluator(BatchPolicy, …). |
| SeQuant/core/optimize/single_term.hpp | Adds composite-aware sizing (inner_pow), batching helpers, and routes optimization through model-based driver; introduces a circular include with cost_model.hpp. |
| SeQuant/core/optimize/options.hpp | Extends objective enum with peak objectives; adds BatchPolicy, inner_pow, roofline params, and peak_flops_tolerance. |
| SeQuant/core/optimize/optimize.cpp | Wires new optimizer options through runtime dispatch for all objective functions, including peak and batched peak. |
| SeQuant/core/optimize/cost_model.hpp | New: generic DP driver, CostModel concept, additive/peak/batched-peak model implementations, roofline secondary cost, and debug helpers. |
| SeQuant/core/eval/eval.hpp | Changes make_batched_custom_evaluator batch size to a per-index function; adds persistent_only; adds make_evaluator adapter over BatchPolicy. |
| SeQuant/core/batch_policy.hpp | New: shared batching policy container used by both optimizer and evaluator. |
| doc/dev/specs/2026-06-23-roofline-tiebreak-cost.md | New design spec describing the roofline tie-break model and validation plan. |
| doc/dev/specs/2026-06-19-cost-model-phase4-batch-policy-design.md | New design draft for shared BatchPolicy and eval adapter integration. |
| doc/dev/specs/2026-06-18-cost-model-phase3-generic-driver-design.md | New design draft for generic DP driver + CostModel types refactor. |
| doc/dev/specs/2026-06-18-cost-model-batch-aware-design.md | New design draft for batch-aware peak-memory objective formulation and validation strategy. |
| doc/dev/plans/2026-06-19-cost-model-phase4-plan.md | New implementation plan for shared BatchPolicy across optimizer/runtime. |
| doc/dev/plans/2026-06-19-cost-model-phase3-plan.md | New implementation plan for generic driver/model refactor. |
| doc/dev/plans/2026-06-19-cost-model-oldimpl-removal-plan.md | New plan for removing legacy DP implementations after model refactor. |
| doc/dev/plans/2026-06-18-cost-model-batch-aware-plan.md | New plan for peak-memory objective rollout and validation. |
| doc/dev/plans/2026-06-18-cost-model-batch-aware-phase2-plan.md | New plan for batched peak-memory objective (multi-mode DP) rollout. |


Comment on lines +1288 to +1298
[[nodiscard]] auto make_evaluator(BatchPolicy const& policy, F yielder,
                                  ScopeGuardFactory make_scope_guard = {}) {
  auto is_volatile_node = [p = policy.is_volatile_leaf](auto const& n) -> bool {
    if (!n.leaf() || !n->is_tensor()) return false;
    return p && p(n->as_tensor());
  };
  return make_batched_custom_evaluator(
      std::move(yielder), policy.batch_target_size, policy.is_batchable_index,
      std::move(make_scope_guard), std::move(is_volatile_node),
      policy.persistent_only);
}

Member Author

Fixed in 3e5ee5c. make_evaluator now checks both policy.is_batchable_index and policy.batch_target_size; if either is empty it substitutes predicates that decline batching (an accept-nothing predicate, so batch_axis returns nullopt and target_batch_size is never called) rather than forwarding the empty std::functions. This matches the BatchPolicy docs (empty => no batching) and avoids the std::bad_function_call at evaluation time.

Comment thread SeQuant/core/optimize/single_term.hpp Outdated
Comment on lines +62 to +77
template <typename Tot, typename Ixex, typename InnerPow>
double inner_aware_volume(Tot const& tot_idxs, Ixex const& ixex,
                          InnerPow const& inner_pow) {
  double mem = ranges::accumulate(tot_idxs.outer, 1., std::multiplies{}, ixex);
  if (inner_pow) {
    for (auto const& c : tot_idxs.inner) {
      std::size_t k = 0;
      for (auto const& o : tot_idxs.inner)
        if (o.proto_indices() == c.proto_indices()) ++k;
      mem *= inner_pow(c, k);
    }
  } else {
    mem = ranges::accumulate(tot_idxs.inner, mem, std::multiplies{}, ixex);
  }
  return mem;
}

Member Author

Fixed in 3e5ee5c. Added an option_engaged() helper: for std::function/function pointers it reports the engaged (non-empty) state via contextual conversion to bool; for any other callable (e.g. a plain lambda, which has no empty state) it returns true. inner_aware_volume now does if (option_engaged(inner_pow)), so a caller may pass a bare lambda for inner_pow as well as an optionally-empty std::function.
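The behavior described in this reply can be approximated with three overloads. This is an illustrative sketch, not the helper added in 3e5ee5c: the reply describes a test via contextual conversion to bool, whereas the version below, as an assumption, dispatches on the argument type so that lambdas never undergo a pointer-to-bool conversion.

```cpp
#include <cassert>
#include <functional>

// std::function: engaged when it holds a target.
template <typename R, typename... A>
bool option_engaged(std::function<R(A...)> const& f) {
  return static_cast<bool>(f);
}

// Plain function pointer: engaged when non-null.
template <typename R, typename... A>
bool option_engaged(R (*f)(A...)) {
  return f != nullptr;
}

// Any other callable (e.g. a lambda) has no empty state.
template <typename F>
bool option_engaged(F const&) {
  return true;
}
```

Partial ordering of function templates picks the more specialized overload for `std::function` and function-pointer arguments, and a closure type matches only the generic fallback. A guard such as `if (option_engaged(inner_pow))` then compiles for all three kinds of argument, which is what lets a caller pass a bare lambda.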

Comment thread SeQuant/core/optimize/single_term.hpp Outdated
Comment on lines +458 to +462
// The additive arms of single_term_opt<Metric> below are routed through the
// generic CostModel driver. cost_model.hpp includes this header (for the DP
// helpers + EvalSequence + OptRes); the include guards make the cycle a no-op,
// and every helper cost_model.hpp needs is defined above this point.
#include <SeQuant/core/optimize/cost_model.hpp>

Member Author

Fixed in 3e5ee5c by splitting the shared DP helpers (cost counters, OptRes/EvalSequence, init_results, the subset/bipartition scaffolding) into a new single_term_detail.hpp. The dependency graph is now a DAG: single_term_detail.hpp (no optimize deps) <- cost_model.hpp <- single_term.hpp. cost_model.hpp includes only the detail header, and single_term.hpp includes cost_model.hpp before defining the public single_term_opt<Metric> entry points. cost_model.hpp is now standalone-includable — verified by compiling a TU that does #include <SeQuant/core/optimize/cost_model.hpp> first with the real build flags. Also registered cost_model.hpp (previously unlisted) and the new header in CMakeLists.

evaleev added 4 commits June 23, 2026 16:33
Collapse the five positional cost-model parameters of opt::single_term_opt
(is_volatile_leaf, volatile_weight, footprint_weight, peak_flops_tolerance,
roofline) into a single CostParams struct, addressing the review request to
pass the cost knobs as one object rather than many positional arguments.

The detail overload unpacks CostParams into local aliases so the recurrence
arms are unchanged; opt_pure_product builds one CostParams from OptimizeOptions
and forwards it. Behavior is bit-identical (full unit suite: 6375 assertions,
56 cases).
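The parameter-object refactor described in this commit can be illustrated with a toy sketch. Only the five field names come from the commit message; the field types, the defaults, the toy_cost function, and its formula are all assumptions made up for illustration and do not reflect the real recurrence.

```cpp
#include <functional>

// Hypothetical layout: field names from the commit message, types guessed.
struct CostParams {
  std::function<bool(int)> is_volatile_leaf;  // assumed signature
  double volatile_weight = 1.0;
  double footprint_weight = 0.0;
  double peak_flops_tolerance = 0.0;
  bool roofline = false;  // placeholder type; the real knob is likely richer
};

// Toy cost (not the PR's formula): unpack the struct into local aliases so
// the arithmetic reads exactly as it would with positional arguments.
inline double toy_cost(double flops, double footprint, bool leaf_is_volatile,
                       const CostParams& params) {
  const double vw = params.volatile_weight;
  const double fw = params.footprint_weight;
  return flops * (leaf_is_volatile ? vw : 1.0) + fw * footprint;
}
```

Passing one object keeps call sites stable when a knob is added later, which is the motivation the review request gives for the change.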
…ow, empty BatchPolicy)

Three robustness fixes from the PR #559 Copilot review:

- Break the single_term.hpp <-> cost_model.hpp include cycle. The shared DP
  helpers (cost counters, OptRes/EvalSequence, init_results) move into a new
  single_term_detail.hpp; cost_model.hpp includes that instead of single_term.hpp,
  and single_term.hpp includes cost_model.hpp before defining the public
  single_term_opt<Metric> entry points. cost_model.hpp is now standalone-
  includable (verified by compiling a TU that includes it first). Register
  cost_model.hpp (previously unlisted) and single_term_detail.hpp in CMakeLists.

- inner_aware_volume() no longer requires its InnerPow template argument to be
  bool-convertible: an option_engaged() helper reports the engaged state of
  std::function/function pointers and treats any other callable (e.g. a bare
  lambda) as always engaged, so callers may pass a lambda for inner_pow.

- make_evaluator() defensively declines batching when BatchPolicy::is_batchable_index
  or batch_target_size is empty (per the BatchPolicy docs), instead of forwarding
  an empty std::function that would throw std::bad_function_call at eval time.

Full unit suite unchanged: 6375 assertions, 56 cases.
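The third fix, declining to batch when a policy callback is empty, might be sketched as follows. The two member names come from the commit message; the callback signatures, the EvalMode enum, and both helper functions are hypothetical stand-ins, not the SeQuant API.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for BatchPolicy; signatures are assumed.
struct BatchPolicy {
  std::function<bool(const std::string&)> is_batchable_index;
  std::function<std::size_t(const std::string&)> batch_target_size;
};

// Batching is usable only when both callbacks are engaged.
inline bool batching_configured(const BatchPolicy& p) {
  return static_cast<bool>(p.is_batchable_index) &&
         static_cast<bool>(p.batch_target_size);
}

enum class EvalMode { Regular, Batched };

// Decline batching up front rather than forwarding an empty std::function,
// which would throw std::bad_function_call when first invoked.
inline EvalMode choose_eval_mode(const BatchPolicy& p) {
  return batching_configured(p) ? EvalMode::Batched : EvalMode::Regular;
}
```

Checking at construction time turns a late exception during evaluation into a quiet fallback to the regular evaluator.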
Remove the SEQUANT_PEAK_DEBUG-gated peak_batched_debug() diagnostic (a
predicted-vs-unsliced peak report used while diagnosing CSV/PNO composite
sizing) and its call site, plus the now-unused <cstdlib>/<iostream>/<ostream>
includes it required. No effect on the optimizer; full unit suite unchanged
(6375 assertions, 56 cases).
The eval trace's rss= reported this rank's RSS, which is misleading on multiple
ranks (per-rank RSS is uneven; the reader cannot recover total app memory).
Add Logger::eval.rss_reduce, an optional std::function<size_t(size_t)> the
tensor-algebra backend injects (it holds the World) to map a rank-local RSS to
the value to report -- e.g. the sum over all ranks. The eval log path runs on
every rank (printing() is level>0, identical across ranks), so a collective
reducer here is matched. Empty (default) preserves the previous per-rank value.
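A minimal sketch of the optional-reducer pattern this commit describes is shown below. The std::function<size_t(size_t)> type matches the commit message; the RssReduce alias and the reported_rss function are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <functional>

// Type taken from the commit message; the alias name is hypothetical.
using RssReduce = std::function<std::size_t(std::size_t)>;

// Map a rank-local RSS to the value that should be logged. An empty reducer
// keeps the previous behavior and reports the per-rank value unchanged.
inline std::size_t reported_rss(std::size_t local_rss,
                                const RssReduce& rss_reduce) {
  return rss_reduce ? rss_reduce(local_rss) : local_rss;
}
```

In a real multi-rank run the injected reducer would be a collective operation (for example a sum over all ranks), so it must be called on every rank, as the commit message notes.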
evaleev added 7 commits June 24, 2026 09:18
For each single-term factorization chosen by opt_pure_product, log every
intermediate's result footprint AS THE COST MODEL SIZES IT (idx_to_extent +
inner_pow) and the peak (the value DensePeakSize minimizes), gated by env var
SEQUANT_FACTORIZER_DEBUG. Reveals why the factorizer accepts a given
intermediate -- e.g. an under-sized multi-composite tensor whose computed
footprint is far below its realized size. outer{...} lists free-index extents,
inner{Np:e,...} the CSV/PNO composites (proto count N, model extent e).
Reuses tot_indices + opt::detail::inner_aware_volume (same sizing as
footprint_counter). No behavior change when the env var is unset.
The batched evaluator (make_batched_custom_evaluator) evaluates the trigger
plus any cross-term persistent finals (the replay group) in one pass via the
recursive evaluate, which does not emit Term|Begin/End. So those ops appeared
under whichever term first triggered batching, with no header of their own
(a whole residual showed a single Term|Begin and no End). Emit a Term marker
for each batched member around the group's evaluation so the per-op Eval lines
are attributed to their expression(s). Verified: balanced Begin/End counts.
The previous fix emitted one Term|Begin per batched-group member, which read as
a misleading hierarchy (e.g. 8 nested Term|Begin) -- the members are SIBLINGS
co-evaluated over the aux batches, not parent/child. Emit instead a flat
"BatchGroup | Begin | <M> members co-evaluated over <K> aux batches" header,
one "BatchMember | <expr>" per member, and "BatchGroup | End", distinct from
the top-level per-term Term|Begin/End. Verified on benzene: a group's members
list correctly and enclose the interleaved per-op Eval lines.
cache_manager derived persistence purely from the volatility (NV/V)
frontier, blind to batchability. Once batching was decoupled from V/NV
(BatchPolicy::persistent_only defaults false -> batch across the board),
a node whose result carries a batchable axis FREE -- e.g. a half-transformed
DF integral with a free projected-AO index -- is sliced by the runtime
evaluator and priced sliced by the single-term optimizer, yet cache_manager
still classified it persistent and materialized it whole, holding it across
iterations.

Hand cache_manager the same is_batchable_index predicate the optimizer and
runtime evaluator already share, and veto caching of any node whose result
carries an accepted index (neither NP repeat nor P frontier) -- the
structural counterpart of the max_footprint gate, keyed on the batch axis
rather than a byte threshold. Default never_batchable leaves every existing
caller unchanged. Add cache_manager_batch_axis_veto.
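The veto rule can be sketched in simplified form. The names is_batchable_index and never_batchable come from the commit message; representing indices as strings, the may_cache function, and the omission of the NP/P frontier logic are simplifying assumptions.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Assumed signature for the shared predicate.
using IndexPredicate = std::function<bool(const std::string&)>;

// Default predicate: no index is batchable, so nothing is ever vetoed.
inline bool never_batchable(const std::string&) { return false; }

// Hypothetical check: a node's result may be cached only if none of its
// free indices is accepted by the batchability predicate.
inline bool may_cache(const std::vector<std::string>& result_indices,
                      const IndexPredicate& is_batchable_index) {
  return std::none_of(result_indices.begin(), result_indices.end(),
                      is_batchable_index);
}
```

Because the default predicate rejects every index, callers that never configure batching see the same caching decisions as before.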
When make_batched_custom_evaluator declines to batch a node (no batchable
contracted index, persistence gate, no axis-carrying leaf, or a single
batch) whose result nonetheless carries a batchable axis FREE, that node is
built whole by the regular evaluator -- the exact free-axis intermediate the
optimizer priced sliced. Log only those declines, tagged with the gate that
fired, to std::cout under the env var (standalone: does not require the eval
trace level). This distinguishes a held CSE the cache veto already removes
from a genuine transient full build by a non-batched consumer.
PeakBatchedModel charges accumulation_factor * sz(result) on each node that
slices a batchable index -- the K += contribution co-residency an eager backend
(TiledArray) incurs while building a batch-accumulated intermediate. Plumbed
through BatchPolicy / OptimizeOptions / CostParams; default 0.0 (backend-
agnostic, no factorization change). Asserts at most one batchable index when the
factor is nonzero: the per-node, once-per-node charge is single-axis only (the
CC residual is linear in the Hamiltonian, so each term carries one DF aux).

Adds the [bubble] test: the exchange quadratic bubble g.t2.t2 at water-20
extents, measuring early-K (held-whole 4-occ/2-PNO integral) vs late-K t.(gC)
peaks per batch size. They share the peak floor (premium <0.1%), so early-K is
correct -- equal peak plus a persistent build-once flop win; the memory floor
is set by the aux batch size, not the factorization choice.
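The per-node accumulation charge described at the start of this commit message could look roughly like the sketch below. The formula accumulation_factor * sz(result) and the single-axis assertion come from the message; the function name and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: extra memory charged to a node that slices a
// batchable index, modeling co-resident += contributions on eager backends.
inline double accumulation_charge(double accumulation_factor,
                                  double result_size,
                                  std::size_t n_batchable_sliced) {
  // Default factor 0.0 or no sliced axis: no charge, model is unchanged.
  if (accumulation_factor == 0.0 || n_batchable_sliced == 0) return 0.0;
  // The once-per-node charge is only defined for a single batch axis.
  assert(n_batchable_sliced <= 1 && "accumulation charge is single-axis only");
  return accumulation_factor * result_size;
}
```

Keeping the default at 0.0 means the charge is opt-in, so backends that do not accumulate eagerly see no change in the chosen factorization.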
Drop the opt-in batch-decline diagnostic (the decline_log lambda and its four
call sites) added during the batched-eval investigation; it has served its
purpose. No behavior change (it was env-gated and quiet by default).

Development

Successfully merging this pull request may close these issues.

optimize: make the optimization objective a pluggable function object

3 participants