
benchmarks: reusable registry, new model types, new phases, CI smoke #735

Draft
FBumann wants to merge 68 commits into master from benchmark-suite-charter


FBumann (Collaborator) commented May 28, 2026

Overhaul of the internal benchmark suite

Brings benchmarks/ from a hand-wired pytest collection into a
typer-driven suite with cross-version sweeps, per-PR CodSpeed CI, and
plotly views.

Core pieces

  • Reusable model registry. from benchmarks import REGISTRY exposes
    10 self-registering specs (LP / QP / MILP / SOS / piecewise / sparse
    networks / PyPSA carbon) with build(size) + feature & phase tags.
    Adding a model is one file.
  • Phase coverage. Five phases per spec: build, matrices,
    lp_write, netcdf round-trip, solver_handoff (highs/gurobi/
    mosek/xpress). Plus PyPSA end-to-end.
  • One CLI. python -m benchmarks --help covers run / smoke / sweep
    / compare / plot / notebook / memory{save,sweep,compare}.
  • Cross-version sweeps. sweep 0.6.7 0.7.0 builds one uv venv per
    version; same flow for memory via memray.Tracker (model
    construction excluded from the peak).
  • Plotly views. plot <snapshots> picks scaling (1), scatter
    (2, default), or sweep heatmap (3+). --facets {phase, model}
    splits across subplots. Auto-detects timing vs memory JSON.
  • Per-PR CodSpeed + Dependabot loop. CodSpeed CI gives cachegrind-
    attributed regression detection on every PR. [benchmarks] pins
    perf-relevant deps individually (numpy/scipy/xarray/pandas/polars/
    dask/highspy/netcdf4); tooling deps grouped into one monthly PR.
    Each perf-relevant bump → one PR → attributed CodSpeed delta.
  • Notebook walkthrough at benchmarks/notebooks/registry_usage.ipynb,
    executed in CI so examples can't rot.
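The view selection described in the Plotly bullet above can be sketched as a small dispatch on snapshot count. This is only an illustration of the stated rule (1 → scaling, 2 → scatter, 3+ → sweep); the function name pick_view is hypothetical and is not taken from the suite's code.

```python
def pick_view(n_snapshots: int) -> str:
    """Return the default plot view for a given number of snapshots.

    Hypothetical helper: mirrors the rule stated in the description,
    not the actual implementation in benchmarks/cli.py.
    """
    if n_snapshots < 1:
        raise ValueError("need at least one snapshot")
    if n_snapshots == 1:
        return "scaling"  # single snapshot: cost vs. problem size
    if n_snapshots == 2:
        return "scatter"  # baseline vs. candidate
    return "sweep"        # 3+ snapshots: heatmap across versions
```

An explicit --view would simply bypass a helper like this.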

Sample output: 0.6.7 → 0.7.0

Default scatter view — each test at (baseline_time, ratio), top-right
corner = real regressions (slow tests that got slower):

01-timing-scatter

Same comparison, memory dimension (peak RSS via memray.Tracker):

02-memory-scatter

Per-phase delta breakdown (compare --facets phase):

03-timing-compare-phase

Going forward

  • Enable CodSpeed: install the GitHub app on the org, add the repo,
    drop CODSPEED_TOKEN into Actions secrets. Until then the CodSpeed
    job fails but smoke + notebook execution still run.
  • Add a model: drop a file under benchmarks/models/, register it.
    Phase coverage and CLI surface follow automatically.
  • Add a phase: one test_<phase>.py parametrized off the registry.
  • Bump pins: [benchmarks] is the perf-attribution surface;
    [project.dependencies] stays loose so downstream consumers see no
    change.

FBumann and others added 30 commits May 27, 2026 23:04
…smoke

Refactors the internal benchmark suite around a reusable ModelSpec /
REGISTRY pattern so adding a model is one self-registering file with
metadata (features, applicable phases, sizes, optional deps). Other tests
and scripts can import it via `from benchmarks import REGISTRY`.

New model specs cover gaps in the existing coverage:
- milp: general (non-binary) integers (capacitated facility location)
- qp: continuous quadratic objective (diagonal portfolio)
- sos: SOS1 multi-mode generation (Model.add_sos_constraints)
- piecewise: piecewise-linear fuel cost (Model.add_piecewise_formulation)
- masked: sparse-route transportation using mask= on add_variables

SOS and piecewise specs gate their own registration on API availability,
so the suite stays runnable on older linopy.
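A minimal sketch of the self-registering pattern with an API-availability gate, assuming a dataclass-based spec. The names ``ModelSpec`` and ``REGISTRY`` come from the text above; the field names, the ``register`` helper, and the ``OlderModel`` stand-in are assumptions made for illustration and may not match the real module.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelSpec:
    # Field names are illustrative assumptions, not the suite's schema.
    name: str
    build: Callable[[int], object]
    sizes: tuple
    features: frozenset = frozenset()


REGISTRY: dict = {}


def register(spec: ModelSpec, *, available: bool = True) -> None:
    """Add the spec to the registry unless its API gate is closed."""
    if available:
        REGISTRY[spec.name] = spec


class OlderModel:
    """Stand-in for a model class from a release without SOS support."""

    def add_variables(self):
        return None


# Always registered: needs no optional API.
register(ModelSpec("basic", build=lambda n: list(range(n)), sizes=(10, 100)))

# Gated: only registered when the method exists on the model class.
register(
    ModelSpec("sos", build=lambda n: {"n": n}, sizes=(10,),
              features=frozenset({"sos"})),
    available=hasattr(OlderModel, "add_sos_constraints"),
)
```

With this shape, a spec that depends on a missing API never enters the registry, so every test parametrized off the registry skips it without extra logic.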

New phase tests:
- test_solver_handoff.py: parametrizes lp.io.to_highspy/to_gurobipy/
  to_mosek/to_xpress across applicable models, skipping per-solver when
  the solver isn't installed. Uses stable lp.io wrappers (not the new
  Solver.from_name API) for backward compatibility.
- test_netcdf.py: separate to_netcdf / read_netcdf benchmarks.

CI: new benchmark-smoke.yml runs the suite under --quick
--benchmark-disable on PRs, so refactors that break a model spec get
caught early. Timings stay off CI (~35s smoke locally, no regression
tracking).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default ``pytest benchmarks/`` run now skips the slowest 1-2 sizes per
spec (e.g. knapsack at 1M, basic at 1600, pypsa_scigrid at >50) so a full
timing pass completes in ~2 minutes instead of 20-45.

ModelSpec grows a ``long_threshold`` mirror of ``quick_threshold``:

- ``--quick``  → ``size <= quick_threshold``  (CI smoke)
- default      → ``size <= long_threshold``   (medium-cost regression)
- ``--long``   → no cap                       (full sweep)
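The three tiers above reduce to a single predicate per (size, mode) pair. A minimal sketch, assuming the thresholds are plain integers on the spec; the function name ``is_selected`` and the example sizes are hypothetical.

```python
def is_selected(size: int, quick_threshold: int, long_threshold: int,
                mode: str = "default") -> bool:
    """Decide whether a size runs in the given tier (hypothetical helper)."""
    if mode == "quick":
        return size <= quick_threshold
    if mode == "default":
        return size <= long_threshold
    if mode == "long":
        return True  # no cap
    raise ValueError(f"unknown mode: {mode!r}")


# Example sizes are made up for illustration.
sizes = (10, 100, 1600)
quick = [s for s in sizes if is_selected(s, 10, 100, "quick")]
default = [s for s in sizes if is_selected(s, 10, 100, "default")]
full = [s for s in sizes if is_selected(s, 10, 100, "long")]
```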

Verified locally:
- --quick: 227 passed / 230 skipped / 35s
- default: 333 passed / 124 skipped / 45s
- --long : 457 passed /   0 skipped / 2m13s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop pypsa_scigrid from --quick entirely (quick_threshold=0). PyPSA
  import + example loading dominates the smoke wall-clock; the model
  still runs in default and --long modes.
- Lower every other spec's quick_threshold to its smallest size, so
  --quick exercises one size per model across all phases. The default
  tier (which uses long_threshold) still gives broad regression coverage.

Verified locally:
- --quick:  85 passed / 372 skipped / 18.5s   (was 35s)
- default: 333 passed / 124 skipped / 44.8s   (unchanged)
- --long : 457 passed /   0 skipped / 2m11s   (unchanged)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
benchmarks/notebooks/registry_usage.py is the canonical walkthrough for
the model registry. Authored in jupytext percent format so it triples
as:

- runnable Python script (CI executes it on every PR)
- notebook in JupyterLab / VSCode with the jupytext extension
- readable doc on GitHub (markdown cells render directly)

Covers: import, lookup by name, iterate, filter_by feature/phase,
parametrize-your-own-pytest pattern, one-off tracemalloc profiling,
and the three CLI tiers.

CI: benchmark-smoke.yml gains an "Execute registry-usage notebook" step
right after the pytest smoke — so doc rot fails the build instead of
hiding until someone next opens the file.

README: new "Worked walkthrough" subsection points at the notebook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace jupytext-style ``registry_usage.py`` with a proper
  ``registry_usage.ipynb`` — matches the repo convention (examples/*.ipynb,
  nbsphinx, nbstripout). CI executes it via ``jupyter nbconvert --execute``.
- Add ``__repr__`` (one-line summary) and ``_repr_html_`` (attribute table)
  to ModelSpec. Visible in pytest -v output, in interactive Python, and as
  rich HTML in Jupyter cells.
- Notebook simplified to lean on the new reprs: explicit-attribute prints
  in sections 2-5 replaced by bare expression evaluations.
- README points at the .ipynb and notes the "launch jupyter from repo root"
  convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``python -m benchmarks <command>`` with typer subcommands:

- list / show / filter — introspect the registry
- smoke               — pytest --quick --benchmark-disable (CI)
- run [--long --phase --model --filter --json]
                      — pytest --benchmark-only with knobs
- notebook            — execute the registry-usage notebook
- memory save/compare — replaces the argparse main in memory.py

Modern typer style throughout: Annotated[...] for every parameter,
Literal[...] for the --phase choice, function docstrings for command
help. ``--help`` is auto-generated and is the source of truth — README
and the notebook just point at it instead of duplicating the menu.

CI smoke now calls ``python -m benchmarks smoke`` and
``python -m benchmarks notebook``. memory.py keeps its save/compare
functions but loses the argparse layer. typer added to the [benchmarks]
extra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups after checking typer's docs:

- Pin typer to the latest release (==0.26.2) in the [benchmarks] extra,
  so the CLI's behaviour is reproducible across dev / CI / contributor
  machines.

- Switch ``smoke`` and ``run`` from the ``extra: list[str]`` argument
  to the idiomatic ``typer.Context`` + ``context_settings`` pattern
  (allow_extra_args, ignore_unknown_options). With the old style, any
  trailing ``--flag`` would be parsed as an unknown option and rejected;
  with ctx.args, ``python -m benchmarks run --long -- --tb=short -x``
  actually works.

Other patterns already match typer's recommended style: Annotated[...],
Literal for choice params, docstrings for command help, sub-apps via
add_typer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two layers of pinning for stable measurement:

- ``[benchmarks]`` extra in pyproject pins the test infra exactly
  (pytest, pytest-benchmark, pytest-memray, pypsa, highspy, netcdf4,
  nbconvert, typer). Loose enough that the sweep workflow can install
  varying linopy versions on top.

- ``benchmarks/requirements.lock`` is the full transitive resolution
  (numpy, scipy, pandas, xarray, plus everything else). Generated via
  ``uv pip compile --no-emit-package linopy`` so the lockfile pins the
  *environment around linopy* without pinning linopy itself — that lets
  the same lockfile work for both current-tip regression runs and
  cross-version sweeps.

README clarifies that the lockfile gives consistency over time on the
same machine, not absolute reproducibility across machines (CPU / cache
/ memory bandwidth still matter).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks sweep 0.5.0 0.6.0 0.7.0`` builds a fresh venv
per version with uv, installs the benchmark infra (lockfile by default,
or the [benchmarks] pinned subset with --no-use-lock) plus the target
linopy in a single resolution pass, and runs the suite. Snapshots land
in ``<output-dir>/linopy-<version>.json``.

Useful for bootstrapping a perf history against published linopy
releases. The current benchmark code runs against each linopy version
(constant measurement layer); the ``_API_AVAILABLE`` gates on sos /
piecewise specs make older linopy versions skip those phases gracefully.

Verified locally: ``sweep 0.7.0 --quick --no-use-lock`` runs end-to-end
in ~2 min (uv installs 57 packages in 200ms; the rest is the benchmark
run). Plain releases (0.4.0) and pip specs (git+https://...) both work
via the ``_linopy_install_spec`` helper.
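One way a helper like ``_linopy_install_spec`` could distinguish plain releases from pip specs is sketched below. This is an assumption about its behaviour, not the actual implementation: the regex, the function name ``linopy_install_spec_sketch``, and the example URL are all hypothetical.

```python
import re

# Matches plain release numbers such as "0.4.0" or "0.7.0rc1" (assumed rule).
_PLAIN_VERSION = re.compile(r"^\d+(\.\d+)*([a-z]+\d*)?$")


def linopy_install_spec_sketch(target: str) -> str:
    """Turn a sweep target into something pip/uv can install."""
    if _PLAIN_VERSION.match(target):
        return f"linopy=={target}"  # published release: pin it exactly
    return target                   # anything else is passed through as-is


plain = linopy_install_spec_sketch("0.4.0")
# Placeholder URL, used only to show the pass-through branch.
passthrough = linopy_install_spec_sketch("git+https://example.invalid/linopy.git")
```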

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README previously duplicated content from three sources:
- the notebook (models table, registry-usage code blocks)
- ``--help`` (quick-reference command list)
- a stale memory.py invocation (since replaced by ``memory save/compare``)

After the consolidation each surface has a clear single job:

- README: 1-paragraph what, setup (uv sync / lockfile), size-tier table
  (architectural), pointers to the notebook + ``--help``, metrics blurb.
- ``notebooks/registry_usage.ipynb``: the walkthrough — registry import,
  lookup / iterate / filter, parametrize your own pytest, profiling.
- ``python -m benchmarks --help``: command reference, autogenerated by
  typer from docstrings / Annotated[..., Option(...)] declarations.

Drops ~140 lines from the README; nothing actually disappears — it just
lives in the one place that owns it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pypsa removed from the [benchmarks] pinned set, from the sweep
``--no-use-lock`` install list, and from the lockfile. The
``test_pypsa_carbon_management.py`` module uses ``pytest.importorskip``
so collection no longer fails without pypsa; ``pypsa_scigrid`` already
had ``requires=("pypsa",)`` so its phase tests skip gracefully.
Install pypsa separately when you want those benchmarks.
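The ``requires=("pypsa",)`` gate can be approximated with the standard library alone. The sketch below only shows the idea of checking optional dependencies before running a spec; the helper name ``missing_requirements`` is hypothetical, and the real suite may rely on ``pytest.importorskip`` or another mechanism instead.

```python
import importlib.util


def missing_requirements(requires):
    """Return the top-level packages from `requires` that cannot be imported."""
    return tuple(
        name for name in requires
        if importlib.util.find_spec(name) is None
    )


# "json" ships with Python; the second name is deliberately nonexistent.
ok = missing_requirements(("json",))
missing = missing_requirements(("definitely_not_installed_pkg_xyz",))
```

A phase test could then skip with a message listing the missing packages whenever the returned tuple is non-empty.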

Notebook (registry_usage.ipynb) rewritten as a proper operator guide:

- Architecture overview + per-phase measurement table up front.
- Registry walkthrough (lookup / iterate / filter) kept as the spine.
- Reuse patterns (parametrize-your-own-pytest, tracemalloc spot check).
- ``Running`` section now embeds ``--help`` output live via a
  ``show_help()`` helper that shells out to ``python -m benchmarks ...
  --help``. The doc stays in sync with the typer implementation
  automatically — change a flag in cli.py, re-run the notebook,
  documentation updates.
- New sections cover timing snapshots, memory snapshots, the
  cross-version sweep, and lockfile regeneration.

README gains an explicit "pypsa is optional" note in setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rough

Mirrors ``run``'s filter knobs and applies them to every version's
pytest invocation. Also switches to the ``typer.Context`` +
``context_settings`` pattern so trailing args after ``--`` are
forwarded to pytest verbatim (same shape ``smoke`` / ``run`` use).

    python -m benchmarks sweep 0.6.7 --phase build --model basic
    python -m benchmarks sweep 0.6.7 -- --tb=short -x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks compare a.json b.json [-- --columns=...]``
shells out to ``pytest-benchmark compare`` so the whole suite stays
under one entry point. Accepts any number of snapshots; first is the
baseline.

When called with no arguments — or with paths that don't exist — it
prints a copy-paste-ready list of snapshots found under
``.benchmarks/`` (including ``.benchmarks/sweep/`` for cross-version
runs). If nothing's saved yet, points at the ``run --json`` flow.

For memory snapshots use ``memory compare`` instead — different
format, different tool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-paste)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ame)

pytest-benchmark's own default emits 10 columns side-by-side, which is
unreadable for any non-trivial comparison. Wrapper now prepends
``--columns=median,iqr --sort=name`` so the table is two stats wide
and the (baseline, candidate) pair of each test sits together
alphabetically.

Defaults are only applied when the user hasn't already set the flag,
so trailing pass-through overrides still work.
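A minimal sketch of "apply a default only when the user has not set the flag", assuming flags are passed either as ``--flag=value`` or as ``--flag value``. The helper name ``with_defaults`` is hypothetical; the default values are the ones named in this commit message.

```python
DEFAULTS = {"--columns": "median,iqr", "--sort": "name"}


def with_defaults(user_args, defaults=DEFAULTS):
    """Prepend each default flag unless the user already passed it."""

    def is_set(flag):
        return any(a == flag or a.startswith(flag + "=") for a in user_args)

    prepended = [
        f"{flag}={value}" for flag, value in defaults.items()
        if not is_set(flag)
    ]
    return prepended + list(user_args)


untouched = with_defaults([])
overridden = with_defaults(["--sort=min"])
```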

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arg split

Two fixes for the ``compare`` UX surfaced by the cross-version sweep:

- Default to ``--group-by=fullname`` so each test gets its own mini
  table showing (baseline, candidate) side-by-side with the
  parenthesized auto-ratio per column. Easy to scan ``(>1.10)`` for
  regressions in the median column. Combined with the existing
  ``--columns=median,iqr --sort=name`` defaults, the output goes from
  10-columns-wide-on-one-line to a focused two-column per-test view.

- Switch ``compare`` away from a positional ``list[Path]`` argument and
  parse ``ctx.args`` by hand: typer's positional list was greedily
  grabbing trailing ``--group-by=fullname`` etc. (and the ``--``
  separator didn't escape it either). Now arg-splitting is explicit:
  anything starting with ``-`` is pytest-benchmark pass-through,
  everything else is a snapshot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tried switching to the canonical typer pattern (``--`` separator for
pass-through) but typer's positional ``list[Path]`` + ``allow_extra_args``
still greedily ate the trailing options. There's no clean typer/click
idiom for "list-typed positional + pass-through" — workarounds are
manual splitting, bounding the positional count, or named flags.

Manual splitting is the most pragmatic: snapshots come first, once we
see any flag-like token the rest is forwarded to pytest-benchmark.
That preserves things like ``--histogram=/tmp/hist/cmp`` (built-in
SVG-per-test plotting), ``--csv=out.csv``, ``--group-by=fullname``,
and the value-taking flags whose value doesn't start with ``-``.
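The manual split described above can be sketched in a few lines: snapshot paths come first, and everything from the first flag-like token onward is forwarded. The function name ``split_compare_args`` is hypothetical.

```python
def split_compare_args(args):
    """Split CLI tokens into (snapshot paths, pass-through options)."""
    for i, token in enumerate(args):
        if token.startswith("-"):
            # First flag-like token: it and everything after it is forwarded,
            # including values that do not start with "-".
            return list(args[:i]), list(args[i:])
    return list(args), []


snapshots, passthrough = split_compare_args(
    ["a.json", "b.json", "--csv=out.csv", "--group-by", "fullname"]
)
```

Because the split happens at the first flag, a value-taking flag such as ``--group-by fullname`` keeps its value on the pass-through side rather than being mistaken for a snapshot path.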

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three opinionated interactive HTML views over pytest-benchmark JSONs,
auto-picked from snapshot count or set explicitly via ``--view``:

- **compare** (2 snapshots) — horizontal bar chart of per-test median
  delta, sorted by magnitude, green→red colormap. The "did this PR
  regress anything?" picture in one glance, vs pytest-benchmark's
  60-individual-SVGs which are useless for that workflow.
- **sweep** (3+ snapshots) — heatmap of median ratio relative to the
  first snapshot, rows = tests, columns = labels. Pairs with the
  ``sweep`` subcommand.
- **scaling** (1 snapshot) — log-log median vs ``n`` for
  size-parametrized tests (e.g. ``[basic-n=10..1600]``), faceted by
  phase. Shows whether linopy's complexity scales as expected.

plotly==6.7.0 pinned in [benchmarks]; lockfile regenerated. plotly is
lazy-imported inside ``plot`` so the rest of the suite stays usable
without it (with a clear error if a user tries ``plot`` and it's
missing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er absolutes

- New ``benchmarks/plotting.py`` module owns the three views
  (``plot_compare`` / ``plot_sweep`` / ``plot_scaling``) plus a
  ``RENDERERS`` dispatch dict. cli.py drops ~140 lines and just imports
  ``PlotView`` + ``RENDERERS``; plotly is still lazy-loaded inside the
  view functions so importing the module without plotly works.

- ``compare`` bar chart and ``sweep`` heatmap now use ``text_auto``
  so values render inside each bar / cell.

- Hover info upgraded:
  - compare hover shows the per-test median of *both* snapshots
    (formatted to 4 significant figures) in addition to the delta %.
  - sweep hover shows the absolute median (s) alongside the ratio, via
    a customdata + hovertemplate plumbed through ``update_traces``.

scaling view already shows the absolute median on hover by virtue of
being a line chart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ride

For microbenchmarks the lowest observed time is closest to the "true"
cost — background noise (GC, context switches, thermal throttling) can
only slow things down. pytest-benchmark's own ``--sort`` default is
``min`` for the same reason; LLVM's perf guide, JMH, Google Benchmark
and Alexandrescu's "Speed is found in the minds of people" all argue
similarly.

Changes:

- ``plot`` defaults to ``--metric min`` (was median). Accepts
  ``--metric median|mean|max`` to override. The metric drives the bar
  values, heatmap ratios, scaling-curve y-axis, and the hover labels.
- ``plot_compare`` / ``plot_sweep`` / ``plot_scaling`` in
  ``benchmarks/plotting.py`` all take a ``metric: Metric = "min"`` arg.
- ``compare`` table defaults to ``--columns=min,iqr --sort=min`` (was
  median,iqr / name). The auto-ratios next to each ``min`` flag
  regressions in the same readable form.
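A toy illustration of the argument for ``min``, under the stated assumption that noise is strictly additive. The timings are invented, and the "true cost" is known here only because the data is synthetic.

```python
import statistics

TRUE_COST = 1.00                       # seconds; known only in this toy setup
noise = [0.00, 0.02, 0.01, 0.30, 0.05]  # non-negative by assumption
samples = [TRUE_COST + n for n in noise]

summary = {
    "min": min(samples),
    "median": statistics.median(samples),
    "mean": statistics.fmean(samples),
}
# Distance of each statistic from the true cost.
errors = {name: abs(value - TRUE_COST) for name, value in summary.items()}
```

With one noisy outlier in the samples, ``min`` stays at the true cost, while the median and especially the mean drift upward.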

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For a suite where test costs span six orders of magnitude (knapsack
microsecond builds vs PyPSA carbon at 2.4 s), sorting by % delta
overweights cheap tests: a 100% regression on a 1µs test ranks above
a 1% regression on the 2.4 s test, even though the absolute impacts
are 1µs vs 24ms.

Changes:

- Default sort is now ``absolute`` (``b - a`` in seconds). Bar values
  are the time delta with SI-prefix formatting on the x-axis (24 ms,
  240 µs, etc.). Big actual-time impacts float to the bottom.
- ``--sort relative`` keeps the old percent behaviour.
- Both ``delta_abs`` and ``delta_pct`` are surfaced in hover regardless
  of which one drives the sort, so you can read off whichever lens.
- ``plot_sweep`` / ``plot_scaling`` accept a ``sort`` arg for uniform
  signature but ignore it (no two-snapshot diff there).
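The two sort orders can be compared on a pair of made-up tests that mirror the example in this message. The test names and timings are hypothetical; ``delta_abs`` and ``delta_pct`` follow the definitions above (``b - a`` in seconds, and that difference as a percentage of ``a``).

```python
# (baseline seconds, candidate seconds); values invented for illustration.
tests = {
    "knapsack_build": (1e-6, 2e-6),   # +100 %, but only +1 µs
    "pypsa_carbon": (2.4, 2.424),     # +1 %, but +24 ms
}


def deltas(baseline, candidate):
    delta_abs = candidate - baseline
    delta_pct = 100.0 * delta_abs / baseline
    return delta_abs, delta_pct


rows = [(name, *deltas(a, b)) for name, (a, b) in tests.items()]
by_relative = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)]
by_absolute = [r[0] for r in sorted(rows, key=lambda r: r[1], reverse=True)]
```

The relative sort puts the microsecond test first; the absolute sort puts the 24 ms change first.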

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The compare bar chart forces a choice between sorting by relative %
or absolute Δ. Both have blind spots: pure-relative makes microbenchmark
noise look catastrophic, pure-absolute hides real algorithmic regressions
on fast paths.

The two-axis scatter resolves the tension visually. Per test:

- x = baseline time (log scale)
- y = candidate / baseline ratio
- colour = absolute Δ

A point is a real regression worth chasing only when it sits in the
top-right — slow tests that got slower. Top-left (high ratio, tiny
absolute) reads as microbenchmark noise; bottom-right (high absolute,
ratio ≈ 1) was already slow and didn't change. A dashed reference line
at ``y=1`` makes "no change" trivial to see.
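The quadrant reading above can be written down as a small classifier. The cut-offs are assumptions chosen for illustration (a 0.1 s "slow" boundary and a 1.10 ratio boundary); the plot itself draws no such hard lines, and the helper name ``classify`` is hypothetical.

```python
def classify(baseline_s: float, ratio: float,
             slow_cutoff_s: float = 0.1, ratio_cutoff: float = 1.10) -> str:
    """Label a scatter point by quadrant (illustrative thresholds)."""
    slow = baseline_s >= slow_cutoff_s
    worse = ratio >= ratio_cutoff
    if slow and worse:
        return "regression"          # top-right: slow test that got slower
    if worse:
        return "likely noise"        # top-left: big ratio, tiny absolute cost
    if slow:
        return "slow, not regressed"  # bottom-right: was already slow
    return "unremarkable"            # bottom-left


labels = [
    classify(2.4, 1.5),
    classify(1e-5, 2.0),
    classify(2.4, 1.0),
    classify(1e-5, 1.0),
]
```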

The view is never auto-picked (compare wins for 2 snapshots);
pass ``--view scatter`` explicitly to get it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two-axis scatter now scales beyond a single baseline-vs-candidate
pair. With N>=3 inputs the first is still the baseline (reference); each
subsequent snapshot becomes one animation frame. Use the slider / play
button to scrub through versions and watch tests drift across releases.

Implementation:

- First snapshot is the baseline. Skipped from the frame set (would
  trivially be y=1 everywhere).
- Each subsequent snapshot contributes points at (baseline_time,
  ratio, Δ) per overlapping test. ``animation_frame="version"`` does
  the per-frame slicing; ``category_orders`` preserves input order in
  the slider so the timeline reads left→right.
- ``range_x`` / ``range_y`` are pinned to the global min/max so the
  camera doesn't jump between frames.
- 2 inputs still produces a static scatter (no animation overhead).

Considered ``facet_col`` but it gets cramped past ~4 versions — the
slider scales to arbitrary length.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… |Δ|

Two small but high-value tweaks for the multi-snapshot scatter:

- The baseline snapshot now contributes its own animation frame where
  every point sits at ratio=1, Δ=0. Gives the animation a "before
  anything happened" anchor: hit play and watch points drift from the
  baseline horizon outward. Previously the first frame was the first
  candidate, which made the visual feel as if it started mid-story.

- ``range_color`` is pinned to the 95th-percentile absolute Δ
  (±p95). One huge outlier no longer drags the colour scale and
  flattens everyone else to white; outliers saturate at the bound,
  the rest of the distribution stays readable. Colour-bar label notes
  ``Δ (s, p95-clipped)`` so the convention is explicit.
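A sketch of the p95 clipping idea using a nearest-rank percentile, which is an assumption; the real code may compute the percentile differently (for example with numpy or pandas interpolation). The helper names and the sample deltas are hypothetical.

```python
import math


def p95_bound(deltas, floor=1e-9):
    """Nearest-rank 95th percentile of |delta|, never below `floor`."""
    magnitudes = sorted(abs(d) for d in deltas)
    if not magnitudes:
        return floor
    rank = max(1, math.ceil(0.95 * len(magnitudes)))
    return max(magnitudes[rank - 1], floor)


def clip(delta, bound):
    """Saturate a delta at the symmetric colour bound."""
    return max(-bound, min(bound, delta))


# Nineteen small deltas plus one huge outlier (made-up data).
deltas = [0.001 * i for i in range(1, 20)] + [5.0]
bound = p95_bound(deltas)
range_color = (-bound, bound)
```

Here the outlier of 5.0 s no longer sets the scale: it saturates at the bound, and the remaining points spread over the full colour range.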

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "no change" line sits at y=1, but with asymmetric data (e.g. some
2x regression, no symmetric speedup), it landed near the bottom of the
visible range and improvements got squeezed near the floor.

Now: ``max_dist = max(|1 - y_lo|, |y_hi - 1|)`` and ``range_y = [1 -
max_dist, 1 + max_dist]``. Pure min/max coverage (no clipping) but the
window is symmetric around 1.0, so regressions above and improvements
below are equally readable regardless of the data skew.
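The formula above, written out as a runnable sketch. The function name ``symmetric_range_y`` and the sample ratios are hypothetical.

```python
def symmetric_range_y(ratios, center=1.0):
    """Smallest window symmetric around `center` that covers all ratios."""
    y_lo, y_hi = min(ratios), max(ratios)
    max_dist = max(abs(center - y_lo), abs(y_hi - center))
    return [center - max_dist, center + max_dist]


# One 2x regression and a mild 10 % improvement: the window is widened
# below 1.0 so the "no change" line sits in the middle.
window = symmetric_range_y([0.9, 1.0, 2.0])
```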

The colour scale keeps the p95-clipped centred-at-0 treatment from the
previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h warn

Three safe fixes from a code review of benchmarks/plotting.py:

- Row-height multiplier 14 → 22 in plot_compare and plot_sweep, with
  the floor bumped from 400 to 500. At 25+ tests the y-axis labels
  were colliding; now they breathe.
- plot_scaling reads ``params.size`` (the cleanly-stored int from
  parametrize) and only falls back to the id regex if absent. The
  ``model`` name still needs the regex because pytest-benchmark
  serializes our ModelSpec as ``UNSERIALIZABLE[ModelSpec(...)]``, so a
  full params switch isn't possible here — but the size path is now
  robust to test-id rename.
- plot_compare surfaces the mismatch between snapshots: prints a
  stderr line with the test counts only in A / only in B / common,
  and embeds the same as a subtitle in the figure. Silent intersection
  was the worst-case footgun.
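The "``params.size`` first, id regex as fallback" lookup from the second fix can be sketched as follows. It assumes each benchmark entry is a dict with optional ``params`` and ``name`` keys, as in pytest-benchmark's JSON output; the helper name ``bench_size``, the regex, and the sample entries are hypothetical.

```python
import re

_SIZE_IN_ID = re.compile(r"n=(\d+)")


def bench_size(entry):
    """Return the problem size for one benchmark entry, or None."""
    params = entry.get("params") or {}
    size = params.get("size")
    if size is not None:
        return int(size)           # preferred: the stored parameter
    match = _SIZE_IN_ID.search(entry.get("name", ""))
    return int(match.group(1)) if match else None  # fallback: parse the id


from_params = bench_size({"name": "test_build[renamed]", "params": {"size": 100}})
from_id = bench_size({"name": "test_build[basic-n=1600]", "params": None})
unknown = bench_size({"name": "test_build[basic]"})
```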

Skipped (per review note): the default-view swap for 3+ snapshots
(sweep → scatter) is a judgement call left for the user. The default
output filename change (the current default is clobbered on each run)
is also skipped; the user wants to decide whether per-view filenames
are worth the API change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…view>.html

Two coupled changes setting up the notebook-embedding path:

- ``plot_compare`` / ``plot_scatter`` / ``plot_sweep`` / ``plot_scaling``
  in ``benchmarks/plotting.py`` now return a ``plotly.graph_objects.Figure``
  instead of writing to disk + returning a count. The CLI does the
  ``fig.write_html(output)`` step. ``benchmarks.plotting.n_points(fig)``
  is exported as a helper so the CLI still emits a "N points → path"
  status line.

  This unblocks rendering plots directly in jupyter — call
  ``plot_compare(...)`` and Jupyter's display hook renders the Figure
  inline.

- Default ``-o`` for ``plot`` is now ``.benchmarks/plots/<view>.html``
  (was ``benchmark-plot.html`` in cwd). Matches where snapshots already
  land (and is gitignored), and the per-view filename means consecutive
  runs of different views don't clobber each other.

Bonus: fixes two ``numpy_array or fallback`` bugs, one in scatter
(``df.abs().max() or 1e-9``) and one in the new ``n_points`` helper
(``trace.x or trace.z``). Both triggered ``ValueError: The truth value
of an array with more than one element is ambiguous``. Replaced with
explicit ``is None`` checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ospection

The ``n_points(fig)`` helper added in the previous commit walked
``fig.data`` traces and called ``len(trace.x)`` to recover the test
count. That's backwards — the count is sitting right there in the
source DataFrame at render time, no need to reach into the rendered
plot.

Renderers now return ``tuple[Figure, int]`` directly. ``len(df)`` for
compare / sweep / scaling; ``df["test"].nunique()`` for scatter
(rows are per-(test, version) so the raw len double-counts).

n_points helper dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two refinements to the end-to-end plotting section:

- tqdm wraps the subprocess loop that generates the two snapshots.
  Each ``--quick --phase build`` run takes ~10 s; tqdm makes the
  ~20 s wait visible. ``tqdm.auto`` auto-picks the notebook widget
  vs console bar based on context.

- Plots are now rendered via ``python -m benchmarks plot --view <name>``
  rather than direct ``plot_compare`` / ``plot_scatter`` imports.
  A small ``cli_plot(view, snapshots)`` helper runs the subprocess,
  reads the generated HTML, and inlines it via ``IPython.display.HTML``.
  Demonstrates the actual user-facing CLI path inside the notebook
  rather than the internal API.

Notebook end-to-end runtime: ~37 s (~33 s for the run loop + plotting
overhead).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FBumann and others added 4 commits May 28, 2026 18:04
…r row

Four coupled fixes after --facets model produced empty bars across
every facet:

1. Phase-aware short y-labels. Facet by phase → ``model-n=N``; facet
   by model → ``phase-n=N``. The facet header already encodes the
   other dimension.
2. Independent y-axes per facet (``matches=None``). Each facet's
   y-axis lists only its own categories — no empty rows for tests
   that belong to other facets.
3. Shared y-tick labels per row via ``_hide_non_leftmost_yticks``.
   Hidden on every facet except the leftmost column of the wrap grid
   so labels appear once per row, not at every subplot's edge.
4. Per-facet height calculation. Without faceting we sized for total
   bar count; with independent per-facet category sets we now size
   for ``max-rows-in-any-facet × n-facet-rows``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small UX wins for plot consumers:

- Default view for 2 snapshots is now ``scatter`` (was ``compare``).
  Each test sits at its own ``(baseline_time, ratio)`` coords with
  hover labels — no aggregation, so size-semantics mismatches across
  models (n=100 for ``basic`` vs ``pypsa_scigrid``) don't muddy the
  picture. ``compare`` (delta bars) is still one ``--view compare``
  away.
- Re-export ``load_long_df`` from ``benchmarks`` so callers can grab
  the tidy DataFrame in one line without importing the plotting
  module. ``df, unit = load_long_df([Path(...), Path(...)])``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pins

Three pieces, one PR:

- ``memory sweep <versions>...`` mirrors the timing sweep but runs
  ``memray.Tracker`` measurements inside each per-version uv venv.
  Output lands at ``.benchmarks/memory/linopy-<ver>.json`` and ``plot``
  auto-detects the ``peak_mib`` shape.

- Symmetric override flags so cross-version sweeps can use uniform
  measurement counts: ``--rounds N`` on ``sweep`` / ``run`` forces
  ``--benchmark-min-rounds=N --benchmark-max-time=0`` (default is
  pytest-benchmark's per-test auto-tuning), and ``--repeats N`` on
  ``memory sweep`` / ``memory save`` takes min-of-N peak per
  measurement (default 1; memory peaks are mostly deterministic).

- Single source of truth for sweep dep pins. Both sweeps now read the
  ``[benchmarks]`` extra from ``pyproject.toml`` at runtime via
  ``_benchmarks_extra_pins()`` instead of duplicating two short pin
  lists in ``cli.py``. Bumping the extra propagates automatically.
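The two override flags from the second item can be sketched as small helpers. The pytest-benchmark options are the ones named above; the function names ``rounds_args`` and ``min_of_repeats`` and the fake measurement values are hypothetical.

```python
def rounds_args(rounds=None):
    """Extra pytest args for a fixed round count (empty = auto-tuning)."""
    if rounds is None:
        return []
    if rounds < 1:
        raise ValueError("rounds must be >= 1")
    return [f"--benchmark-min-rounds={rounds}", "--benchmark-max-time=0"]


def min_of_repeats(measure, repeats=1):
    """Call `measure` `repeats` times and keep the smallest peak."""
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    return min(measure() for _ in range(repeats))


# Fake peaks in MiB standing in for real memory measurements.
fake_peaks = iter([12.0, 10.5, 11.0])
peak = min_of_repeats(lambda: next(fake_peaks), repeats=3)
```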

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-PR cachegrind-based regression detection plus a feedback loop that
attributes upstream dep perf changes to specific Dependabot bumps:

- New ``codspeed`` job in ``benchmark-smoke.yml``: runs ``pytest
  --quick --codspeed`` on every PR via the CodSpeedHQ action. Needs
  a ``CODSPEED_TOKEN`` repo secret to post comments; without it the
  job fails gracefully.

- ``--quick`` now skips PyPSA end-to-end via a collection hook in
  ``conftest.py``. The PyPSA network is ~30s native; cachegrind would
  make it minutes, and the signal CodSpeed is meant to catch lives in
  the micro paths.

- Pin tiering in ``[benchmarks]``: perf-relevant deps (``numpy``,
  ``scipy``, ``xarray``, ``pandas``, ``polars``, ``dask``, ``highspy``,
  ``netcdf4``) get individual ``==`` pins so each Dependabot bump
  produces one attributed CodSpeed delta. Tooling deps (``pytest`` &
  plugins, ``nbconvert``, ``typer``, ``plotly``) are also pinned but
  grouped in ``dependabot.yml`` so they batch into a single PR.

- Loose ``[project.dependencies]`` stays untouched — downstream linopy
  consumers keep their existing resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FBumann (Collaborator, Author) commented May 28, 2026

It went so well that we could consider wiring up https://codspeed.io/. I would like to test it.
@FabianHofmann you would need to approve installing their app on the repo. But maybe a quick chat would be good.

FBumann and others added 23 commits May 28, 2026 22:27
- Drop ``benchmarks/requirements.lock`` and the ``--use-lock/--no-use-lock``
  toggle on ``sweep`` and ``memory sweep``. The ``[benchmarks]`` extra in
  pyproject already pins every measurement-relevant direct dep
  (numpy/scipy/xarray/pandas/polars/dask/highspy/…) — uv resolves the same
  pyproject input deterministically into each per-version venv, so "same
  deps, only linopy varies" comes for free without maintaining a separate
  lockfile.
- Relocate the walkthrough out of ``benchmarks/notebooks/`` to
  ``benchmarks/walkthrough.md`` (next to README), and delete the now-empty
  notebooks/ directory along with the obsolete ``registry_usage.ipynb``.
- Replace ``jupyter nbconvert`` with Jupytext for the walkthrough. The
  ``notebook`` subcommand now executes the ``.md`` directly; ``--build``
  regenerates a gitignored sibling ``walkthrough.ipynb`` for editor-agnostic
  viewing (one-way conversion — no bidirectional pairing, so PyCharm/VSCode
  work the same as JupyterLab). Add ``jupytext==1.17.4`` to the
  ``[benchmarks]`` extra.
- Condense ``benchmarks/README.md`` from 128 → 45 lines: scope/install/lockfile
  rationale + walkthrough launch only. Phase coverage, CLI surfaces, metric
  rationale, memory commands, and the extending guide now live in the
  walkthrough.
- ``list --details`` and ``show`` now use ``typer.secho`` for dim headers /
  cyan spec names / dim attribute labels (auto-strips when piped — ``list |
  grep`` stays clean).
- Drop the "microbenchmark" framing where it overreached: the suite has
  millisecond-to-second-scale tests, not microbenchmarks. Rephrase the
  scatter-quadrant prose ("cheap tests with big ratio swings — noise, not
  real change") in plotting.py, cli.py, and the README metric rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end walkthrough of the benchmarks CLI: registry introspection
(``list --details``, ``show``, ``filter``), a two-snapshot regression
workflow (``run --quick --phase build`` → ``compare`` table → ``plot
--view scatter`` / ``--view compare`` rendered inline via
``IPython.display.HTML``), peak-RSS snapshots (``memory save`` / ``memory
compare``), an "other CLI surfaces" reference table, and the "add a new
model" three-step recipe.

The file is the load-bearing documentation for the suite — README only
covers install and how to open it. CI executes it on every PR via
``python -m benchmarks notebook`` so the examples can't silently rot.
Contributors regenerate the gitignored ``walkthrough.ipynb`` sibling via
``python -m benchmarks notebook --build`` and open it in JupyterLab,
PyCharm, or VSCode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous pin (1.17.4) was a guess; the resolved version in the
dev environment is 1.19.3. Align pyproject so per-version sweep venvs
install the same jupytext the local dev env uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a ``--smoke`` flag to ``sweep`` that runs the same pytest invocation
as the top-level ``smoke`` command in each per-version venv: every
model/phase fires once at the quickest size, no timings, ~10–20 s per
version. Useful before bumping a perf-sensitive pin like ``numpy`` to
confirm every linopy version we'd sweep against still installs, imports,
and exercises the suite cleanly.

Surfaces real binary-compat issues (e.g. a ``netcdf4`` wheel mismatched
against the pinned ``numpy``) that declared-constraint resolution can't
catch on its own.

The smoke pytest args are now a shared ``_SMOKE_PYTEST_ARGS`` constant
so the top-level command and ``sweep --smoke`` stay in sync — single
source for the definition of "smoke."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Delete the module-level ``__iter__`` in registry.py. ``for x in module``
  never consults it, because Python looks up ``__iter__`` on the module's
  type rather than in the module namespace, so it was dead code. Drops the
  unused ``Iterator`` import too.
- Rename ``memory.DEFAULT_PHASES`` → ``MEMORY_PHASES``. It collided with
  ``registry.DEFAULT_PHASES`` (frozenset of 9 phase tags) — same name,
  different shape, both imported elsewhere. Footgun for the next reader.
- Rewrite the stale comment above the ``[benchmarks]`` pyproject extra.
  It said "Not pinned here: numpy / scipy / pandas / xarray" directly
  above the lines pinning exactly those four, and referenced the
  ``requirements.lock`` we just deleted. Replace with the actual story:
  every measurement-relevant direct dep is pinned; sweep installs the
  same set into each per-version venv.
- Rename the CI smoke step from "Execute registry-usage notebook" to
  "Execute walkthrough notebook" to match the file the ``notebook``
  command now executes.
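
The ``__iter__`` point in the first bullet can be checked with a throwaway
module object. This is a minimal sketch; ``fake_registry`` is a hypothetical
stand-in, not the real ``registry.py``:

```python
import types

# Hypothetical stand-in for a module that defines a module-level __iter__.
fake_registry = types.ModuleType("fake_registry")
fake_registry.__iter__ = lambda: iter(["basic", "sos"])

# Calling the function directly works: it is an ordinary module attribute.
direct = list(fake_registry.__iter__())

# The iteration protocol looks __iter__ up on type(fake_registry), i.e. on
# ModuleType, which has none, so the attribute above is never consulted.
try:
    iter(fake_registry)
    iterable = True
except TypeError:
    iterable = False
```

So the function could be called by hand, but ``for x in module`` would still
raise ``TypeError``.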

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orts

Two related dead-surface trims flagged in code review:

- ``SOLVER_BUILD`` was an aspirational phase tag ("generic
  Solver.from_name(..., io_api='direct')") that was declared, exported,
  listed in ``DEFAULT_PHASES``, and added to ``sos``'s phase set — but
  never wired into any test. ``test_solver_handoff.py`` only exercises
  ``TO_HIGHSPY``/``GUROBIPY``/``MOSEK``/``XPRESS``. Remove it from
  registry.py, ``__init__.py`` exports, and the sos spec. If we ever
  want the generic ``from_name(..., io_api='direct')`` path measured,
  it can come back as a real phase with a real test.

- ``benchmarks/models/__init__.py`` was re-exporting 20 names
  (``BASIC_SIZES``, ``build_basic``, …) that nothing outside the
  ``models`` package referenced. The documented access path is
  ``REGISTRY["<name>"]``; the only thing ``__init__.py`` needs to do
  is trigger each submodule's ``register(...)`` side-effect. Collapse
  to a single ``from benchmarks.models import basic, …`` import block.
  Adding a new model is now one new file plus one line in this block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract the *what each phase does to a model* logic from the pytest test
files and ``memory.py`` into a single ``benchmarks/phases.py`` module.
Both drivers now import the same verbs — pytest wraps them in
``benchmark(...)``, memray wraps them in ``Tracker(...)`` — so the
measured operation is defined once.

The drift risk this targets is silent: if ``test_matrices`` grows a new
matrix attribute and ``memory.py``'s inline copy doesn't, the
timing/memory snapshots end up measuring different operations and
``plot`` shows non-overlapping sets, no error. Likewise the solver list
(currently 4 wrappers) was duplicated between ``test_solver_handoff.py``
and a hardcoded ``to_highspy`` in memory.

Touchpoints:

- ``benchmarks/phases.py`` (new): ``touch_matrices``, ``write_lp``
  (with ``progress=False`` pinned here), ``write_netcdf``,
  ``read_netcdf`` re-export, and a ``SOLVER_HANDOFFS`` tuple of
  ``(solver_name, registry_phase_tag, wrapper)``.
- ``test_matrices.py``: drops the local ``_access_matrices``.
- ``test_lp_write.py``: uses ``write_lp`` (pin lives in one place now).
- ``test_netcdf.py``: uses ``write_netcdf`` + ``read_netcdf`` from
  phases.
- ``test_solver_handoff.py``: ``_SOLVER_PHASES`` becomes
  ``SOLVER_HANDOFFS`` from phases; ``_make_params`` loop unchanged.
- ``memory.py``: inline matrices/lp_write/netcdf bodies replaced with
  ``touch_matrices`` / ``write_lp`` / ``write_netcdf`` / ``read_netcdf``
  from phases. The solver-handoff branch now looks up ``"highs"`` by
  name in ``SOLVER_HANDOFFS`` rather than ``[0]`` — reordering the
  tuple no longer silently swaps which solver gets measured.

The id seam — memory.py's hand-rolled ``f"...::test_X[name-n=size]"``
strings vs pytest's collected node ids — is intentionally not
abstracted (the netcdf double-emit and the ``highs-`` solver prefix make
a shared id generator more framework than it's worth). Instead, a new
``benchmarks/test_memory_id_alignment.py`` exercises both sides for one
cheap spec and asserts every memory-emitted id is in pytest's
collection. A test rename now fails this guard immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``sweep`` and ``memory sweep`` had ~80 lines of near-identical per-version
plumbing: uv availability check, ``=== linopy X ===`` banner, tempdir +
uv venv creation, install pass with the same args, ``PYTHONPATH`` setup,
and parallel failure-reason printing. Two copies that would drift the
moment one side gained anything (a wheel-cache flag, a constraint pin, a
different stderr handler).

Extract a ``_provision_venvs(versions, tmp_prefix)`` generator that
yields one ``_ProvisionedVenv`` record per version. On success, the
record carries ``python`` + ``env``; on failure, ``failed_at`` names the
step that broke ("venv" or "install") and the caller skips its
per-version action. Each tempdir cleanup happens when the generator
advances, so ``break``-ing out of the caller's loop still tears down
cleanly via the generator close protocol.
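
A minimal sketch of that generator shape, assuming a simplified record and no
real ``uv`` calls (``ProvisionedVenv``, ``provision`` and the "bad" version
prefix are hypothetical stand-ins for illustration):

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Optional, Sequence


@dataclass
class ProvisionedVenv:
    """Hypothetical, simplified stand-in for the real record."""
    version: str
    workdir: Optional[Path] = None
    failed_at: Optional[str] = None


def provision(versions: Sequence[str], tmp_prefix: str = "sweep-") -> Iterator[ProvisionedVenv]:
    for version in versions:
        # The tempdir lives exactly as long as the caller works on this item.
        with tempfile.TemporaryDirectory(prefix=tmp_prefix) as tmp:
            if version.startswith("bad"):  # pretend the install step failed
                yield ProvisionedVenv(version, failed_at="install")
                continue
            yield ProvisionedVenv(version, workdir=Path(tmp))


seen = []
gen = provision(["0.6.7", "0.7.0"])
for prov in gen:
    seen.append(prov.workdir)
    break  # leave early, before the generator is exhausted
gen.close()  # explicit close; raises GeneratorExit at the yield, running cleanup
```

The explicit ``gen.close()`` only makes the teardown deterministic in the
sketch; it runs the ``with`` block's exit at the suspended ``yield``, which
removes the tempdir.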

After the extraction:

- ``sweep`` shrinks from ~135 lines of venv plumbing + action to ~70
  lines of just the action (smoke pytest invocation vs full
  ``--benchmark-only`` invocation + snapshot check).
- ``memory sweep`` shrinks similarly — only the ``memory save``
  invocation and the snapshot-relocation bookkeeping remain.
- Future sweep flavours get the venv plumbing for free.

No user-facing behaviour change; the failure messages and the banner
output are identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous pin was on the last numpy 1.x release; 2.x has been the
default for ~2 years now and is what most linopy users actually run.
Pinning benchmarks to 1.26.4 meant the suite was measuring a code path
nobody hits anymore.

Verified safe via ``sweep --smoke`` across the realistic sweep set
(0.5.8, 0.6.0, 0.6.7, 0.7.0) — every linopy version installs, imports,
and exercises the suite (every model build / phase fire) cleanly
against numpy 2.4.6.

The pre-existing ``netcdf4`` binary-incompat warning (``numpy.ndarray
size changed``) is unchanged by this bump — it's a wheel-vs-ABI
mismatch from ``netcdf4==1.7.4`` that's present under both numpy 1.26.4
and 2.4.6, doesn't fail any test, and is a separate concern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical sweep --smoke verification:

- linopy <0.5.1 declares ``numpy<2.0`` in its metadata, so any
  numpy 2.x ``==`` pin in our ``[benchmarks]`` extra makes uv refuse
  to resolve those versions (install fails before any code runs).
- Relaxing to ``numpy<2.0`` lets the older versions install and run.
  Verified: 0.3.15, 0.4.0, 0.4.4, 0.5.0, 0.7.0 all pass ``sweep --smoke``
  under the relaxed pin.

uv resolves ``numpy<2.0`` to 1.26.4 on every current platform (the last
numpy 1.x release; numpy is done with the 1.x line), so the practical
reproducibility property of "every per-version venv gets the same numpy"
is preserved despite the looser-looking constraint.

Reverts the bump in 7d3e474. We'll go back to a 2.x ``==`` pin once
we drop pre-0.5.1 from sweep coverage — at which point ``sweep --smoke``
is the right tool to re-verify, same way it found this floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Honour the ``==`` pin convention used by every other measurement-relevant
dep in ``[benchmarks]``. ``numpy<2.0`` in the previous commit gave the
same practical result (uv resolves to 1.26.4) but broke the "every
direct dep is pinned exactly" property the surrounding pins rely on for
reproducibility.

Empirically verified ``sweep --smoke`` still covers the full 0.3.x →
0.7.0 range under the exact pin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…HANDOFFS

Two coupled bugs that were making ``sweep`` produce meaningless
cross-version timings.

**Bug 1: silent linopy shadowing.** ``_provision_venvs`` ran the
per-version pytest with the cwd inherited from the user's shell —
typically the repo root, which contains a ``linopy/`` package (the
one we're developing). Python prepends cwd to ``sys.path`` as
``''``, so ``import linopy`` resolved to the dev tree rather than the
venv's installed version. Every sweep run was measuring dev linopy
against itself; the per-version timings were noise on the same code.
(Previously the function also set ``PYTHONPATH=repo_root`` for
``import benchmarks``, which independently caused the same shadowing
even with a different cwd.)

Fix: create an isolated import root per version — a fresh tempdir
containing only a symlink ``benchmarks → repo_root/benchmarks``. The
sweep callers now run subprocesses with ``cwd=import_dir`` and no
``PYTHONPATH``. ``import benchmarks`` resolves via the symlink;
``import linopy`` falls through to site-packages → the requested
version. Added ``import_dir`` to ``_ProvisionedVenv`` and threaded it
through both ``sweep`` and ``memory sweep`` call sites (memory
discovery now looks under ``import_dir/.benchmarks/memory`` for the
``memory save`` output before moving it to ``output_dir``).

**Bug 2: SOLVER_HANDOFFS eagerly imports linopy.io.to_xpress, which
doesn't exist in any released linopy.** With shadowing in effect we
never noticed; after the isolation fix, even ``sweep --smoke 0.7.0``
fails collection because ``lio.to_xpress`` is an AttributeError.

Fix: build ``SOLVER_HANDOFFS`` via ``getattr(lio, name, None)`` and
filter out wrappers that aren't present in the installed linopy. The
tuple shape stays the same; older versions silently drop solvers
they don't support. ``memory.py``'s ``next("highs", ...)`` lookup
defaults to ``None`` and skips the solver_handoff memory phase rather
than emitting an unmatchable test id.
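
A minimal sketch of the ``getattr`` filter, using a fake namespace in place of
``linopy.io`` so it runs without linopy installed. The candidate table,
``build_handoffs`` and the lower-case tag strings are assumptions for
illustration:

```python
from types import SimpleNamespace

# Hypothetical stand-in for linopy.io in a release without to_xpress.
fake_lio = SimpleNamespace(
    to_highspy=lambda model: ("highs", model),
    to_gurobipy=lambda model: ("gurobi", model),
)

# (solver_name, registry_phase_tag, attribute name on the io module)
_CANDIDATES = (
    ("highs", "to_highspy", "to_highspy"),
    ("gurobi", "to_gurobipy", "to_gurobipy"),
    ("xpress", "to_xpress", "to_xpress"),
)


def build_handoffs(io_module):
    """Keep only the wrappers that the installed io module actually has."""
    handoffs = []
    for solver, tag, attr in _CANDIDATES:
        wrapper = getattr(io_module, attr, None)
        if wrapper is not None:
            handoffs.append((solver, tag, wrapper))
    return tuple(handoffs)


SOLVER_HANDOFFS = build_handoffs(fake_lio)

# Look up by name with a None default, so a missing solver is skipped
# instead of raising or silently picking whatever sits at index 0.
highs = next((h for h in SOLVER_HANDOFFS if h[0] == "highs"), None)
xpress = next((h for h in SOLVER_HANDOFFS if h[0] == "xpress"), None)
```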

Consequence the user should expect: ``sweep --smoke`` against older
linopy versions now surfaces real install / runtime / API
incompatibilities rather than passing silently. Versions whose
metadata installs cleanly but whose code imports fail under our
pinned ``xarray`` / etc. will report ``smoke failed`` — that's the
correct signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xarray 2025.3.0 moved ``xarray.core.rolling`` to a different location.
linopy <=0.5.0 imports it directly, so the previous ``xarray==2025.9.0``
pin made any sweep against those releases fail at ``import linopy`` —
real signal that was masked while sweep was silently running the dev
linopy instead.

Pin to the last release before the rename (2025.1.2). Coverage now
extends down to 0.4.4 cleanly. The realistic floor is 0.4.4 — 0.4.0's
``to_file`` lacks the ``progress`` kwarg, and reaching back further
would need version-specific shims that aren't worth maintaining.

Verified: ``sweep --smoke 0.4.4 0.5.0 0.5.8 0.6.7 0.7.0 -k basic`` all
green; local ``smoke`` still passes on the dev linopy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 0.2.0

linopy added the ``progress`` kwarg to ``Model.to_file`` in 0.4.1. The
suite's ``write_lp`` verb passes ``progress=False`` to keep the
progress-bar overhead out of the measurement, which means anything
older than 0.4.1 raised ``TypeError`` and failed sweep smoke.

Check once at import time (``inspect.signature``) whether the kwarg is
present; if not, fall back to the native call. Branchless on the hot
path — the check resolves once when phases.py loads.
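
A minimal sketch of the one-time signature check, with two fake ``to_file``
functions standing in for the old and new ``Model.to_file`` (``make_write_lp``
and both stand-ins are hypothetical names):

```python
import inspect


def make_write_lp(to_file):
    """Choose the call shape once, based on whether `progress` is accepted."""
    if "progress" in inspect.signature(to_file).parameters:
        def write_lp(path):
            return to_file(path, progress=False)
    else:
        def write_lp(path):
            return to_file(path)
    return write_lp


# Hypothetical stand-ins for a newer and an older to_file signature.
def new_to_file(path, progress=True):
    return (path, progress)


def old_to_file(path):
    return (path, None)


write_new = make_write_lp(new_to_file)
write_old = make_write_lp(old_to_file)
```

The branch runs when the verb is constructed, not on each measured call, which
is the property the commit describes.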

Empirically extends sweep coverage from 0.4.4 down to 0.2.0 with no
other changes — roughly three years of historical releases now in
scope. 0.1.x has further API drift (``add_variables`` signature) and
0.0.x has pre-pyproject metadata that uv can't install, both out of
scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Direct pins in ``[benchmarks]`` keep results reproducible *within* one
sweep call, but unpinned transitive deps can drift between sweep calls
days apart. A delta could then come from a transitive dependency's patch
release rather than the linopy change you wanted to attribute it to.

Add an ``--as-of <DATE>`` flag to both ``sweep`` and ``memory sweep``
that passes ``--exclude-newer`` to uv. The entire transitive
resolution is frozen to releases on or before the date; running the
same sweep set + the same ``--as-of`` value at any later point
reproduces the same dep tree (modulo PyPI yanking).

Plumbed through ``_provision_venvs(as_of=...)`` so both call sites
stay single-source. Default is unchanged — no ``--as-of`` ⇒ latest
resolution, matching prior behaviour.
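
A minimal sketch of how the flag could be threaded into the install command.
Only the argv is built, nothing is executed; ``uv_install_cmd`` and the bare
``linopy==<version>`` requirement are simplifying assumptions, since the real
install pass also brings in the ``[benchmarks]`` pins:

```python
from typing import List, Optional


def uv_install_cmd(python: str, version: str, as_of: Optional[str] = None) -> List[str]:
    """Build a `uv pip install` argv for one per-version venv."""
    cmd = ["uv", "pip", "install", "--python", python, f"linopy=={version}"]
    if as_of is not None:
        # Freeze the whole resolution to releases on or before this date.
        cmd += ["--exclude-newer", as_of]
    return cmd


default_cmd = uv_install_cmd("/tmp/venv/bin/python", "0.7.0")
frozen_cmd = uv_install_cmd("/tmp/venv/bin/python", "0.7.0", as_of="2026-05-29")
```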

Empirically verified: ``--as-of 2026-05-01`` correctly rejects the
install when ``pytest-codspeed==5.0.3`` (released later) is in the
pin set; ``--as-of 2026-05-29`` resolves cleanly and ``sweep --smoke``
passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups to the symlink-isolation fix in e7f9c5b:

1. Preflight check per provisioned venv. After symlinking benchmarks/
   into the import_dir, run a tiny ``python -c`` that imports linopy
   from the cwd we're about to use and asserts ``linopy.__file__`` is
   under the venv's prefix. If a future change reintroduces the
   dev-linopy shadow (PYTHONPATH=repo, missing PYTHONDONTWRITEBYTECODE
   side-effect, pytest import-mode bump, …), this fails loudly with
   "isolation leak: linopy resolved to <path>, not the venv" rather
   than silently corrupting every snapshot in the sweep. New
   ``failed_at`` value ``"isolation"`` lets callers record this the
   same way they already record venv/install failures.

2. ``PYTHONDONTWRITEBYTECODE=1`` in the subprocess env. The symlink
   resolves to the real benchmarks/ source tree, so every sweep
   subprocess would otherwise write fresh ``.pyc`` files into the
   user's working tree — harmless (Python is held constant so the
   bytecode is valid) but it mutates the checkout and would risk write
   contention if sweep ever becomes parallel. One env var keeps each
   run pure.
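
The preflight in item 1 boils down to a path-containment check. A minimal
sketch of that check on plain path strings (``assert_isolated`` is a
hypothetical name; the real preflight runs inside a ``python -c`` subprocess
of the venv):

```python
from pathlib import PurePosixPath


def assert_isolated(linopy_file: str, venv_prefix: str) -> None:
    """Fail loudly if linopy was imported from outside the venv prefix."""
    resolved = PurePosixPath(linopy_file)
    if PurePosixPath(venv_prefix) not in resolved.parents:
        raise AssertionError(
            f"isolation leak: linopy resolved to {resolved}, not the venv"
        )


# Passes silently: the module lives under the venv prefix.
assert_isolated(
    "/tmp/venv/lib/python3.12/site-packages/linopy/__init__.py", "/tmp/venv"
)
```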

Verified: shadowing simulated by re-setting ``PYTHONPATH=repo`` is now
caught by the preflight with the expected assertion message; happy
path ``sweep 0.7.0 --smoke`` still passes; ``benchmarks/__pycache__``
is untouched after a sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-version isolation root held a symlink to repo_root/benchmarks.
Replace it with a filtered copy so the sweep runs on Windows (no symlink
privilege needed) and no per-version subprocess — including its
__pycache__ writes — can touch the working tree.

Drops the now-redundant PYTHONDONTWRITEBYTECODE: with a copy, bytecode
lands in the throwaway tempdir, so the working tree is structurally
untouchable rather than protected by an env var. ignore_patterns skips
the executed notebook and cruft to keep the copy cheap.
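
A minimal sketch of such a filtered copy with the standard library. The
function name and the exact ignore patterns are assumptions; the commit only
says the executed notebook and cruft are skipped:

```python
import shutil
import tempfile
from pathlib import Path


def copy_benchmarks(src: Path, import_dir: Path) -> Path:
    """Copy the benchmarks package into an isolated import root."""
    dest = import_dir / "benchmarks"
    shutil.copytree(
        src,
        dest,
        # Hypothetical pattern set: bytecode, notebooks, stored snapshots.
        ignore=shutil.ignore_patterns("__pycache__", "*.pyc", "*.ipynb", ".benchmarks"),
    )
    return dest
```

Because the subprocess then imports from the copy, any bytecode it writes ends
up in the throwaway directory rather than in the checkout.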

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`benchmarks.bench` times or memory-profiles any callable in-process on
the current tree — a registry builder, a phase verb on a hand-built
model, a one-off lambda — and returns a result that round-trips through
the existing snapshot/plot machinery (`load_long_df`). Three entry
points (`time`, `memory`, `compare`) plus `TimingResult` / `MemoryResult`
/ `ResultSet` with `to_snapshot` / `to_df` / rich Jupyter reprs.

To make the memray peak-measurer reusable, memory.py no longer raises at
import on Windows: the check moves into a `_require_memray()` called by
each measuring entry point, and `_measure_peak` is promoted to public
`measure_peak` (back-compat alias kept). bench reuses it for the memory
path.

Adds a "Benchmarking custom things" section to walkthrough.md (executes
end-to-end under the CI notebook run) and re-exports `bench` from the
package.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Annotate the benchmark modules so `mypy benchmarks/*.py
benchmarks/models/*.py` passes (28 files, 0 errors). Covers the real
type gaps — `SPEC: ModelSpec | None` in the conditionally-registered sos
/ piecewise models, narrowing `prov.import_dir` past the `failed_at`
guard in `memory sweep`, and return/arg annotations on the plotting
helpers — plus `-> None` / fixture-arg annotations across the phase
tests and conftest.

benchmarks/* stays in the mypy `exclude`, so this isn't enforced in CI;
it just makes an explicit `mypy benchmarks/...` run come back clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two related cleanups after noticing bench had quietly re-derived the
snapshot contract and a thinner pytest-benchmark.

Extract a dependency-free benchmarks/snapshot.py that owns the format:
the two on-disk JSON shapes (write_timing_snapshot / write_memory_snapshot
/ load_snapshot), the test-id grammar (parse_test_id / synth_test_id),
and load_long_df. plotting, bench, and memory now all depend on it
instead of bench reaching sideways into plotting's private _parse_test_id
and three writers hand-rolling the same JSON. plotting shrinks to "plotly
views over a long DataFrame"; memory.save and bench.to_snapshot share one
writer; __init__'s load_long_df re-export drops its plotly-pulling path.

Rebase bench.time on timeit.Timer.autorange: calibrate an inner
iteration count so timer resolution stops dominating fast callables (the
old one-call-per-round loop was unstable in exactly that regime, and min
is the headline stat), then sample per-iteration time across rounds.
Records stats["iterations"]. Still explicitly not interchangeable with
suite numbers.
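
A minimal sketch of the calibrate-then-sample idea using only the standard
library. The function name and the returned keys other than "iterations" and
"min" are assumptions, not the actual bench.time implementation:

```python
import timeit


def time_callable(fn, rounds: int = 5) -> dict:
    timer = timeit.Timer(fn)
    # autorange picks an inner loop count large enough that one batch
    # takes at least 0.2 s, so timer resolution stops dominating.
    iterations, _ = timer.autorange()
    # Each round times `iterations` calls; divide to get per-call time.
    totals = timer.repeat(repeat=rounds, number=iterations)
    per_iter = [total / iterations for total in totals]
    return {
        "min": min(per_iter),
        "mean": sum(per_iter) / len(per_iter),
        "rounds": rounds,
        "iterations": iterations,
    }


stats = time_callable(lambda: sum(range(100)), rounds=3)
```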

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the ~340-line per-version provisioning block and both sweep bodies
into a new benchmarks/sweep.py (provision_venvs + run_sweep /
run_memory_sweep). cli.py's `sweep` and `memory sweep` commands become
thin shims that resolve their options (phase -> test file, smoke args)
and delegate. No behavior change — command set, flags, and help text are
identical; verified with a live one-version smoke sweep.

Per the plan's Item B but adapted in two ways:
- The shared-discovery helper goes into the existing snapshot.py as
  discover_snapshots() rather than a new snapshots.py — a sibling module
  one plural away from snapshot.py would be a nasty import footgun.
  _suggest_snapshots (typer-coupled presentation) stays in cli.py and
  calls it.
- run_memory_sweep moves too (not just run_sweep), so the provisioning
  generator stays private to sweep.py instead of being imported across a
  module boundary; all three memory subcommands are now thin.

cli.py: 1218 -> 882 lines (the remainder is command signatures + help).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It duplicated the intro's pointer to `--help` and was the only
hand-maintained, unverified block in an otherwise all-executed
walkthrough. Discovery already routes through `python -m benchmarks
--help` (and `--help` on any subcommand), per the intro.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an "In Python — load straight from file" subsection to the diff
section: load the baseline/candidate snapshots with load_long_df, then
pivot to a one-column-per-snapshot DataFrame with a candidate/baseline
ratio. Demonstrates the programmatic path the CLI views sit on, for
custom analysis from file. Executes end-to-end under the CI notebook run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann
Collaborator Author

FBumann commented May 29, 2026

Dogfooding: does this suite actually help during dev?

Short version: yes. I pointed it at a real, currently-unreleased linopy change — the CSR / freeze_constraints work ("30–120× faster matrix generation for direct solver APIs", "~10× faster to_highspy") — and it both refused a false positive and isolated the real effect.

What happened

  1. No false positive. A cross-version sweep 0.6.7 0.7.0 --phase solver_handoff came back flat (~1.0×). Correct — the CSR work isn't in 0.7.0 (it's unreleased), so the suite reported no change instead of inventing one.
  2. Why I used bench, not sweep (corrected — an earlier version of this comment blamed a highspy version mismatch; that was wrong). sweep pins highspy==1.13.1 into every per-version venv, so the solver is held constant across arms — there's no highspy confound within a sweep. The real blocker is that the CSR win is gated on the per-model flag freeze_constraints=True, and no registry model sets it (to_highspy doesn't auto-freeze — proven below: mutable 18 ms vs frozen 1 ms on the same linopy). So a stock sweep of 0.7.0 → master would run the mutable path in both arms and show ~nothing. Demonstrating it via sweep would mean adding a frozen-constraints model variant (plus a git/file spec for the unreleased "after" arm — sweep accepts those, and now labels the snapshot by the spec's ref/sha so e.g. git+…@<sha> writes a clean linopy-<sha>.json). bench was simply the quickest way to toggle freeze in-process: same linopy, same highspy, only the constraint representation differs.
  3. End-to-end killed the caching confound. Building the model once and timing only the handoff hides the freeze cost (it's paid eagerly at add_constraints). So I measured build-only and the fused build+handoff too, and the parts roughly add up (mutable 24 + 18 = 42 vs 44 measured, frozen 35 + 1 = 36 vs 37 measured), a nice consistency check on bench itself.

Results

Representative numbers (Apple M-series, highspy 1.12, the sparse many-term balance model at n_buses=300). Absolute ms are machine-dependent; the ratios and their direction reproduce.

| measurement | mutable | frozen (CSR) | frozen vs mutable |
|---|---|---|---|
| to_highspy matrix construction | 18 ms | 1 ms | ~16–21× faster |
| model build (fresh) | 24 ms | 35 ms | ~1.7× slower (freeze paid upfront) |
| build + handoff, end-to-end | 44 ms | 37 ms | ~1.2× faster, grows per extra handoff |

The magnitude tracks terms-per-row: dense few-term carbon-management ≈ 2.9×, basic ≈ 1.7×, this 300-term model ≈ 20× — exactly matching "for constraints with many terms". Net: freezing trades ~11 ms upfront build for ~17 ms saved per handoff → a single build+solve already wins slightly, and anything that touches the matrix again (LP write, re-export, iterative re-solve) compounds the win. "Frozen is slower" only holds if you build and never hand off — not a real workload.
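
The trade-off can be worked through with the representative per-part numbers from the results. This is back-of-the-envelope arithmetic on summed parts, not a new measurement, and the names are made up for the sketch:

```python
# Representative per-part timings in ms, taken from the results above.
BUILD = {"mutable": 24, "frozen": 35}
HANDOFF = {"mutable": 18, "frozen": 1}


def total_ms(kind: str, handoffs: int) -> int:
    """One build plus `handoffs` matrix handoffs, summed from the parts."""
    return BUILD[kind] + handoffs * HANDOFF[kind]


for k in range(4):
    print(k, total_ms("mutable", k), total_ms("frozen", k))
# prints:
# 0 24 35
# 1 42 36
# 2 60 37
# 3 78 38
```

With zero handoffs the mutable model is ahead by the 11 ms freeze cost; from the first handoff on, the frozen model is ahead and the gap widens by 17 ms per additional handoff.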

Reproduce

bench_csr.py — self-contained, seeded, no pypsa
```python
import numpy as np
import pandas as pd
import xarray as xr

import linopy
import linopy.io as lio
from benchmarks import bench


def build_sparse(n_buses: int, *, freeze: bool) -> linopy.Model:
    rng = np.random.default_rng(42)
    n_lines, n_time = n_buses, min(n_buses, 24)
    buses = pd.RangeIndex(n_buses, name="bus")
    lines = pd.RangeIndex(n_lines, name="line")
    time = pd.RangeIndex(n_time, name="time")
    bus_from = np.arange(n_lines)
    bus_to = (bus_from + 1) % n_buses

    m = linopy.Model(freeze_constraints=freeze)
    gen = m.add_variables(lower=0, coords=[buses, time], name="gen")
    flow = m.add_variables(lower=-100, upper=100, coords=[lines, time], name="flow")
    incidence = np.zeros((n_buses, n_lines))
    incidence[bus_to, np.arange(n_lines)] = 1
    incidence[bus_from, np.arange(n_lines)] = -1
    incidence_da = xr.DataArray(incidence, coords=[buses, lines])
    demand = xr.DataArray(rng.uniform(10, 100, size=(n_buses, n_time)), coords=[buses, time])
    m.add_constraints(gen + (flow * incidence_da).sum("line") == demand, name="balance")
    m.add_objective(gen.sum())
    return m


def report(label, rs):
    by = {r.label: r.stats["min"] for r in rs.results}
    print(f"{label}: mutable={by['mutable']*1e3:.0f}ms frozen={by['frozen']*1e3:.0f}ms "
          f"-> frozen {by['mutable']/by['frozen']:.1f}x")


N = 300
mut, frz = build_sparse(N, freeze=False), build_sparse(N, freeze=True)

report("handoff   ", bench.compare({
    "mutable": lambda: lio.to_highspy(mut, set_names=False),
    "frozen":  lambda: lio.to_highspy(frz, set_names=False)}, rounds=20))
report("build     ", bench.compare({
    "mutable": lambda: build_sparse(N, freeze=False),
    "frozen":  lambda: build_sparse(N, freeze=True)}, rounds=20))
report("end-to-end", bench.compare({
    "mutable": lambda: lio.to_highspy(build_sparse(N, freeze=False), set_names=False),
    "frozen":  lambda: lio.to_highspy(build_sparse(N, freeze=True), set_names=False)}, rounds=20))
```

```shell
PYTHONPATH=. python bench_csr.py   # needs a linopy with Model(freeze_constraints=...)
```

So the suite catches real wins, refuses false ones, and attributes them to the right regime (terms-per-row). The reason this particular one needed bench rather than sweep is that the win is gated on a per-model flag (freeze_constraints) no registry model exercises — bench toggled it in-process on one linopy + one solver. Once CSR ships in a release (and/or a frozen model variant is added to the registry), a plain cross-version sweep would capture it directly.


This investigation and write-up (including the measurements above) were produced with Claude Code.

sweep named snapshots linopy-<version>.json with the raw version arg
interpolated. Fine for plain releases, but a git/file spec
(git+...@<sha>, linopy @ file://...) put slashes in the filename and the
snapshot write failed. Add _snapshot_label: for a spec with an @-ref take
the part after the last @ (sha/tag/branch), then sanitise to a safe path
segment. So git+...@<sha> -> linopy-<sha>.json (clean and reproducible);
plain releases are unchanged. Applied to both sweep and memory sweep.
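
A minimal sketch of the labelling rule described above. The regex, the helper
body and the example URL are assumptions for illustration, not the actual
_snapshot_label implementation:

```python
import re


def snapshot_label(spec: str) -> str:
    """Turn a version or install spec into a filename-safe label."""
    # For specs with an @-ref, keep only what follows the last "@".
    ref = spec.rsplit("@", 1)[-1].strip()
    # Collapse anything that is not safe in a path segment into "-".
    return re.sub(r"[^A-Za-z0-9._-]+", "-", ref).strip("-")


plain = snapshot_label("0.7.0")
from_git = snapshot_label("git+https://example.invalid/linopy.git@abc1234")
from_file = snapshot_label("linopy @ file:///tmp/src/linopy")
```

A plain release keeps its version string, a git spec reduces to the ref after
the last "@", and a file spec loses its slashes, so
`f"linopy-{label}.json"` is always a single path segment.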

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>