benchmarks: reusable registry, new model types, new phases, CI smoke #735
FBumann wants to merge 68 commits
…smoke
Refactors the internal benchmark suite around a reusable ModelSpec / REGISTRY pattern so adding a model is one self-registering file with metadata (features, applicable phases, sizes, optional deps). Other tests and scripts can import it via `from benchmarks import REGISTRY`.
New model specs cover gaps in the existing coverage:
- milp: general (non-binary) integers (capacitated facility location)
- qp: continuous quadratic objective (diagonal portfolio)
- sos: SOS1 multi-mode generation (Model.add_sos_constraints)
- piecewise: piecewise-linear fuel cost (Model.add_piecewise_formulation)
- masked: sparse-route transportation using mask= on add_variables
SOS and piecewise specs gate their own registration on API availability, so the suite stays runnable on older linopy.
New phase tests:
- test_solver_handoff.py: parametrizes lp.io.to_highspy/to_gurobipy/to_mosek/to_xpress across applicable models, skipping per-solver when the solver isn't installed. Uses stable lp.io wrappers (not the new Solver.from_name API) for backward compatibility.
- test_netcdf.py: separate to_netcdf / read_netcdf benchmarks.
CI: new benchmark-smoke.yml runs the suite under --quick --benchmark-disable on PRs, so refactors that break a model spec get caught early. Timings stay off CI (~35s smoke locally, no regression tracking).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
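A minimal sketch of the self-registering pattern described above, under stated assumptions: the field names, the ``register`` / ``filter_by`` helpers, and the stand-in ``_OldModel`` class are illustrative only and are not taken from the PR's actual ``benchmarks`` package.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical metadata record for one benchmark model."""

    name: str
    build: Callable[[int], object]
    sizes: tuple[int, ...]
    features: frozenset = frozenset()
    phases: frozenset = frozenset({"build"})
    requires: tuple[str, ...] = ()


REGISTRY: dict[str, ModelSpec] = {}


def register(spec: ModelSpec) -> ModelSpec:
    """Add a spec to the registry, refusing duplicate names."""
    if spec.name in REGISTRY:
        raise ValueError(f"duplicate model spec: {spec.name}")
    REGISTRY[spec.name] = spec
    return spec


def filter_by(feature: Optional[str] = None, phase: Optional[str] = None) -> list[ModelSpec]:
    """Return specs matching the given feature and/or phase."""
    return [
        spec
        for spec in REGISTRY.values()
        if (feature is None or feature in spec.features)
        and (phase is None or phase in spec.phases)
    ]


# Each model file would end with a call like these (placeholder builders).
register(ModelSpec("basic", lambda n: {"n": n}, sizes=(10, 100, 1600)))
register(ModelSpec("milp", lambda n: {"n": n}, sizes=(10, 50),
                   features=frozenset({"integer"})))


class _OldModel:
    """Stand-in for a linopy Model that predates the SOS API."""


# Gate registration on API availability so older versions skip the spec.
_API_AVAILABLE = hasattr(_OldModel, "add_sos_constraints")
if _API_AVAILABLE:
    register(ModelSpec("sos", lambda n: {"n": n}, sizes=(10,),
                       features=frozenset({"sos"})))
```

With this shape, a consumer only needs the registry: ``filter_by(feature="integer")`` yields the ``milp`` spec, and the gated ``sos`` spec is simply absent when the API is missing.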
The default ``pytest benchmarks/`` run now skips the slowest 1-2 sizes per spec (e.g. knapsack at 1M, basic at 1600, pypsa_scigrid at >50) so a full timing pass completes in ~2 minutes instead of 20-45.
ModelSpec grows a ``long_threshold`` mirror of ``quick_threshold``:
- ``--quick`` → ``size <= quick_threshold`` (CI smoke)
- default → ``size <= long_threshold`` (medium-cost regression)
- ``--long`` → no cap (full sweep)
Verified locally:
- --quick: 227 passed / 230 skipped / 35s
- default: 333 passed / 124 skipped / 45s
- --long : 457 passed / 0 skipped / 2m13s
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop pypsa_scigrid from --quick entirely (quick_threshold=0). PyPSA import + example loading dominates the smoke wall-clock; the model still runs in default and --long modes.
- Lower every other spec's quick_threshold to its smallest size, so --quick exercises one size per model across all phases. The default tier (which uses long_threshold) still gives broad regression coverage.
Verified locally:
- --quick: 85 passed / 372 skipped / 18.5s (was 35s)
- default: 333 passed / 124 skipped / 44.8s (unchanged)
- --long : 457 passed / 0 skipped / 2m11s (unchanged)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
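The three size tiers from the two commits above can be sketched as a single predicate. This is an illustration only; the function name ``select_sizes`` and the ``tier`` argument are assumptions, not the PR's actual conftest code.

```python
from typing import Literal

Tier = Literal["quick", "default", "long"]


def select_sizes(sizes: list[int], quick_threshold: int,
                 long_threshold: int, tier: Tier) -> list[int]:
    """Return the sizes that a given tier would run for one spec."""
    if tier == "quick":
        cap = quick_threshold      # --quick: size <= quick_threshold
    elif tier == "default":
        cap = long_threshold       # default: size <= long_threshold
    else:
        return list(sizes)         # --long: no cap
    return [n for n in sizes if n <= cap]


sizes = [10, 100, 1600]
quick = select_sizes(sizes, quick_threshold=10, long_threshold=100, tier="quick")
default = select_sizes(sizes, quick_threshold=10, long_threshold=100, tier="default")
full = select_sizes(sizes, quick_threshold=10, long_threshold=100, tier="long")

# quick_threshold=0 drops a spec from --quick entirely, since no size is <= 0.
dropped = select_sizes([10, 50], quick_threshold=0, long_threshold=50, tier="quick")
```

Setting ``quick_threshold`` to the smallest size gives exactly one size per model under ``--quick``, which matches the behaviour described in the second commit.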
benchmarks/notebooks/registry_usage.py is the canonical walkthrough for the model registry. Authored in jupytext percent format so it triples as: - runnable Python script (CI executes it on every PR) - notebook in JupyterLab / VSCode with the jupytext extension - readable doc on GitHub (markdown cells render directly) Covers: import, lookup by name, iterate, filter_by feature/phase, parametrize-your-own-pytest pattern, one-off tracemalloc profiling, and the three CLI tiers. CI: benchmark-smoke.yml gains an "Execute registry-usage notebook" step right after the pytest smoke — so doc rot fails the build instead of hiding until someone next opens the file. README: new "Worked walkthrough" subsection points at the notebook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace jupytext-style ``registry_usage.py`` with a proper ``registry_usage.ipynb`` — matches the repo convention (examples/*.ipynb, nbsphinx, nbstripout). CI executes it via ``jupyter nbconvert --execute``. - Add ``__repr__`` (one-line summary) and ``_repr_html_`` (attribute table) to ModelSpec. Visible in pytest -v output, in interactive Python, and as rich HTML in Jupyter cells. - Notebook simplified to lean on the new reprs: explicit-attribute prints in sections 2-5 replaced by bare expression evaluations. - README points at the .ipynb and notes the "launch jupyter from repo root" convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``python -m benchmarks <command>`` with typer subcommands:
- list / show / filter — introspect the registry
- smoke — pytest --quick --benchmark-disable (CI)
- run [--long --phase --model --filter --json]
— pytest --benchmark-only with knobs
- notebook — execute the registry-usage notebook
- memory save/compare — replaces the argparse main in memory.py
Modern typer style throughout: Annotated[...] for every parameter,
Literal[...] for the --phase choice, function docstrings for command
help. ``--help`` is auto-generated and is the source of truth — README
and the notebook just point at it instead of duplicating the menu.
CI smoke now calls ``python -m benchmarks smoke`` and
``python -m benchmarks notebook``. memory.py keeps its save/compare
functions but loses the argparse layer. typer added to the [benchmarks]
extra.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups after checking typer's docs:
- Pin typer to the latest release (==0.26.2) in the [benchmarks] extra, so the CLI's behaviour is reproducible across dev / CI / contributor machines.
- Switch ``smoke`` and ``run`` from the ``extra: list[str]`` argument to the idiomatic ``typer.Context`` + ``context_settings`` pattern (allow_extra_args, ignore_unknown_options). With the old style, any trailing ``--flag`` would be parsed as an unknown option and rejected; with ctx.args, ``python -m benchmarks run --long -- --tb=short -x`` actually works.
Other patterns already match typer's recommended style: Annotated[...], Literal for choice params, docstrings for command help, sub-apps via add_typer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two layers of pinning for stable measurement: - ``[benchmarks]`` extra in pyproject pins the test infra exactly (pytest, pytest-benchmark, pytest-memray, pypsa, highspy, netcdf4, nbconvert, typer). Loose enough that the sweep workflow can install varying linopy versions on top. - ``benchmarks/requirements.lock`` is the full transitive resolution (numpy, scipy, pandas, xarray, plus everything else). Generated via ``uv pip compile --no-emit-package linopy`` so the lockfile pins the *environment around linopy* without pinning linopy itself — that lets the same lockfile work for both current-tip regression runs and cross-version sweeps. README clarifies that the lockfile gives consistency over time on the same machine, not absolute reproducibility across machines (CPU / cache / memory bandwidth still matter). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks sweep 0.5.0 0.6.0 0.7.0`` builds a fresh venv per version with uv, installs the benchmark infra (lockfile by default, or the [benchmarks] pinned subset with --no-use-lock) plus the target linopy in a single resolution pass, and runs the suite. Snapshots land in ``<output-dir>/linopy-<version>.json``. Useful for bootstrapping a perf history against published linopy releases. The current benchmark code runs against each linopy version (constant measurement layer); the ``_API_AVAILABLE`` gates on sos / piecewise specs make older linopy versions skip those phases gracefully. Verified locally: ``sweep 0.7.0 --quick --no-use-lock`` runs end-to-end in ~2 min (uv installs 57 packages in 200ms; the rest is the benchmark run). Plain releases (0.4.0) and pip specs (git+https://...) both work via the ``_linopy_install_spec`` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README previously duplicated content from three sources: - the notebook (models table, registry-usage code blocks) - ``--help`` (quick-reference command list) - a stale memory.py invocation (since replaced by ``memory save/compare``) After the consolidation each surface has a clear single job: - README: 1-paragraph what, setup (uv sync / lockfile), size-tier table (architectural), pointers to the notebook + ``--help``, metrics blurb. - ``notebooks/registry_usage.ipynb``: the walkthrough — registry import, lookup / iterate / filter, parametrize your own pytest, profiling. - ``python -m benchmarks --help``: command reference, autogenerated by typer from docstrings / Annotated[..., Option(...)] declarations. Drops ~140 lines from the README; nothing actually disappears — it just lives in the one place that owns it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pypsa removed from the [benchmarks] pinned set, from the sweep
``--no-use-lock`` install list, and from the lockfile. The
``test_pypsa_carbon_management.py`` module uses ``pytest.importorskip``
so collection no longer fails without pypsa; ``pypsa_scigrid`` already
had ``requires=("pypsa",)`` so its phase tests skip gracefully.
Install pypsa separately when you want those benchmarks.
Notebook (registry_usage.ipynb) rewritten as a proper operator guide:
- Architecture overview + per-phase measurement table up front.
- Registry walkthrough (lookup / iterate / filter) kept as the spine.
- Reuse patterns (parametrize-your-own-pytest, tracemalloc spot check).
- ``Running`` section now embeds ``--help`` output live via a
``show_help()`` helper that shells out to ``python -m benchmarks ...
--help``. The doc stays in sync with the typer implementation
automatically — change a flag in cli.py, re-run the notebook,
documentation updates.
- New sections cover timing snapshots, memory snapshots, the
cross-version sweep, and lockfile regeneration.
README gains an explicit "pypsa is optional" note in setup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rough
Mirrors ``run``'s filter knobs and applies them to every version's
pytest invocation. Also switches to the ``typer.Context`` +
``context_settings`` pattern so trailing args after ``--`` are
forwarded to pytest verbatim (same shape ``smoke`` / ``run`` use).
python -m benchmarks sweep 0.6.7 --phase build --model basic
python -m benchmarks sweep 0.6.7 -- --tb=short -x
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks compare a.json b.json [-- --columns=...]`` shells out to ``pytest-benchmark compare`` so the whole suite stays under one entry point. Accepts any number of snapshots; first is the baseline. When called with no arguments — or with paths that don't exist — it prints a copy-paste-ready list of snapshots found under ``.benchmarks/`` (including ``.benchmarks/sweep/`` for cross-version runs). If nothing's saved yet, points at the ``run --json`` flow. For memory snapshots use ``memory compare`` instead — different format, different tool. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-paste) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ame) pytest-benchmark's own default emits 10 columns side-by-side, which is unreadable for any non-trivial comparison. Wrapper now prepends ``--columns=median,iqr --sort=name`` so the table is two stats wide and the (baseline, candidate) pair of each test sits together alphabetically. Defaults are only applied when the user hasn't already set the flag, so trailing pass-through overrides still work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arg split
Two fixes for the ``compare`` UX surfaced by the cross-version sweep:
- Default to ``--group-by=fullname`` so each test gets its own mini table showing (baseline, candidate) side-by-side with the parenthesized auto-ratio per column. Easy to scan ``(>1.10)`` for regressions in the median column. Combined with the existing ``--columns=median,iqr --sort=name`` defaults, the output goes from 10-columns-wide-on-one-line to a focused two-column per-test view.
- Switch ``compare`` away from a positional ``list[Path]`` argument and parse ``ctx.args`` by hand: typer's positional list was greedily grabbing trailing ``--group-by=fullname`` etc. (and the ``--`` separator didn't escape it either). Now arg-splitting is explicit: anything starting with ``-`` is pytest-benchmark pass-through, everything else is a snapshot path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tried switching to the canonical typer pattern (``--`` separator for pass-through) but typer's positional ``list[Path]`` + ``allow_extra_args`` still greedily ate the trailing options. There's no clean typer/click idiom for "list-typed positional + pass-through" — workarounds are manual splitting, bounding the positional count, or named flags. Manual splitting is the most pragmatic: snapshots come first, once we see any flag-like token the rest is forwarded to pytest-benchmark. That preserves things like ``--histogram=/tmp/hist/cmp`` (built-in SVG-per-test plotting), ``--csv=out.csv``, ``--group-by=fullname``, and the value-taking flags whose value doesn't start with ``-``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
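The manual split described above, plus the "defaults only when the user hasn't set the flag" rule from the earlier ``compare`` commits, could look roughly like the sketch below. The helper names ``split_compare_args`` and ``with_defaults`` are hypothetical; the default flag values are the ones quoted in the commit messages.

```python
DEFAULT_FLAGS = {
    "--columns": "median,iqr",
    "--sort": "name",
    "--group-by": "fullname",
}


def split_compare_args(args: list[str]) -> tuple[list[str], list[str]]:
    """Snapshots come first; from the first flag-like token on, forward everything."""
    for i, token in enumerate(args):
        if token.startswith("-"):
            return args[:i], args[i:]
    return list(args), []


def with_defaults(passthrough: list[str], defaults: dict[str, str]) -> list[str]:
    """Prepend each default flag unless the user already supplied it."""

    def already_set(flag: str) -> bool:
        return any(tok == flag or tok.startswith(flag + "=") for tok in passthrough)

    extra = [f"{flag}={value}" for flag, value in defaults.items() if not already_set(flag)]
    return extra + passthrough


snapshots, forwarded = split_compare_args(["a.json", "b.json", "--csv", "out.csv"])
final_flags = with_defaults(["--sort=mean"], DEFAULT_FLAGS)
```

Because everything after the first flag is forwarded untouched, a value-taking flag such as ``--csv out.csv`` keeps its value even though ``out.csv`` does not start with ``-``.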
Three opinionated interactive HTML views over pytest-benchmark JSONs, auto-picked from snapshot count or set explicitly via ``--view``:
- **compare** (2 snapshots): horizontal bar chart of per-test median delta, sorted by magnitude, green→red colormap. The "did this PR regress anything?" picture in one glance, vs pytest-benchmark's 60-individual-SVGs which are useless for that workflow.
- **sweep** (3+ snapshots): heatmap of median ratio relative to the first snapshot, rows = tests, columns = labels. Pairs with the ``sweep`` subcommand.
- **scaling** (1 snapshot): log-log median vs ``n`` for size-parametrized tests (e.g. ``[basic-n=10..1600]``), faceted by phase. Shows whether linopy's complexity scales as expected.
plotly==6.7.0 pinned in [benchmarks]; lockfile regenerated. plotly is lazy-imported inside ``plot`` so the rest of the suite stays usable without it (with a clear error if a user tries ``plot`` and it's missing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er absolutes
- New ``benchmarks/plotting.py`` module owns the three views
(``plot_compare`` / ``plot_sweep`` / ``plot_scaling``) plus a
``RENDERERS`` dispatch dict. cli.py drops ~140 lines and just imports
``PlotView`` + ``RENDERERS``; plotly is still lazy-loaded inside the
view functions so importing the module without plotly works.
- ``compare`` bar chart and ``sweep`` heatmap now use ``text_auto``
so values render inside each bar / cell.
- Hover info upgraded:
- compare hover shows the per-test median of *both* snapshots
(formatted to 4 significant figures) in addition to the delta %.
- sweep hover shows the absolute median (s) alongside the ratio, via
a customdata + hovertemplate plumbed through ``update_traces``.
scaling view already shows the absolute median on hover by virtue of
being a line chart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ride
For microbenchmarks the lowest observed time is closest to the "true" cost: background noise (GC, context switches, thermal throttling) can only slow things down. pytest-benchmark's own ``--sort`` default is ``min`` for the same reason; LLVM's perf guide, JMH, Google Benchmark and Alexandrescu's "Speed is found in the minds of people" all argue similarly.
Changes:
- ``plot`` defaults to ``--metric min`` (was median). Accepts ``--metric median|mean|max`` to override. The metric drives the bar values, heatmap ratios, scaling-curve y-axis, and the hover labels.
- ``plot_compare`` / ``plot_sweep`` / ``plot_scaling`` in ``benchmarks/plotting.py`` all take a ``metric: Metric = "min"`` arg.
- ``compare`` table defaults to ``--columns=min,iqr --sort=min`` (was median,iqr / name). The auto-ratios next to each ``min`` flag regressions in the same readable form.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For a suite where test costs span six orders of magnitude (knapsack microsecond builds vs PyPSA carbon at 2.4 s), sorting by % delta overweights cheap tests: a 100% regression on a 1µs test ranks above a 1% regression on a 2.4 s test, but the absolute impacts are 1µs vs 24ms.
Changes:
- Default sort is now ``absolute`` (``b - a`` in seconds). Bar values are the time delta with SI-prefix formatting on the x-axis (24 ms, 240 µs, etc.). Big actual-time impacts float to the bottom.
- ``--sort relative`` keeps the old percent behaviour.
- Both ``delta_abs`` and ``delta_pct`` are surfaced in hover regardless of which one drives the sort, so you can read off whichever lens you need.
- ``plot_sweep`` / ``plot_scaling`` accept a ``sort`` arg for uniform signature but ignore it (no two-snapshot diff there).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
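The ranking difference is easy to reproduce with two made-up tests that mirror the numbers in the message. The test names and timings below are illustrative, not measured values.

```python
# (baseline seconds, candidate seconds) for two hypothetical tests
timings = {
    "knapsack_build": (1e-6, 2e-6),    # +100 %, but only 1 µs slower
    "pypsa_carbon": (2.4, 2.424),      # +1 %, but 24 ms slower
}


def delta_abs(a: float, b: float) -> float:
    """Absolute change in seconds (candidate minus baseline)."""
    return b - a


def delta_pct(a: float, b: float) -> float:
    """Relative change in percent of the baseline."""
    return (b - a) / a * 100.0


by_relative = sorted(timings, key=lambda t: delta_pct(*timings[t]), reverse=True)
by_absolute = sorted(timings, key=lambda t: delta_abs(*timings[t]), reverse=True)
```

Sorting by percent puts the microsecond-scale test first, while sorting by seconds puts the 24 ms regression first, which is the motivation for making ``absolute`` the default.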
The compare bar chart forces a choice between sorting by relative % or absolute Δ. Both have blind spots: pure-relative makes microbenchmark noise look catastrophic, pure-absolute hides real algorithmic regressions on fast paths. The two-axis scatter resolves the tension visually.
Per test:
- x = baseline time (log scale)
- y = candidate / baseline ratio
- colour = absolute Δ
A point is a real regression worth chasing only when it sits in the top-right: slow tests that got slower. Top-left (high ratio, tiny absolute Δ) reads as microbenchmark noise; bottom-right (high baseline time, ratio ≈ 1) was already slow and didn't change. A dashed reference line at ``y=1`` makes "no change" trivial to see.
The view is never auto-picked (``compare`` still wins for 2 snapshots); pass ``--view scatter`` explicitly to get it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two-axis scatter now scales beyond a single baseline-vs-candidate pair. With N>=3 inputs the first is still the baseline (reference); each subsequent snapshot becomes one animation frame. Use the slider / play button to scrub through versions and watch tests drift across releases. Implementation: - First snapshot is the baseline. Skipped from the frame set (would trivially be y=1 everywhere). - Each subsequent snapshot contributes points at (baseline_time, ratio, Δ) per overlapping test. ``animation_frame="version"`` does the per-frame slicing; ``category_orders`` preserves input order in the slider so the timeline reads left→right. - ``range_x`` / ``range_y`` are pinned to the global min/max so the camera doesn't jump between frames. - 2 inputs still produces a static scatter (no animation overhead). Considered ``facet_col`` but it gets cramped past ~4 versions — the slider scales to arbitrary length. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… |Δ| Two small but high-value tweaks for the multi-snapshot scatter: - The baseline snapshot now contributes its own animation frame where every point sits at ratio=1, Δ=0. Gives the animation a "before anything happened" anchor: hit play and watch points drift from the baseline horizon outward. Previously the first frame was the first candidate, which made the visual feel as if it started mid-story. - ``range_color`` is pinned to the 95th-percentile absolute Δ (±p95). One huge outlier no longer drags the colour scale and flattens everyone else to white; outliers saturate at the bound, the rest of the distribution stays readable. Colour-bar label notes ``Δ (s, p95-clipped)`` so the convention is explicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "no change" line sits at y=1, but with asymmetric data (e.g. some 2x regression, no symmetric speedup), it landed near the bottom of the visible range and improvements got squeezed near the floor. Now: ``max_dist = max(|1 - y_lo|, |y_hi - 1|)`` and ``range_y = [1 - max_dist, 1 + max_dist]``. Pure min/max coverage (no clipping) but the window is symmetric around 1.0, so regressions above and improvements below are equally readable regardless of the data skew. The colour scale keeps the p95-clipped centred-at-0 treatment from the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
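The symmetric window formula above translates directly into a small helper. The function name ``symmetric_range`` is an assumption for illustration; only the formula comes from the commit message.

```python
def symmetric_range(y_lo: float, y_hi: float, center: float = 1.0) -> tuple[float, float]:
    """Smallest window centred on `center` that still covers [y_lo, y_hi]."""
    max_dist = max(abs(center - y_lo), abs(y_hi - center))
    return center - max_dist, center + max_dist


# Skewed data: one 2x regression, best improvement only 0.9x.
window = symmetric_range(0.9, 2.0)
```

For the skewed example the window becomes ``(0.0, 2.0)``, so the ``y=1`` reference line sits in the vertical middle of the plot instead of near the floor.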
…h warn
Three safe fixes from a code review of benchmarks/plotting.py:
- Row-height multiplier 14 → 22 in plot_compare and plot_sweep, with the floor bumped from 400 to 500. At 25+ tests the y-axis labels were colliding; now they breathe.
- plot_scaling reads ``params.size`` (the cleanly-stored int from parametrize) and only falls back to the id regex if absent. The ``model`` name still needs the regex because pytest-benchmark serializes our ModelSpec as ``UNSERIALIZABLE[ModelSpec(...)]``, so a full params switch isn't possible here, but the size path is now robust to test-id rename.
- plot_compare surfaces the mismatch between snapshots: prints a stderr line with the test counts only in A / only in B / common, and embeds the same as a subtitle in the figure. Silent intersection was the worst-case footgun.
Skipped (per review note): the default-view swap for 3+ snapshots (sweep → scatter) is a judgement call left for the user. Default output filename change (clobber on each run) also skipped; they want to decide whether per-view filenames are worth the API change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
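The "params first, regex fallback" lookup for the size could be sketched as follows. The helper name ``bench_size`` and the exact regex are assumptions; the ``name`` and ``params`` keys are the ones pytest-benchmark writes per benchmark entry in its JSON output.

```python
import re
from typing import Optional

# Fallback pattern for ids such as "test_build[basic-n=100]" (assumed id shape).
_SIZE_IN_ID = re.compile(r"n=(\d+)")


def bench_size(bench: dict) -> Optional[int]:
    """Prefer the stored `params.size`; fall back to parsing the test id."""
    params = bench.get("params") or {}
    size = params.get("size")
    if isinstance(size, int):
        return size
    match = _SIZE_IN_ID.search(bench.get("name", ""))
    return int(match.group(1)) if match else None


from_params = bench_size({
    "name": "test_build[renamed-id]",
    "params": {"size": 100, "model": "UNSERIALIZABLE[ModelSpec(...)]"},
})
from_regex = bench_size({"name": "test_build[basic-n=1600]", "params": None})
missing = bench_size({"name": "test_pypsa_carbon", "params": None})
```

The first call shows why the params path is robust to a test-id rename: the id no longer contains ``n=``, yet the size is still recovered.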
…view>.html
Two coupled changes setting up the notebook-embedding path:
- ``plot_compare`` / ``plot_scatter`` / ``plot_sweep`` / ``plot_scaling`` in ``benchmarks/plotting.py`` now return a ``plotly.graph_objects.Figure`` instead of writing to disk + returning a count. The CLI does the ``fig.write_html(output)`` step. ``benchmarks.plotting.n_points(fig)`` is exported as a helper so the CLI still emits a "N points → path" status line. This unblocks rendering plots directly in jupyter: call ``plot_compare(...)`` and Jupyter's display hook renders the Figure inline.
- Default ``-o`` for ``plot`` is now ``.benchmarks/plots/<view>.html`` (was ``benchmark-plot.html`` in cwd). Matches where snapshots already land (and is gitignored), and the per-view filename means consecutive runs of different views don't clobber each other.
Bonus: two ``numpy_array or fallback`` bugs in scatter (``df.abs().max() or 1e-9``) and the new ``n_points`` helper (``trace.x or trace.z``), both of which triggered ``ValueError: The truth value of an array with more than one element is ambiguous``. Replaced with explicit ``is None`` checks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ospection The ``n_points(fig)`` helper added in the previous commit walked ``fig.data`` traces and called ``len(trace.x)`` to recover the test count. That's backwards — the count is sitting right there in the source DataFrame at render time, no need to reach into the rendered plot. Renderers now return ``tuple[Figure, int]`` directly. ``len(df)`` for compare / sweep / scaling; ``df["test"].nunique()`` for scatter (rows are per-(test, version) so the raw len double-counts). n_points helper dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two refinements to the end-to-end plotting section: - tqdm wraps the subprocess loop that generates the two snapshots. Each ``--quick --phase build`` run takes ~10 s; tqdm makes the ~20 s wait visible. ``tqdm.auto`` auto-picks the notebook widget vs console bar based on context. - Plots are now rendered via ``python -m benchmarks plot --view <name>`` rather than direct ``plot_compare`` / ``plot_scatter`` imports. A small ``cli_plot(view, snapshots)`` helper runs the subprocess, reads the generated HTML, and inlines it via ``IPython.display.HTML``. Demonstrates the actual user-facing CLI path inside the notebook rather than the internal API. Notebook end-to-end runtime: ~37 s (~33 s for the run loop + plotting overhead). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r row
Four coupled fixes after --facets model produced empty bars across every facet:
1. Phase-aware short y-labels. Facet by phase → ``model-n=N``; facet by model → ``phase-n=N``. The facet header already encodes the other dimension.
2. Independent y-axes per facet (``matches=None``). Each facet's y-axis lists only its own categories, so there are no empty rows for tests that belong to other facets.
3. Shared y-tick labels per row via ``_hide_non_leftmost_yticks``. Hidden on every facet except the leftmost column of the wrap grid so labels appear once per row, not at every subplot's edge.
4. Per-facet height calculation. Without faceting we sized for total bar count; with independent per-facet category sets we now size for ``max-rows-in-any-facet × n-facet-rows``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small UX wins for plot consumers: - Default view for 2 snapshots is now ``scatter`` (was ``compare``). Each test sits at its own ``(baseline_time, ratio)`` coords with hover labels — no aggregation, so size-semantics mismatches across models (n=100 for ``basic`` vs ``pypsa_scigrid``) don't muddy the picture. ``compare`` (delta bars) is still one ``--view compare`` away. - Re-export ``load_long_df`` from ``benchmarks`` so callers can grab the tidy DataFrame in one line without importing the plotting module. ``df, unit = load_long_df([Path(...), Path(...)])``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… pins
Three pieces, one PR:
- ``memory sweep <versions>...`` mirrors the timing sweep but runs ``memray.Tracker`` measurements inside each per-version uv venv. Output lands at ``.benchmarks/memory/linopy-<ver>.json`` and ``plot`` auto-detects the ``peak_mib`` shape.
- Symmetric override flags so cross-version sweeps can use uniform measurement counts: ``--rounds N`` on ``sweep`` / ``run`` forces ``--benchmark-min-rounds=N --benchmark-max-time=0`` (default is pytest-benchmark's per-test auto-tuning), and ``--repeats N`` on ``memory sweep`` / ``memory save`` takes min-of-N peak per measurement (default 1; memory peaks are mostly deterministic).
- Single source of truth for sweep dep pins. Both sweeps now read the ``[benchmarks]`` extra from ``pyproject.toml`` at runtime via ``_benchmarks_extra_pins()`` instead of duplicating two short pin lists in ``cli.py``. Bumping the extra propagates automatically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-PR cachegrind-based regression detection plus a feedback loop that attributes upstream dep perf changes to specific Dependabot bumps: - New ``codspeed`` job in ``benchmark-smoke.yml``: runs ``pytest --quick --codspeed`` on every PR via the CodSpeedHQ action. Needs a ``CODSPEED_TOKEN`` repo secret to post comments; without it the job fails gracefully. - ``--quick`` now skips PyPSA end-to-end via a collection hook in ``conftest.py``. The PyPSA network is ~30s native; cachegrind would make it minutes, and the signal CodSpeed is meant to catch lives in the micro paths. - Pin tiering in ``[benchmarks]``: perf-relevant deps (``numpy``, ``scipy``, ``xarray``, ``pandas``, ``polars``, ``dask``, ``highspy``, ``netcdf4``) get individual ``==`` pins so each Dependabot bump produces one attributed CodSpeed delta. Tooling deps (``pytest`` & plugins, ``nbconvert``, ``typer``, ``plotly``) are also pinned but grouped in ``dependabot.yml`` so they batch into a single PR. - Loose ``[project.dependencies]`` stays untouched — downstream linopy consumers keep their existing resolve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
It went so well that we can consider wiring up https://codspeed.io/. I would like to test it.
- Drop ``benchmarks/requirements.lock`` and the ``--use-lock/--no-use-lock``
toggle on ``sweep`` and ``memory sweep``. The ``[benchmarks]`` extra in
pyproject already pins every measurement-relevant direct dep
(numpy/scipy/xarray/pandas/polars/dask/highspy/…) — uv resolves the same
pyproject input deterministically into each per-version venv, so "same
deps, only linopy varies" comes for free without maintaining a separate
lockfile.
- Relocate the walkthrough out of ``benchmarks/notebooks/`` to
``benchmarks/walkthrough.md`` (next to README), and delete the now-empty
notebooks/ directory along with the obsolete ``registry_usage.ipynb``.
- Replace ``jupyter nbconvert`` with Jupytext for the walkthrough. The
``notebook`` subcommand now executes the ``.md`` directly; ``--build``
regenerates a gitignored sibling ``walkthrough.ipynb`` for editor-agnostic
viewing (one-way conversion — no bidirectional pairing, so PyCharm/VSCode
work the same as JupyterLab). Add ``jupytext==1.17.4`` to the
``[benchmarks]`` extra.
- Condense ``benchmarks/README.md`` from 128 → 45 lines: scope/install/lockfile
rationale + walkthrough launch only. Phase coverage, CLI surfaces, metric
rationale, memory commands, and extending guide are now load-bearing in
the walkthrough.
- ``list --details`` and ``show`` now use ``typer.secho`` for dim headers /
cyan spec names / dim attribute labels (auto-strips when piped — ``list |
grep`` stays clean).
- Drop the "microbenchmark" framing where it overreached: the suite has
millisecond-to-second-scale tests, not microbenchmarks. Rephrase the
scatter-quadrant prose ("cheap tests with big ratio swings — noise, not
real change") in plotting.py, cli.py, and the README metric rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end walkthrough of the benchmarks CLI: registry introspection (``list --details``, ``show``, ``filter``), a two-snapshot regression workflow (``run --quick --phase build`` → ``compare`` table → ``plot --view scatter`` / ``--view compare`` rendered inline via ``IPython.display.HTML``), peak-RSS snapshots (``memory save`` / ``memory compare``), an "other CLI surfaces" reference table, and the "add a new model" three-step recipe. The file is the load-bearing documentation for the suite — README only covers install and how to open it. CI executes it on every PR via ``python -m benchmarks notebook`` so the examples can't silently rot. Contributors regenerate the gitignored ``walkthrough.ipynb`` sibling via ``python -m benchmarks notebook --build`` and open it in JupyterLab, PyCharm, or VSCode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous pin (1.17.4) was a guess; the resolved version in the dev environment is 1.19.3. Align pyproject so per-version sweep venvs install the same jupytext the local dev env uses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a ``--smoke`` flag to ``sweep`` that runs the same pytest invocation as the top-level ``smoke`` command in each per-version venv: every model/phase fires once at the quickest size, no timings, ~10–20 s per version. Useful before bumping a perf-sensitive pin like ``numpy`` to confirm every linopy version we'd sweep against still installs, imports, and exercises the suite cleanly. Surfaces real binary-compat issues (e.g. a ``netcdf4`` wheel mismatched against the pinned ``numpy``) that declared-constraint resolution can't catch on its own. The smoke pytest args are now a shared ``_SMOKE_PYTEST_ARGS`` constant so the top-level command and ``sweep --smoke`` stay in sync — single source for the definition of "smoke." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Delete the module-level ``__iter__`` in registry.py. A module-level ``__iter__`` is never used by iteration (``for x in module`` looks the method up on the module type, not in the module namespace), so it was dead code. Drops the unused ``Iterator`` import too.
- Rename ``memory.DEFAULT_PHASES`` → ``MEMORY_PHASES``. It collided with ``registry.DEFAULT_PHASES`` (frozenset of 9 phase tags): same name, different shape, both imported elsewhere. Footgun for the next reader.
- Rewrite the stale comment above the ``[benchmarks]`` pyproject extra. It said "Not pinned here: numpy / scipy / pandas / xarray" directly above the lines pinning exactly those four, and referenced the ``requirements.lock`` we just deleted. Replace with the actual story: every measurement-relevant direct dep is pinned; sweep installs the same set into each per-version venv.
- Rename the CI smoke step from "Execute registry-usage notebook" to "Execute walkthrough notebook" to match the file the ``notebook`` command now executes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
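The dead-code claim about a module-level ``__iter__`` can be checked with a throwaway module object. The module name ``fake_registry`` and its contents are made up for the demonstration.

```python
import types

# Build a module that defines __iter__ in its own namespace.
mod = types.ModuleType("fake_registry")
mod.__iter__ = lambda: iter(["basic", "milp"])

# `for` looks up __iter__ on type(mod), i.e. on ModuleType, which has none.
try:
    for _ in mod:
        pass
    iterable = True
except TypeError:
    iterable = False

# The function is still reachable as an ordinary attribute, just never
# picked up by the iteration protocol.
explicit = list(mod.__iter__())
```

So the function only ever runs if someone calls it by name, which is why deleting it changes nothing for ``for spec in ...`` style code.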
…orts
Two related dead-surface trims flagged in code review:
- ``SOLVER_BUILD`` was an aspirational phase tag ("generic
Solver.from_name(..., io_api='direct')") that was declared, exported,
listed in ``DEFAULT_PHASES``, and added to ``sos``'s phase set — but
never wired into any test. ``test_solver_handoff.py`` only exercises
``TO_HIGHSPY``/``GUROBIPY``/``MOSEK``/``XPRESS``. Remove it from
registry.py, ``__init__.py`` exports, and the sos spec. If we ever
want the generic ``from_name(..., io_api='direct')`` path measured,
it can come back as a real phase with a real test.
- ``benchmarks/models/__init__.py`` was re-exporting 20 names
(``BASIC_SIZES``, ``build_basic``, …) that nothing outside the
``models`` package referenced. The documented access path is
``REGISTRY["<name>"]``; the only thing ``__init__.py`` needs to do
is trigger each submodule's ``register(...)`` side-effect. Collapse
to a single ``from benchmarks.models import basic, …`` import block.
Adding a new model is now one new file plus one line in this block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract the *what each phase does to a model* logic from the pytest test files and ``memory.py`` into a single ``benchmarks/phases.py`` module. Both drivers now import the same verbs (pytest wraps them in ``benchmark(...)``, memray wraps them in ``Tracker(...)``), so the measured operation is defined once.

The drift risk this targets is silent: if ``test_matrices`` grows a new matrix attribute and ``memory.py``'s inline copy doesn't, the timing/memory snapshots end up measuring different operations and ``plot`` shows non-overlapping sets, no error. Likewise the solver list (currently 4 wrappers) was duplicated between ``test_solver_handoff.py`` and a hardcoded ``to_highspy`` in memory.

Touchpoints:
- ``benchmarks/phases.py`` (new): ``touch_matrices``, ``write_lp`` (with ``progress=False`` pinned here), ``write_netcdf``, ``read_netcdf`` re-export, and a ``SOLVER_HANDOFFS`` tuple of ``(solver_name, registry_phase_tag, wrapper)``.
- ``test_matrices.py``: drops the local ``_access_matrices``.
- ``test_lp_write.py``: uses ``write_lp`` (pin lives in one place now).
- ``test_netcdf.py``: uses ``write_netcdf`` + ``read_netcdf`` from phases.
- ``test_solver_handoff.py``: ``_SOLVER_PHASES`` becomes ``SOLVER_HANDOFFS`` from phases; ``_make_params`` loop unchanged.
- ``memory.py``: inline matrices/lp_write/netcdf bodies replaced with ``touch_matrices`` / ``write_lp`` / ``write_netcdf`` / ``read_netcdf`` from phases. The solver-handoff branch now looks up ``"highs"`` by name in ``SOLVER_HANDOFFS`` rather than ``[0]``, so reordering the tuple no longer silently swaps which solver gets measured.

The id seam (memory.py's hand-rolled ``f"...::test_X[name-n=size]"`` strings vs pytest's collected node ids) is intentionally not abstracted: the netcdf double-emit and the ``highs-`` solver prefix make a shared id generator more framework than it's worth. Instead, a new ``benchmarks/test_memory_id_alignment.py`` exercises both sides for one cheap spec and asserts every memory-emitted id is in pytest's collection. A test rename now fails this guard immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``sweep`` and ``memory sweep`` had ~80 lines of near-identical per-version
plumbing: uv availability check, ``=== linopy X ===`` banner, tempdir +
uv venv creation, install pass with the same args, ``PYTHONPATH`` setup,
and parallel failure-reason printing. Two copies that would drift the
moment one side gained anything (a wheel-cache flag, a constraint pin, a
different stderr handler).
Extract a ``_provision_venvs(versions, tmp_prefix)`` generator that
yields one ``_ProvisionedVenv`` record per version. On success, the
record carries ``python`` + ``env``; on failure, ``failed_at`` names the
step that broke ("venv" or "install") and the caller skips its
per-version action. Each tempdir cleanup happens when the generator
advances, so ``break``-ing out of the caller's loop still tears down
cleanly via the generator close protocol.
After the extraction:
- ``sweep`` shrinks from ~135 lines of venv plumbing + action to ~70
lines of just the action (smoke pytest invocation vs full
``--benchmark-only`` invocation + snapshot check).
- ``memory sweep`` shrinks similarly — only the ``memory save``
invocation and the snapshot-relocation bookkeeping remain.
- Future sweep flavours get the venv plumbing for free.
No user-facing behaviour change; the failure messages and the banner
output are identical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
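The teardown-on-advance behaviour described above can be sketched as follows. This is a minimal illustration under assumptions: ``ProvisionedVenv`` and ``provision_venvs`` here are simplified stand-ins for the private helpers, and the uv venv creation and install steps are left out. The caller closes the generator explicitly so cleanup does not depend on garbage-collection timing.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Iterator, Optional, Sequence


@dataclass
class ProvisionedVenv:  # hypothetical, simplified record
    version: str
    root: str
    failed_at: Optional[str] = None  # e.g. "venv" or "install" on failure


def provision_venvs(versions: Sequence[str], tmp_prefix: str = "sweep-") -> Iterator[ProvisionedVenv]:
    for version in versions:
        with tempfile.TemporaryDirectory(prefix=tmp_prefix) as root:
            # Real code would create the venv and install into it here.
            yield ProvisionedVenv(version=version, root=root)
        # The with-block exits (and the tempdir is removed) when the
        # generator is advanced past the yield or closed.


seen = []
gen = provision_venvs(["0.6.7", "0.7.0"])
for prov in gen:
    seen.append(prov.root)
    break  # leave early, with the first tempdir still alive
gen.close()  # raises GeneratorExit at the yield, which unwinds the with-block
```

After ``gen.close()`` the first tempdir no longer exists, even though the loop never reached the second version.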
The previous pin was on the last numpy 1.x release; 2.x has been the default for ~2 years now and is what most linopy users actually run. Pinning benchmarks to 1.26.4 meant the suite was measuring a code path nobody hits anymore. Verified safe via ``sweep --smoke`` across the realistic sweep set (0.5.8, 0.6.0, 0.6.7, 0.7.0) — every linopy version installs, imports, and exercises the suite (every model build / phase fire) cleanly against numpy 2.4.6. The pre-existing ``netcdf4`` binary-incompat warning (``numpy.ndarray size changed``) is unchanged by this bump — it's a wheel-vs-ABI mismatch from ``netcdf4==1.7.4`` that's present under both numpy 1.26.4 and 2.4.6, doesn't fail any test, and is a separate concern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical sweep --smoke verification: - linopy <0.5.1 declares ``numpy<2.0`` in its metadata, so any numpy 2.x ``==`` pin in our ``[benchmarks]`` extra makes uv refuse to resolve those versions (install fails before any code runs). - Relaxing to ``numpy<2.0`` lets the older versions install and run. Verified: 0.3.15, 0.4.0, 0.4.4, 0.5.0, 0.7.0 all pass ``sweep --smoke`` under the relaxed pin. uv resolves ``numpy<2.0`` to 1.26.4 on every current platform (the last numpy 1.x release; numpy is done with the 1.x line), so the practical reproducibility property of "every per-version venv gets the same numpy" is preserved despite the looser-looking constraint. Reverts the bump in 7d3e474. We'll go back to a 2.x ``==`` pin once we drop pre-0.5.1 from sweep coverage — at which point ``sweep --smoke`` is the right tool to re-verify, same way it found this floor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Honour the ``==`` pin convention used by every other measurement-relevant dep in ``[benchmarks]``. ``numpy<2.0`` in the previous commit gave the same practical result (uv resolves to 1.26.4) but broke the "every direct dep is pinned exactly" property the surrounding pins rely on for reproducibility. Empirically verified ``sweep --smoke`` still covers the full 0.3.x → 0.7.0 range under the exact pin. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…HANDOFFS
Two coupled bugs that were making ``sweep`` produce meaningless
cross-version timings.
**Bug 1: silent linopy shadowing.** ``_provision_venvs`` ran the
per-version pytest with the cwd inherited from the user's shell —
typically the repo root, which contains a ``linopy/`` package (the
one we're developing). Python prepends cwd to ``sys.path`` as
``''``, so ``import linopy`` resolved to the dev tree rather than the
venv's installed version. Every sweep run was measuring dev linopy
against itself; the per-version timings were noise on the same code.
(Previously the function also set ``PYTHONPATH=repo_root`` for
``import benchmarks``, which independently caused the same shadowing
even with a different cwd.)
Fix: create an isolated import root per version — a fresh tempdir
containing only a symlink ``benchmarks → repo_root/benchmarks``. The
sweep callers now run subprocesses with ``cwd=import_dir`` and no
``PYTHONPATH``. ``import benchmarks`` resolves via the symlink;
``import linopy`` falls through to site-packages → the requested
version. Added ``import_dir`` to ``_ProvisionedVenv`` and threaded it
through both ``sweep`` and ``memory sweep`` call sites (memory
discovery now looks under ``import_dir/.benchmarks/memory`` for the
``memory save`` output before moving it to ``output_dir``).
**Bug 2: SOLVER_HANDOFFS eagerly imports linopy.io.to_xpress, which
doesn't exist in any released linopy.** With shadowing in effect we
never noticed; after the isolation fix, even ``sweep --smoke 0.7.0``
fails collection because ``lio.to_xpress`` is an AttributeError.
Fix: build ``SOLVER_HANDOFFS`` via ``getattr(lio, name, None)`` and
filter out wrappers that aren't present in the installed linopy. The
tuple shape stays the same; older versions silently drop solvers
they don't support. ``memory.py``'s ``next("highs", ...)`` lookup
defaults to ``None`` and skips the solver_handoff memory phase rather
than emitting an unmatchable test id.
Consequence the user should expect: ``sweep --smoke`` against older
linopy versions now surfaces real install / runtime / API
incompatibilities rather than passing silently. Versions whose
metadata installs cleanly but whose code imports fail under our
pinned ``xarray`` / etc. will report ``smoke failed`` — that's the
correct signal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
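The ``getattr`` filter and the by-name lookup from the second fix could look roughly like this. It is a sketch under assumptions: ``lio`` is a ``SimpleNamespace`` standing in for ``linopy.io`` on a release without ``to_xpress``, and the phase-tag strings are placeholders for the real registry constants.

```python
from types import SimpleNamespace

# Stand-in for linopy.io on a release that lacks to_mosek and to_xpress.
lio = SimpleNamespace(
    to_highspy=lambda model: "highs-handle",
    to_gurobipy=lambda model: "gurobi-handle",
)

# (solver_name, registry_phase_tag, attribute name); tags are placeholders.
_CANDIDATES = (
    ("highs", "TO_HIGHSPY", "to_highspy"),
    ("gurobi", "TO_GUROBIPY", "to_gurobipy"),
    ("mosek", "TO_MOSEK", "to_mosek"),
    ("xpress", "TO_XPRESS", "to_xpress"),
)

handoffs = []
for solver, tag, attr in _CANDIDATES:
    wrapper = getattr(lio, attr, None)  # None when this linopy lacks the wrapper
    if wrapper is not None:
        handoffs.append((solver, tag, wrapper))
SOLVER_HANDOFFS = tuple(handoffs)

# Look up by name, not by position, and default to None so the caller can skip.
highs = next((h for h in SOLVER_HANDOFFS if h[0] == "highs"), None)
xpress = next((h for h in SOLVER_HANDOFFS if h[0] == "xpress"), None)
```

A caller that gets ``None`` back skips the phase instead of emitting an id that no collected test matches.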
xarray 2025.3.0 moved ``xarray.core.rolling`` to a different location. linopy <=0.5.0 imports it directly, so the previous ``xarray==2025.9.0`` pin made any sweep against those releases fail at ``import linopy`` — real signal that was masked while sweep was silently running the dev linopy instead. Pin to the last release before the rename (2025.1.2). Coverage now extends down to 0.4.4 cleanly. The realistic floor is 0.4.4 — 0.4.0's ``to_file`` lacks the ``progress`` kwarg, and reaching back further would need version-specific shims that aren't worth maintaining. Verified: ``sweep --smoke 0.4.4 0.5.0 0.5.8 0.6.7 0.7.0 -k basic`` all green; local ``smoke`` still passes on the dev linopy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 0.2.0
linopy added the ``progress`` kwarg to ``Model.to_file`` in 0.4.1. The suite's ``write_lp`` verb passes ``progress=False`` to keep the progress-bar overhead out of the measurement, which means anything older than 0.4.1 raised ``TypeError`` and failed sweep smoke.
Check once at import time (``inspect.signature``) whether the kwarg is present; if not, fall back to calling ``to_file`` without it. Branchless on the hot path: the check resolves once when phases.py loads.
Empirically extends sweep coverage from 0.4.4 down to 0.2.0 with no other changes, which puts roughly three years of historical releases in scope. 0.1.x has further API drift (``add_variables`` signature) and 0.0.x has pre-pyproject metadata that uv can't install; both are out of scope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
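A minimal sketch of the import-time signature check, assuming two stand-in functions in place of old and new ``Model.to_file`` (the real verb lives in ``benchmarks/phases.py`` and is not reproduced here):

```python
import inspect


def to_file_old(path):  # stand-in for a to_file without the progress kwarg
    return ("old", path)


def to_file_new(path, progress=True):  # stand-in for a to_file that has it
    return ("new", path, progress)


def make_write_lp(to_file):
    """Pick the call shape once, so the timed call itself has no branch."""
    if "progress" in inspect.signature(to_file).parameters:
        return lambda path: to_file(path, progress=False)
    return lambda path: to_file(path)


write_lp_old = make_write_lp(to_file_old)
write_lp_new = make_write_lp(to_file_new)
```

The signature is inspected once per ``make_write_lp`` call; the returned callable is what a benchmark would time.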
Direct pins in ``[benchmarks]`` keep results reproducible *within* one sweep call, but unpinned transitive deps can drift between sweep calls days apart — a delta could then come from a numpy patch release rather than the linopy change you wanted to attribute it to. Add an ``--as-of <DATE>`` flag to both ``sweep`` and ``memory sweep`` that passes ``--exclude-newer`` to uv. The entire transitive resolution is frozen to releases on or before the date; running the same sweep set + the same ``--as-of`` value at any later point reproduces the same dep tree (modulo PyPI yanking). Plumbed through ``_provision_venvs(as_of=...)`` so both call sites stay single-source. Default is unchanged — no ``--as-of`` ⇒ latest resolution, matching prior behaviour. Empirically verified: ``--as-of 2026-05-01`` correctly rejects the install when ``pytest-codspeed==5.0.3`` (released later) is in the pin set; ``--as-of 2026-05-29`` resolves cleanly and ``sweep --smoke`` passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
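How ``--as-of`` might be threaded into the install command is sketched below. The helper name and argument layout are assumptions, not the suite's actual code; the only uv-specific pieces relied on are ``uv pip install --python`` and ``--exclude-newer``.

```python
from typing import List, Optional, Sequence


def uv_install_args(python: str, requirements: Sequence[str], as_of: Optional[str] = None) -> List[str]:
    """Build a uv install command; hypothetical helper for illustration."""
    args = ["uv", "pip", "install", "--python", python]
    if as_of is not None:
        # Freeze the whole resolution to releases on or before this date.
        args += ["--exclude-newer", as_of]
    return args + list(requirements)


default_cmd = uv_install_args("/tmp/venv/bin/python", ["linopy==0.7.0"])
frozen_cmd = uv_install_args("/tmp/venv/bin/python", ["linopy==0.7.0"], as_of="2026-05-29")
```

Leaving ``as_of`` unset produces the same command as before the flag existed, which matches the unchanged default.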
Two follow-ups to the symlink-isolation fix in e7f9c5b:
1. Preflight check per provisioned venv. After symlinking benchmarks/ into the import_dir, run a tiny ``python -c`` that imports linopy from the cwd we're about to use and asserts ``linopy.__file__`` is under the venv's prefix. If a future change reintroduces the dev-linopy shadow (PYTHONPATH=repo, missing PYTHONDONTWRITEBYTECODE side-effect, pytest import-mode bump, …), this fails loudly with "isolation leak: linopy resolved to <path>, not the venv" rather than silently corrupting every snapshot in the sweep. New ``failed_at`` value ``"isolation"`` lets callers record this the same way they already record venv/install failures.
2. ``PYTHONDONTWRITEBYTECODE=1`` in the subprocess env. The symlink resolves to the real benchmarks/ source tree, so every sweep subprocess would otherwise write fresh ``.pyc`` files into the user's working tree. That is harmless (Python is held constant so the bytecode is valid), but it mutates the checkout and would risk write contention if sweep ever becomes parallel. One env var keeps each run pure.
Verified: shadowing simulated by re-setting ``PYTHONPATH=repo`` is now caught by the preflight with the expected assertion message; happy path ``sweep 0.7.0 --smoke`` still passes; ``benchmarks/__pycache__`` is untouched after a sweep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
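The core of the preflight assertion can be sketched as a path check. ``check_isolation`` is a hypothetical helper; in the described setup the module path would come from a ``python -c`` subprocess run in the per-version cwd, which is omitted here.

```python
from pathlib import Path


def check_isolation(module_file: str, venv_prefix: str) -> None:
    """Raise if the imported package does not live under the venv prefix."""
    resolved = Path(module_file).resolve()
    prefix = Path(venv_prefix).resolve()
    if prefix not in resolved.parents:
        raise AssertionError(
            f"isolation leak: linopy resolved to {resolved}, not the venv"
        )


# Happy path: the package sits inside the venv's site-packages.
check_isolation("/tmp/v/lib/python3.12/site-packages/linopy/__init__.py", "/tmp/v")

# Shadowed path: the package resolves to a checkout outside the venv.
try:
    check_isolation("/home/dev/linopy/linopy/__init__.py", "/tmp/v")
    leaked = False
except AssertionError as exc:
    leaked = True
    message = str(exc)
```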
The per-version isolation root held a symlink to repo_root/benchmarks. Replace it with a filtered copy so the sweep runs on Windows (no symlink privilege needed) and no per-version subprocess — including its __pycache__ writes — can touch the working tree. Drops the now-redundant PYTHONDONTWRITEBYTECODE: with a copy, bytecode lands in the throwaway tempdir, so the working tree is structurally untouchable rather than protected by an env var. ignore_patterns skips the executed notebook and cruft to keep the copy cheap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
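A filtered copy of this kind can be built with ``shutil.copytree`` and ``shutil.ignore_patterns``. The sketch below uses an assumed ignore list and a throwaway source tree; the commit does not show the real patterns.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed ignore list: bytecode, notebooks and prior snapshot output.
_IGNORE = shutil.ignore_patterns("__pycache__", "*.pyc", "*.ipynb", ".benchmarks")


def make_import_root(benchmarks_src: Path, import_dir: Path) -> Path:
    """Copy the benchmarks package into an isolated import root."""
    dest = Path(import_dir) / "benchmarks"
    shutil.copytree(benchmarks_src, dest, ignore=_IGNORE)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny fake source tree to copy from.
    src = Path(tmp) / "repo" / "benchmarks"
    (src / "__pycache__").mkdir(parents=True)
    (src / "__init__.py").write_text("")
    (src / "__pycache__" / "x.pyc").write_text("")
    (src / "walkthrough.ipynb").write_text("{}")

    import_dir = Path(tmp) / "import_root"
    import_dir.mkdir()
    dest = make_import_root(src, import_dir)
    copied = sorted(p.name for p in dest.iterdir())
```

Because the copy lives in a tempdir, anything a subprocess writes next to the sources disappears with it.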
`benchmarks.bench` times or memory-profiles any callable in-process on the current tree — a registry builder, a phase verb on a hand-built model, a one-off lambda — and returns a result that round-trips through the existing snapshot/plot machinery (`load_long_df`). Three entry points (`time`, `memory`, `compare`) plus `TimingResult` / `MemoryResult` / `ResultSet` with `to_snapshot` / `to_df` / rich Jupyter reprs. To make the memray peak-measurer reusable, memory.py no longer raises at import on Windows: the check moves into a `_require_memray()` called by each measuring entry point, and `_measure_peak` is promoted to public `measure_peak` (back-compat alias kept). bench reuses it for the memory path. Adds a "Benchmarking custom things" section to walkthrough.md (executes end-to-end under the CI notebook run) and re-exports `bench` from the package. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Annotate the benchmark modules so `mypy benchmarks/*.py benchmarks/models/*.py` passes (28 files, 0 errors). Covers the real type gaps — `SPEC: ModelSpec | None` in the conditionally-registered sos / piecewise models, narrowing `prov.import_dir` past the `failed_at` guard in `memory sweep`, and return/arg annotations on the plotting helpers — plus `-> None` / fixture-arg annotations across the phase tests and conftest. benchmarks/* stays in the mypy `exclude`, so this isn't enforced in CI; it just makes an explicit `mypy benchmarks/...` run come back clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two related cleanups after noticing bench had quietly re-derived the snapshot contract and a thinner pytest-benchmark. Extract a dependency-free benchmarks/snapshot.py that owns the format: the two on-disk JSON shapes (write_timing_snapshot / write_memory_snapshot / load_snapshot), the test-id grammar (parse_test_id / synth_test_id), and load_long_df. plotting, bench, and memory now all depend on it instead of bench reaching sideways into plotting's private _parse_test_id and three writers hand-rolling the same JSON. plotting shrinks to "plotly views over a long DataFrame"; memory.save and bench.to_snapshot share one writer; __init__'s load_long_df re-export drops its plotly-pulling path. Rebase bench.time on timeit.Timer.autorange: calibrate an inner iteration count so timer resolution stops dominating fast callables (the old one-call-per-round loop was unstable in exactly that regime, and min is the headline stat), then sample per-iteration time across rounds. Records stats["iterations"]. Still explicitly not interchangeable with suite numbers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
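The calibrate-then-sample approach maps onto the standard library roughly as below. ``time_callable`` is a hypothetical function, not ``bench.time`` itself; it uses ``timeit.Timer.autorange`` to pick an inner iteration count and ``Timer.repeat`` to collect rounds.

```python
import timeit


def time_callable(fn, rounds: int = 5) -> dict:
    """Calibrate an inner loop count, then sample per-iteration time."""
    timer = timeit.Timer(fn)
    # autorange grows the loop count until one batch takes at least 0.2 s,
    # so timer resolution stops dominating fast callables.
    iterations, _ = timer.autorange()
    totals = timer.repeat(repeat=rounds, number=iterations)
    per_iteration = [total / iterations for total in totals]
    return {
        "min": min(per_iteration),
        "iterations": iterations,
        "rounds": rounds,
    }


stats = time_callable(lambda: sum(range(100)), rounds=3)
```

Reporting ``min`` over rounds follows the commit's note that min is the headline stat.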
Move the ~340-line per-version provisioning block and both sweep bodies into a new benchmarks/sweep.py (provision_venvs + run_sweep / run_memory_sweep). cli.py's `sweep` and `memory sweep` commands become thin shims that resolve their options (phase -> test file, smoke args) and delegate. No behavior change — command set, flags, and help text are identical; verified with a live one-version smoke sweep. Per the plan's Item B but adapted in two ways: - The shared-discovery helper goes into the existing snapshot.py as discover_snapshots() rather than a new snapshots.py — a sibling module one plural away from snapshot.py would be a nasty import footgun. _suggest_snapshots (typer-coupled presentation) stays in cli.py and calls it. - run_memory_sweep moves too (not just run_sweep), so the provisioning generator stays private to sweep.py instead of being imported across a module boundary; all three memory subcommands are now thin. cli.py: 1218 -> 882 lines (the remainder is command signatures + help). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It duplicated the intro's pointer to `--help` and was the only hand-maintained, unverified block in an otherwise all-executed walkthrough. Discovery already routes through `python -m benchmarks --help` (and `--help` on any subcommand), per the intro. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an "In Python — load straight from file" subsection to the diff section: load the baseline/candidate snapshots with load_long_df, then pivot to a one-column-per-snapshot DataFrame with a candidate/baseline ratio. Demonstrates the programmatic path the CLI views sit on, for custom analysis from file. Executes end-to-end under the CI notebook run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dogfooding: does this suite actually help during dev?
Short version: yes. I pointed it at a real, currently-unreleased linopy change: the CSR /
What happened
Results
Representative numbers (Apple M-series, highspy 1.12, the sparse many-term
The magnitude tracks terms-per-row: dense few-term carbon-management ≈ 2.9×,
Reproduce
sweep named snapshots linopy-<version>.json with the raw version arg interpolated. Fine for plain releases, but a git/file spec (git+...@<sha>, linopy @ file://...) put slashes in the filename and the snapshot write failed. Add _snapshot_label: for a spec with an @-ref take the part after the last @ (sha/tag/branch), then sanitise to a safe path segment. So git+...@<sha> -> linopy-<sha>.json (clean and reproducible); plain releases are unchanged. Applied to both sweep and memory sweep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
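A label function along these lines could look like the sketch below. The sanitising rule (collapse anything outside ``[A-Za-z0-9._-]`` to ``-``) and the ``"unknown"`` fallback are assumptions; the commit only states that the part after the last ``@`` is taken and made path-safe.

```python
import re


def snapshot_label(spec: str) -> str:
    """Derive a filename-safe label from a version or install spec."""
    # For specs with an @-ref, keep only the ref (sha, tag, branch or URL).
    ref = spec.rsplit("@", 1)[-1] if "@" in spec else spec
    # Assumed rule: replace runs of unsafe characters with a single dash.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", ref).strip("-.")
    return safe or "unknown"


plain = snapshot_label("0.7.0")
from_git = snapshot_label("git+https://example.invalid/linopy.git@abc1234")
from_file = snapshot_label("linopy @ file:///tmp/linopy")
```

Plain release strings pass through unchanged, so existing ``linopy-<version>.json`` names stay stable.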
Overhaul of the internal benchmark suite
Brings `benchmarks/` from a hand-wired pytest collection into a `typer`-driven suite with cross-version sweeps, per-PR CodSpeed CI, and plotly views.
Core pieces
- `from benchmarks import REGISTRY` exposes 10 self-registering specs (LP / QP / MILP / SOS / piecewise / sparse networks / PyPSA carbon) with `build(size)` + feature & phase tags. Adding a model is one file.
- `build`, `matrices`, `lp_write`, `netcdf` round-trip, `solver_handoff` (highs/gurobi/mosek/xpress). Plus PyPSA end-to-end.
- `python -m benchmarks --help` covers run / smoke / sweep / compare / plot / notebook / memory {save,sweep,compare}.
- `sweep 0.6.7 0.7.0` builds one uv venv per version; same flow for memory via `memray.Tracker` (model construction excluded from the peak).
- `plot <snapshots>` picks `scaling` (1), `scatter` (2, default), or `sweep` heatmap (3+). `--facets {phase, model}` splits across subplots. Auto-detects timing vs memory JSON.
- attributed regression detection on every PR.
- `[benchmarks]` pins perf-relevant deps individually (numpy/scipy/xarray/pandas/polars/dask/highspy/netcdf4); tooling deps grouped into one monthly PR. Each perf-relevant bump → one PR → attributed CodSpeed delta.
- `benchmarks/notebooks/registry_usage.ipynb`, executed in CI so examples can't rot.
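The view selection described for `plot <snapshots>` amounts to a dispatch on the number of snapshots. A minimal sketch, with `pick_view` as a hypothetical name rather than the CLI's actual function:

```python
def pick_view(n_snapshots: int) -> str:
    """Choose a default plot view from the number of snapshots given."""
    if n_snapshots < 1:
        raise ValueError("need at least one snapshot")
    if n_snapshots == 1:
        return "scaling"  # one snapshot: how cost grows with model size
    if n_snapshots == 2:
        return "scatter"  # two snapshots: baseline vs candidate
    return "sweep"  # three or more: heatmap across versions


views = [pick_view(n) for n in (1, 2, 3, 5)]
```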
Sample output: 0.6.7 → 0.7.0
Default scatter view: each test at `(baseline_time, ratio)`; top-right corner = real regressions (slow tests that got slower):
Same comparison, memory dimension (peak RSS via `memray.Tracker`):
Per-phase delta breakdown (`compare --facets phase`):
Going forward
- drop `CODSPEED_TOKEN` into Actions secrets. Until then the CodSpeed job fails but smoke + notebook execution still run.
- `benchmarks/models/`, register it. Phase coverage and CLI surface follow automatically.
- `test_<phase>.py` parametrized off the registry.
- `[benchmarks]` is the perf-attribution surface; `[project.dependencies]` stays loose so downstream consumers see no change.