ci: populate baselines and gate benchmarks at +20% #91
Open
clean6378-max-it wants to merge 10 commits into master from ci/benchmark-regression-gate
12bf70c ci: populate baselines and gate benchmarks at +20% (clean6378-max-it)
b9d0707 fix: set benchmark baselines from ubuntu-latest CI means (clean6378-max-it)
2482951 fix: harden benchmark gate scripts with input validation (clean6378-max-it)
b4bf963 fix: harden benchmark gate per PR #91 review (clean6378-max-it)
6a6cb6e fix: harden benchmark scripts against filesystem and type errors (clean6378-max-it)
5d816d5 fix: allow reduce_baselines to run as a direct script (clean6378-max-it)
001c5f0 fix: exclude noisy micro-benchmarks from regression gate (clean6378-max-it)
0f55400 fix: ruff format check_benchmark_regression.py (clean6378-max-it)
bdce5ae fix: enforce EXCLUDED_FROM_GATE at check time and harden file reads (clean6378-max-it)
a79866f fix: enforce EXCLUDED_FROM_GATE at check time and harden file reads (clean6378-max-it)
@@ -14,3 +14,5 @@ node_modules/
 .coverage
 coverage/
 coverage.xml
+benchmark-results.json
+benchmarks/_raw.json
@@ -0,0 +1,12 @@
+.PHONY: update-baselines check-benchmarks clean-benchmark-artifacts
+
+update-baselines:
+	pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmarks/_raw.json -o addopts=
+	python scripts/reduce_baselines.py benchmarks/_raw.json benchmarks/baselines.json
+
+check-benchmarks:
+	pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmark-results.json -o addopts=
+	python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json
+
+clean-benchmark-artifacts:
+	rm -f benchmarks/_raw.json benchmark-results.json
@@ -1,10 +1,17 @@
 {
-  "_note": "Informational snapshot only — CI does not gate on these values.",
-  "updated": null,
-  "machine": null,
+  "_note": "Gated means from ubuntu-latest CI benchmark-results.json (post-cache PR #90). Excluded from gate: test_parse_session_small, test_search_full_corpus (sub-ms CI noise). Refresh via make update-baselines on ubuntu.",
+  "updated": "2026-06-17T21:00:00Z",
+  "machine": "Linux",
   "groups": {
-    "parse": {},
-    "export": {},
+    "parse": {
+      "test_parse_session_medium": 0.002956,
+      "test_parse_session_large": 0.029678
+    },
+    "export": {
+      "test_bulk_export_session_count[sessions-10]": 0.004278,
+      "test_bulk_export_session_count[sessions-50]": 0.021144,
+      "test_bulk_export_session_count[sessions-100]": 0.042003
+    },
     "search": {}
   }
 }
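To make the +20% gate concrete, here is a minimal sketch of the limit each stored baseline implies. The means are copied from the baselines above and `THRESHOLD` mirrors the constant in the check script; the helper `gate_limits` is hypothetical and not part of this PR.

```python
# Illustrative only: `gate_limits` is a hypothetical helper, not code from this PR.
THRESHOLD = 1.20  # same ratio the check script uses by default

# Baseline means in seconds, copied from the "groups" object above.
BASELINES = {
    "parse": {
        "test_parse_session_medium": 0.002956,
        "test_parse_session_large": 0.029678,
    },
    "export": {
        "test_bulk_export_session_count[sessions-10]": 0.004278,
        "test_bulk_export_session_count[sessions-50]": 0.021144,
        "test_bulk_export_session_count[sessions-100]": 0.042003,
    },
    "search": {},
}


def gate_limits(groups, threshold=THRESHOLD):
    """Map each benchmark name to the slowest mean (seconds) that still passes."""
    return {
        name: mean * threshold
        for group in groups.values()
        for name, mean in group.items()
    }


limits = gate_limits(BASELINES)
for name, limit in limits.items():
    print(f"{name}: fails above {limit * 1000:.3f} ms")
```

With these numbers, `test_parse_session_medium` would only fail once its mean exceeds roughly 3.55 ms, which is why the sub-millisecond benchmarks were left out of the gate rather than given a wider margin.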
@@ -0,0 +1,154 @@
+"""Compare pytest-benchmark JSON output against stored baselines."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+THRESHOLD = 1.20
+
+# Sub-ms timings are too noisy for a fixed 20% gate on ubuntu CI.
+EXCLUDED_FROM_GATE = frozenset(
+    {
+        "test_parse_session_small",
+        "test_search_full_corpus",
+    }
+)
+
+
+class BenchmarkDataError(ValueError):
+    """Raised when benchmark JSON input is malformed or missing required fields."""
+
+
+def load_results(results_path: str | Path) -> dict[str, float]:
+    path = Path(results_path)
+    try:
+        data = json.loads(path.read_text(encoding="utf-8"))
+    except OSError as exc:
+        raise BenchmarkDataError(f"cannot read {path}: {exc}") from exc
+    except json.JSONDecodeError as exc:
+        raise BenchmarkDataError(f"invalid JSON in {path}: {exc}") from exc
+    try:
+        benchmarks = data["benchmarks"]
+    except (KeyError, TypeError) as exc:
+        raise BenchmarkDataError(f"{path} missing top-level 'benchmarks' array") from exc
+    if not isinstance(benchmarks, list):
+        raise BenchmarkDataError(f"{path} 'benchmarks' must be an array")
+
+    results: dict[str, float] = {}
+    for index, entry in enumerate(benchmarks):
+        if not isinstance(entry, dict):
+            raise BenchmarkDataError(f"{path} benchmarks[{index}] must be an object")
+        try:
+            name = entry["name"]
+            mean = float(entry["stats"]["mean"])
+        except (KeyError, TypeError, ValueError) as exc:
+            raise BenchmarkDataError(
+                f"{path} benchmarks[{index}] missing 'name' or 'stats.mean'"
+            ) from exc
+        name = str(name)
+        if name in results:
+            raise BenchmarkDataError(f"{path} duplicate benchmark name {name!r}")
+        results[name] = mean
+    return results
+
+
+def load_baseline_means(baselines_path: str | Path) -> dict[str, float]:
+    path = Path(baselines_path)
+    try:
+        data = json.loads(path.read_text(encoding="utf-8"))
+    except OSError as exc:
+        raise BenchmarkDataError(f"cannot read {path}: {exc}") from exc
+    except json.JSONDecodeError as exc:
+        raise BenchmarkDataError(f"invalid JSON in {path}: {exc}") from exc
+    if not isinstance(data, dict):
+        raise BenchmarkDataError(f"{path} root value must be an object")
+
+    if "groups" not in data:
+        raise BenchmarkDataError(f"{path} missing required 'groups' key")
+    groups = data["groups"]
+    if not isinstance(groups, dict):
+        raise BenchmarkDataError(f"{path} 'groups' must be an object")
+
+    means: dict[str, float] = {}
+    for group_name, value in groups.items():
+        if not isinstance(value, dict):
+            continue
+        for name, mean in value.items():
+            name = str(name)
+            if name in means:
+                raise BenchmarkDataError(f"{path} duplicate benchmark name {name!r} across groups")
+            try:
+                means[name] = float(mean)
+            except (TypeError, ValueError) as exc:
+                raise BenchmarkDataError(
+                    f"{path} groups[{group_name!r}][{name!r}] is not a numeric mean"
+                ) from exc
+    return means
+
+
+def check_regression(
+    results_path: str | Path,
+    baselines_path: str | Path,
+    *,
+    threshold: float = THRESHOLD,
+) -> int:
+    """Return 0 when within threshold; 1 when any gated benchmark regresses."""
+    flat = load_results(results_path)
+    baseline_means = load_baseline_means(baselines_path)
+
+    failures: list[str] = []
+    for name, base in baseline_means.items():
+        if name in EXCLUDED_FROM_GATE:
+            continue
+        cur = flat.get(name)
+        if cur is None:
+            print(f"WARN: no current result for baseline {name!r}; skipping")
+            continue
+        if base == 0:
+            print(f"WARN: baseline for {name!r} is zero; skipping ratio check")
+            continue
+        ratio = cur / base
+        tag = "FAIL" if ratio > threshold else "ok"
+        print(f"[{tag}] {name}: {cur:.6f}s vs {base:.6f}s ({ratio:.2f}x)")
+        if ratio > threshold:
+            failures.append(name)
+
+    for name in flat:
+        if name in EXCLUDED_FROM_GATE:
+            continue
+        if name not in baseline_means:
+            print(f"WARN: {name!r} has no baseline yet; not gated")
+
+    if failures:
+        print(f"\nREGRESSION: {len(failures)} benchmark(s) exceeded {threshold:.0%}")
+        return 1
+    return 0
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("results_path", help="pytest-benchmark --benchmark-json output")
+    parser.add_argument("baselines_path", help="path to benchmarks/baselines.json")
+    parser.add_argument(
+        "--threshold",
+        type=float,
+        default=THRESHOLD,
+        help="fail when current mean exceeds baseline by more than this ratio (default: 1.20)",
+    )
+    args = parser.parse_args(argv)
+    try:
+        return check_regression(
+            args.results_path,
+            args.baselines_path,
+            threshold=args.threshold,
+        )
+    except BenchmarkDataError as exc:
+        print(f"ERROR: {exc}", file=sys.stderr)
+        return 2
+
+
+if __name__ == "__main__":
+    sys.exit(main())
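For reviewers who want to see the gate's behavior without running pytest, the comparison loop in `check_regression` can be approximated in memory. This is a simplified sketch, not the PR's code: `find_regressions` is a hypothetical name, it drops the file loading and the WARN/FAIL printing, and the "current" timings are made up for illustration.

```python
# Simplified, in-memory approximation of the comparison loop in check_regression.
# `find_regressions` is a hypothetical helper; the current timings are invented.
def find_regressions(current, baseline, threshold=1.20, excluded=frozenset()):
    """Return the names whose current mean exceeds baseline * threshold."""
    failures = []
    for name, base in baseline.items():
        if name in excluded:
            continue  # excluded benchmarks are never gated
        cur = current.get(name)
        if cur is None or base == 0:
            continue  # the real script warns here and skips the ratio check
        if cur / base > threshold:
            failures.append(name)
    return failures


baseline = {
    "test_parse_session_medium": 0.002956,
    "test_parse_session_large": 0.029678,
}
current = {
    "test_parse_session_medium": 0.003100,  # about 1.05x, within the gate
    "test_parse_session_large": 0.040000,  # about 1.35x, over the gate
    "test_new_benchmark": 0.001000,  # no baseline, so not gated
}

failures = find_regressions(current, baseline)
print(failures)
```

Note that the comparison is strict (`>`), so a benchmark sitting exactly at the threshold ratio passes, and a benchmark with no stored baseline can never fail the gate until baselines are refreshed.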
@@ -0,0 +1,104 @@
+"""Reduce pytest-benchmark JSON into benchmarks/baselines.json."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from datetime import UTC, datetime
+from pathlib import Path
+
+try:
+    from scripts.check_benchmark_regression import EXCLUDED_FROM_GATE, BenchmarkDataError
+except ModuleNotFoundError:
+    from check_benchmark_regression import EXCLUDED_FROM_GATE, BenchmarkDataError
+
+GATED_GROUPS = ("parse", "export", "search")
+
+
+def _positive_float(value: str) -> float:
+    parsed = float(value)
+    if parsed <= 0:
+        raise argparse.ArgumentTypeError("slack must be greater than zero")
+    return parsed
+
+
+def reduce_baselines(
+    raw_path: str | Path,
+    out_path: str | Path,
+    *,
+    slack: float = 1.0,
+) -> dict[str, object]:
+    path = Path(raw_path)
+    try:
+        raw = json.loads(path.read_text(encoding="utf-8"))
+    except json.JSONDecodeError as exc:
+        raise BenchmarkDataError(f"invalid JSON in {path}: {exc}") from exc
+    except OSError as exc:
+        raise BenchmarkDataError(f"cannot read {path}: {exc}") from exc
+
+    try:
+        entries = raw["benchmarks"]
+    except (KeyError, TypeError) as exc:
+        raise BenchmarkDataError(f"{path} missing top-level 'benchmarks' array") from exc
+    if not isinstance(entries, list):
+        raise BenchmarkDataError(f"{path} 'benchmarks' must be an array")
+
+    groups: dict[str, dict[str, float]] = {group: {} for group in GATED_GROUPS}
+    for index, entry in enumerate(entries):
+        if not isinstance(entry, dict):
+            raise BenchmarkDataError(f"{path} benchmarks[{index}] must be an object")
+        try:
+            name = entry["name"]
+            mean = float(entry["stats"]["mean"])
+        except (KeyError, TypeError, ValueError) as exc:
+            raise BenchmarkDataError(
+                f"{path} benchmarks[{index}] missing 'name' or 'stats.mean'"
+            ) from exc
+        group = entry.get("group")
+        if group not in GATED_GROUPS:
+            continue
+        if str(name) in EXCLUDED_FROM_GATE:
+            continue
+        groups[group][str(name)] = mean * slack
+
+    machine_info = raw.get("machine_info")
+    machine = machine_info.get("system") if isinstance(machine_info, dict) else None
+    output: dict[str, object] = {
+        "_note": (
+            "Gated means from ubuntu-latest CI (post-cache). "
+            "Excluded from gate: test_parse_session_small, test_search_full_corpus (CI noise)."
+        ),
+        "updated": datetime.now(UTC).strftime("%Y-%m-%dT%H:%M:%SZ"),
+        "machine": machine,
+        "groups": groups,
+    }
+    out = Path(out_path)
+    try:
+        out.write_text(json.dumps(output, indent=2) + "\n", encoding="utf-8")
+    except OSError as exc:
+        raise BenchmarkDataError(f"cannot write {out}: {exc}") from exc
+    return output
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("raw_path", help="pytest-benchmark --benchmark-json output")
+    parser.add_argument("out_path", help="destination baselines.json path")
+    parser.add_argument(
+        "--slack",
+        type=_positive_float,
+        default=1.0,
+        help="multiply means by this factor (must be > 0)",
+    )
+    args = parser.parse_args(argv)
+    try:
+        reduce_baselines(args.raw_path, args.out_path, slack=args.slack)
+    except BenchmarkDataError as exc:
+        print(f"ERROR: {exc}", file=sys.stderr)
+        return 2
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
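The grouping step in `reduce_baselines` may be easier to review against a tiny input. The sketch below reproduces only that step in memory, assuming the entry shape the script reads (`name`, `group`, `stats.mean`). `reduce_entries` is a hypothetical helper, the sample entries and the `misc` group are invented, and no file I/O or validation is performed.

```python
# In-memory approximation of the grouping loop in reduce_baselines.
# `reduce_entries` and the sample entries are hypothetical, for illustration only.
GATED_GROUPS = ("parse", "export", "search")
EXCLUDED_FROM_GATE = frozenset({"test_parse_session_small", "test_search_full_corpus"})


def reduce_entries(entries, slack=1.0):
    """Group benchmark means by gated group, skipping excluded names."""
    groups = {group: {} for group in GATED_GROUPS}
    for entry in entries:
        name = str(entry["name"])
        group = entry.get("group")
        if group not in GATED_GROUPS or name in EXCLUDED_FROM_GATE:
            continue
        groups[group][name] = float(entry["stats"]["mean"]) * slack
    return groups


raw = {
    "benchmarks": [
        {"name": "test_parse_session_small", "group": "parse", "stats": {"mean": 0.0004}},
        {"name": "test_parse_session_medium", "group": "parse", "stats": {"mean": 0.003}},
        {"name": "test_unrelated", "group": "misc", "stats": {"mean": 0.5}},
    ]
}

groups = reduce_entries(raw["benchmarks"], slack=2.0)
print(groups)
```

Every gated group appears in the output even when empty, which matches the `"search": {}` entry in the committed baselines. A `slack` above 1.0 inflates the stored means, so it stacks multiplicatively with the check-time threshold.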