Skip to content

ci: populate baselines and gate benchmarks at +20%#91

Open
clean6378-max-it wants to merge 10 commits into
masterfrom
ci/benchmark-regression-gate
Open

ci: populate baselines and gate benchmarks at +20%#91
clean6378-max-it wants to merge 10 commits into
masterfrom
ci/benchmark-regression-gate

Conversation

@clean6378-max-it

@clean6378-max-it clean6378-max-it commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Closes #83

Summary

Adds a +20% benchmark regression gate to CI, populates benchmarks/baselines.json from a post-cache run, and documents how to refresh baselines.

Sprint item #6 (3 pt) — Wednesday PR 2 of 2. Depends on PR 1 (#90 session cache) so baselines reflect cached-path performance.

Problem

The CI benchmarks job ran pytest tests/benchmarks/ --benchmark-only and uploaded artifacts, but was labeled informational with no threshold. benchmarks/baselines.json had empty groups, so parse/export/search regressions passed silently.

Changes

scripts/check_benchmark_regression.py (new)

  • Compares pytest-benchmark JSON means against benchmarks/baselines.json
  • Fails when current_mean / baseline_mean > 1.20
  • Missing baseline for a new benchmark name: warn, do not fail
  • Missing current result for a baseline entry: warn and skip

scripts/reduce_baselines.py (new)

  • Reduces --benchmark-json output into baselines.json (parse / export / search groups only)
  • Optional --slack factor for cross-machine capture (e.g. faster local host vs ubuntu CI)

benchmarks/baselines.json

  • Populated with 7 gated benchmark means (3 parse, 3 export, 1 search)
  • Captured post-cache; cache group benchmarks are informational only (not gated)

CI (.github/workflows/ci.yml)

  • Job renamed: Performance benchmarks (informational)Performance benchmarks (gated)
  • New step after benchmark run:
    python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json
  • Gate runs on ubuntu-latest only

Makefile (new)

  • make update-baselines — run benchmarks and regenerate baselines.json
  • make check-benchmarks — local gate check

Tests

  • tests/test_check_benchmark_regression.py — missing-baseline warn (pass), >20% regression (fail), within threshold (pass)

Docs / hygiene

  • benchmarks/README.md — gate behavior and baseline refresh instructions
  • .gitignorebenchmark-results.json, benchmarks/_raw.json

Acceptance criteria

  • benchmarks/baselines.json populated from post-cache run
  • CI fails on injected >20% regression; passes when green
  • Missing-baseline case warns without failing (tested)
  • make update-baselines regenerates baselines
  • Job renamed to gated label

Test plan

  • pytest -q — 388 passed
  • pytest tests/test_check_benchmark_regression.py -q
  • python scripts/check_benchmark_regression.py benchmarks/_raw.json benchmarks/baselines.json
  • ruff check .

Notes

  • Baselines were captured locally with --slack 1.35 for ubuntu headroom. If the gated CI job fails on first run due to runner variance, refresh on ubuntu via make update-baselines and recommit.
  • Branch protection admins may want to mark the gated benchmarks job as a required check.

Summary by CodeRabbit

  • New Features
    • Added a CI performance regression gate for selected benchmarks, failing builds when gated averages exceed stored baselines beyond the allowed threshold.
  • Chores
    • Introduced benchmark tooling to update baselines, run regression checks locally, and clean generated benchmark artifacts.
    • Updated CI to run regression checks before uploading benchmark artifacts.
  • Documentation
    • Refreshed benchmark README with gate rules and baseline refresh instructions.
  • Tests
    • Added coverage for baseline reduction and regression-gate behavior, including warnings and error handling.
  • Style
    • Updated ignore rules for generated benchmark output files.

Add check_benchmark_regression.py and reduce_baselines.py; populate
benchmarks/baselines.json from post-cache run; wire regression gate into
ubuntu benchmarks job; add Makefile update-baselines target and tests.
@clean6378-max-it clean6378-max-it self-assigned this Jun 17, 2026
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a benchmark regression gate to CI by introducing two new scripts (check_benchmark_regression.py and reduce_baselines.py) that implement ratio-based regression detection and baseline generation. benchmarks/baselines.json is populated with measured values from a post-cache Ubuntu run. The CI workflow gains a gated regression-check step, a Makefile provides update-baselines and check-benchmarks targets, and .gitignore and benchmarks/README.md are updated accordingly.

Changes

Benchmark Regression Gate

Layer / File(s) Summary
Regression check script implementation
scripts/check_benchmark_regression.py
Defines THRESHOLD = 1.20 and introduces BenchmarkDataError exception for malformed JSON. load_results() parses the benchmarks array with strict validation of name and stats.mean fields, converting means to floats. load_baseline_means() extracts baselines from a groups object and flattens them into name -> baseline_mean mappings. check_regression() computes ratio = current_mean / baseline_mean, prints [ok]/[FAIL] per benchmark, skips ratio checks when baseline is zero, warns (exit 0) for missing baselines or ungated benchmarks, and exits 1 if any gated ratio exceeds threshold. CLI entrypoint main() wires argparse-driven execution and converts BenchmarkDataError to stderr output with exit code 2.
Regression check tests
tests/test_check_benchmark_regression.py
Comprehensive pytest coverage including helpers for writing benchmark result and baseline JSON fixtures. Tests verify check_regression return codes and stdout messages for missing-baseline warnings (exit 0), regressions exceeding 20% (exit 1), within-threshold results (exit 0), zero-baseline skip behavior, exactly-at-threshold success, and missing current results. Validation tests confirm malformed JSON and missing benchmarks array raise BenchmarkDataError with expected patterns, and CLI error reporting exits with code 2 and stderr output.
Baseline reduction script
scripts/reduce_baselines.py
Defines GATED_GROUPS = ("parse", "export", "search") for filtering benchmarks. Introduces _positive_float validator to enforce --slack > 0. reduce_baselines() reads raw pytest-benchmark JSON, validates structure (requiring benchmarks array with name and stats.mean), filters entries to gated groups, scales extracted means by slack factor, and writes baselines.json containing _note, ISO-UTC updated timestamp, machine (from machine_info.system), and groups mapping. CLI main() wires argument parsing, handles BenchmarkDataError with exit code 2, and returns 0 on success.
Baseline reduction tests
tests/test_reduce_baselines.py
Pytest tests validating reduce_baselines() behavior: filters output to only gated groups while excluding non-gated entries, scales baseline means by the slack parameter, rejects input JSON missing the benchmarks array, and CLI rejects non-positive --slack values with SystemExit.
CI integration and baseline data
.github/workflows/ci.yml, Makefile, .gitignore, benchmarks/README.md, benchmarks/baselines.json
CI benchmarks job step renamed from "informational" to "gated"; new "Regression gate" step runs check_benchmark_regression.py on benchmark-results.json and benchmarks/baselines.json before artifact upload. Makefile adds update-baselines target (pytest --benchmark-only to benchmarks/_raw.json then reduce_baselines.py) and check-benchmarks target (pytest --benchmark-only to benchmark-results.json then check_benchmark_regression.py); includes clean-benchmark-artifacts to remove generated files. .gitignore excludes benchmark-results.json and benchmarks/_raw.json. README documents the +20% gate on ubuntu-latest, warning-only behavior for benchmarks without baselines, and make update-baselines refresh workflow with guidance that baselines must be captured on ubuntu-latest for consistency. baselines.json populated with measured parse/export/search timings, corrected machine = "Linux", and updated = "2026-06-17T21:00:00Z" metadata.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • cppalliance/claude-code-chat-browser#76: Introduced the CI benchmarks job with --benchmark-only and artifact upload using the same benchmark-results.json file that this PR's regression gate consumes.

Suggested reviewers

  • timon0305
  • wpak-ai

Poem

🐇 Hoppity-hop through the benchmark land,
Where baselines are measured and gates firmly stand.
If your mean drifts too high — past twenty percent —
The CI cries FAIL and refuses consent.
But new benchmarks just warn with a friendly "take care,"
A gentle-eared rabbit watches performance with flair! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.35% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'ci: populate baselines and gate benchmarks at +20%' directly and specifically describes the main change—adding a +20% regression gate and populating baseline data in CI.
Linked Issues check ✅ Passed The PR implements all core requirements from issue #83: populates baselines, gates on +20% threshold with graceful missing-baseline handling, documents via README and Makefile, renames CI job, gates ubuntu-latest only, and includes unit tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to benchmark gating infrastructure: scripts, CI workflow, baseline data, documentation, testing, and developer tooling—with no unrelated modifications to session cache or other features.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ci/benchmark-regression-gate

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
scripts/check_benchmark_regression.py (2)

12-27: ⚡ Quick win

Consider adding defensive error handling for malformed input.

Both load_results and load_baseline_means assume well-formed JSON with the expected schema (e.g., data["benchmarks"], entry["stats"]["mean"]). While pytest-benchmark should always produce correct output, missing keys or malformed JSON would raise unhelpful KeyError or TypeError exceptions in CI.

Adding try/except blocks with actionable error messages would improve debuggability when investigating CI failures.

🛡️ Proposed defensive improvements
 def load_results(results_path: str | Path) -> dict[str, float]:
-    data = json.loads(Path(results_path).read_text(encoding="utf-8"))
-    return {entry["name"]: float(entry["stats"]["mean"]) for entry in data["benchmarks"]}
+    try:
+        data = json.loads(Path(results_path).read_text(encoding="utf-8"))
+        return {entry["name"]: float(entry["stats"]["mean"]) for entry in data["benchmarks"]}
+    except (KeyError, TypeError) as e:
+        print(f"ERROR: malformed results JSON at {results_path}: {e}", file=sys.stderr)
+        raise

Similar pattern for load_baseline_means.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` around lines 12 - 27, Both
load_results and load_baseline_means functions lack error handling for malformed
JSON input and missing expected keys. Add try/except blocks around the JSON
parsing and data access logic in both functions to catch KeyError and TypeError
exceptions. When these exceptions occur, raise a new exception with an
actionable error message that includes the file path and details about which
expected key or structure was missing. This will provide meaningful error
context instead of unhelpful KeyError exceptions when CI receives malformed
JSON.

45-45: ⚡ Quick win

Add guard against division by zero.

If a baseline mean is 0, line 45 will raise ZeroDivisionError. While benchmarks with zero mean are implausible, defensive validation would prevent cryptic CI failures.

🛡️ Proposed guard
+        if base == 0:
+            print(f"WARN: baseline {name!r} has zero mean; skipping")
+            continue
         ratio = cur / base
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` at line 45, The ratio calculation at
the line where ratio is assigned as cur divided by base lacks protection against
division by zero. Add a guard condition to check if the base value is zero
before performing the division operation. If base is zero, handle it
appropriately by either skipping the ratio calculation, assigning a special
value, or logging a warning. This defensive check will prevent ZeroDivisionError
exceptions even though zero baseline means are unlikely in practice.
scripts/reduce_baselines.py (1)

44-49: ⚡ Quick win

Validate --slack parameter bounds.

The --slack argument accepts any float, including negative or zero values, which would produce incorrect baseline means. Adding validation would prevent user errors.

♻️ Proposed validation
     parser.add_argument(
         "--slack",
         type=float,
         default=1.0,
         help="multiply means by this factor (e.g. 1.25 when capturing on a faster host)",
     )
     args = parser.parse_args(argv)
+    if args.slack <= 0:
+        parser.error("--slack must be positive")
     reduce_baselines(args.raw_path, args.out_path, slack=args.slack)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/reduce_baselines.py` around lines 44 - 49, The `--slack` argument
parser definition lacks validation to ensure the parameter value is valid for
use as a multiplier. Add a constraint to the `--slack` argument definition to
validate that the value is greater than zero, rejecting negative or zero values
that would produce incorrect baseline means. You can use either the `choices`
parameter with a custom validation function or leverage the `type` parameter
with a callable validator to enforce this constraint.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/baselines.json`:
- Line 4: The "machine" field in benchmarks/baselines.json indicates the
baselines were captured on Windows, but the CI gates require ubuntu-latest
baselines to prevent cross-OS timing variance from causing false
positives/negatives. Recapture the baselines on an ubuntu environment by either
running "make update-baselines" after the ubuntu environment is available, or
manually running pytest with "--benchmark-only
--benchmark-json=benchmarks/_raw.json" flags followed by executing the
scripts/reduce_baselines.py script to generate the correct ubuntu-based
baselines.json file.

---

Nitpick comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 12-27: Both load_results and load_baseline_means functions lack
error handling for malformed JSON input and missing expected keys. Add
try/except blocks around the JSON parsing and data access logic in both
functions to catch KeyError and TypeError exceptions. When these exceptions
occur, raise a new exception with an actionable error message that includes the
file path and details about which expected key or structure was missing. This
will provide meaningful error context instead of unhelpful KeyError exceptions
when CI receives malformed JSON.
- Line 45: The ratio calculation at the line where ratio is assigned as cur
divided by base lacks protection against division by zero. Add a guard condition
to check if the base value is zero before performing the division operation. If
base is zero, handle it appropriately by either skipping the ratio calculation,
assigning a special value, or logging a warning. This defensive check will
prevent ZeroDivisionError exceptions even though zero baseline means are
unlikely in practice.

In `@scripts/reduce_baselines.py`:
- Around line 44-49: The `--slack` argument parser definition lacks validation
to ensure the parameter value is valid for use as a multiplier. Add a constraint
to the `--slack` argument definition to validate that the value is greater than
zero, rejecting negative or zero values that would produce incorrect baseline
means. You can use either the `choices` parameter with a custom validation
function or leverage the `type` parameter with a callable validator to enforce
this constraint.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 575f2d48-af8a-4cd9-9998-d03f8ce771b6

📥 Commits

Reviewing files that changed from the base of the PR and between 047efe8 and 12bf70c.

📒 Files selected for processing (8)
  • .github/workflows/ci.yml
  • .gitignore
  • Makefile
  • benchmarks/README.md
  • benchmarks/baselines.json
  • scripts/check_benchmark_regression.py
  • scripts/reduce_baselines.py
  • tests/test_check_benchmark_regression.py

Comment thread benchmarks/baselines.json Outdated
Raise BenchmarkDataError for malformed JSON or missing keys; skip zero
baselines safely; reject non-positive --slack values.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/test_check_benchmark_regression.py (1)

65-77: ⚡ Quick win

Add an explicit boundary test for exactly +20% (ratio == 1.20).

Current coverage checks above and below threshold, but not the exact cutoff. Since gating is defined as current / baseline > 1.20, add a case (e.g., 0.00012 vs 0.0001) asserting exit code 0 to lock in boundary semantics.

Also applies to: 80-107

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_check_benchmark_regression.py` around lines 65 - 77, The current
test coverage for the check_regression function does not include an explicit
test for the exact boundary condition where the performance ratio equals 1.20
(the threshold). Add a new test function that verifies the exact boundary case,
such as using a current value of 0.00012 against a baseline of 0.0001, which
produces exactly a 1.20 ratio, and assert that check_regression returns 0
(passes). This locks in the boundary semantics and ensures the gating logic
defined as current / baseline > 1.20 is correct at the exact cutoff point.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/test_check_benchmark_regression.py`:
- Around line 65-77: The current test coverage for the check_regression function
does not include an explicit test for the exact boundary condition where the
performance ratio equals 1.20 (the threshold). Add a new test function that
verifies the exact boundary case, such as using a current value of 0.00012
against a baseline of 0.0001, which produces exactly a 1.20 ratio, and assert
that check_regression returns 0 (passes). This locks in the boundary semantics
and ensures the gating logic defined as current / baseline > 1.20 is correct at
the exact cutoff point.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7390034-4231-4d61-b70c-0a328b1bbef3

📥 Commits

Reviewing files that changed from the base of the PR and between b9d0707 and 2482951.

📒 Files selected for processing (3)
  • scripts/check_benchmark_regression.py
  • scripts/reduce_baselines.py
  • tests/test_check_benchmark_regression.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/reduce_baselines.py
  • scripts/check_benchmark_regression.py

@clean6378-max-it

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit, please review this PR in more depth

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== scripts/check_benchmark_regression.py ==="
cat scripts/check_benchmark_regression.py

echo ""
echo "=== scripts/reduce_baselines.py ==="
cat scripts/reduce_baselines.py

echo ""
echo "=== tests/test_check_benchmark_regression.py ==="
cat tests/test_check_benchmark_regression.py

echo ""
echo "=== benchmarks/baselines.json ==="
cat benchmarks/baselines.json

echo ""
echo "=== Makefile ==="
cat Makefile

echo ""
echo "=== .github/workflows/ci.yml ==="
cat .github/workflows/ci.yml

echo ""
echo "=== benchmarks/README.md ==="
cat benchmarks/README.md

echo ""
echo "=== .gitignore (last 20 lines) ==="
tail -20 .gitignore

Length of output: 21437



@clean6378-max-it Sure! Here is a detailed review of PR #91.


Deep Review: PR #91 — Benchmark Regression Gate

🔴 Critical

1. ci.yml — Benchmark artifact lost on regression failure

The artifact upload step runs after the regression gate step. When the gate fails (exit 1), GitHub Actions skips subsequent steps by default, so benchmark-results.json is never uploaded — exactly when you need it most for debugging.

# .github/workflows/ci.yml
- name: Regression gate
  run: python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json

- uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
  if: always()   # ← add this
  with:
    name: benchmark-results
    path: benchmark-results.json

2. reduce_baselines.py — Unhandled KeyError on malformed input

raw["benchmarks"] is accessed without a try/except. A malformed JSON file (no "benchmarks" key, missing entry["name"], or missing entry["stats"]["mean"]) will produce a bare traceback rather than a diagnostic message:

# reduce_baselines.py — current
for entry in raw["benchmarks"]:
    group = entry.get("group")
    if group not in GATED_GROUPS:
        continue
    groups[group][entry["name"]] = float(entry["stats"]["mean"]) * slack

check_benchmark_regression.py wraps the same access in BenchmarkDataError; reduce_baselines.py should do the same:

try:
    entries = raw["benchmarks"]
except (KeyError, TypeError) as exc:
    raise BenchmarkDataError(f"raw file missing top-level 'benchmarks' array") from exc
for index, entry in enumerate(entries):
    try:
        name = entry["name"]
        mean = float(entry["stats"]["mean"])
    except (KeyError, TypeError) as exc:
        raise BenchmarkDataError(f"benchmarks[{index}] missing 'name' or 'stats.mean'") from exc
    ...

3. check_benchmark_regression.pyBenchmarkDataError uncaught in main()

main() calls check_regression() directly without catching BenchmarkDataError. If the JSON is malformed, the user gets an unformatted Python traceback instead of a clean error message:

def main(argv: list[str] | None = None) -> int:
    argv = sys.argv[1:] if argv is None else argv
    if len(argv) != 2:
        ...
        return 2
    try:
        return check_regression(argv[0], argv[1])
    except BenchmarkDataError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        return 2

🟡 Medium

4. load_baseline_means — Fragile fallback that silently discards data

The line groups = data.get("groups", data) falls back to the entire root dict when "groups" is absent. This means _note, updated, and machine are iterated as group names; they're skipped because their values aren't dicts, but it masks a real misconfiguration. A missing groups key should be an error:

# current — fragile
groups = data.get("groups", data)
if not isinstance(groups, dict):
    raise BenchmarkDataError(f"{path} missing 'groups' object")

# suggested — explicit
if "groups" not in data:
    raise BenchmarkDataError(f"{path} missing required 'groups' key")
groups = data["groups"]
if not isinstance(groups, dict):
    raise BenchmarkDataError(f"{path} 'groups' must be an object")

5. Missing tests for reduce_baselines.py

scripts/reduce_baselines.py has zero tests. Given it writes committed baseline data, it should have at minimum:

Scenario Expected
Valid raw JSON → correct baselines.json structure Pass
Missing "benchmarks" key BenchmarkDataError (once added)
--slack 0 or --slack -1 argparse.ArgumentTypeError
Non-gated group entries filtered out Absent from output
machine_info present → machine field populated Populated correctly

6. Missing test for the exact 1.20× boundary

The gate condition is ratio > threshold (strict), so a ratio of exactly 1.20 must pass. This is not tested:

def test_exactly_at_threshold_passes(tmp_path) -> None:
    results = tmp_path / "results.json"
    baselines = tmp_path / "baselines.json"
    _write_results(results, [{"name": "test_parse_session_small", "stats": {"mean": 0.000120}}])
    _write_baselines(baselines, {"parse": {"test_parse_session_small": 0.0001}})
    assert check_regression(results, baselines) == 0  # 1.20x is not > 1.20

7. Missing test for "no current result for baseline" warning path

test_missing_baseline_warns_without_failing tests the case where a result has no matching baseline (WARN path). The opposite case — baseline entry exists but result is absent — is never tested:

def test_missing_current_result_warns_without_failing(tmp_path, capsys) -> None:
    results = tmp_path / "results.json"
    baselines = tmp_path / "baselines.json"
    _write_results(results, [])  # empty
    _write_baselines(baselines, {"parse": {"test_parse_session_small": 0.0001}})
    assert check_regression(results, baselines) == 0
    assert "no current result for baseline" in capsys.readouterr().out

8. main() does not expose --threshold as a CLI argument

The THRESHOLD = 1.20 constant is not overridable from the command line. For a tool designed to be run in CI, this makes it harder to locally test different thresholds or temporarily loosen the gate for a branch:

parser.add_argument("--threshold", type=float, default=THRESHOLD,
                    help="regression ratio threshold (default: 1.20)")

🟢 Low / Informational

9. baselines.json — Timestamp appears hand-crafted

The "updated" field reads "2026-06-17T21:00:00Z" — a suspiciously round timestamp. reduce_baselines.py uses datetime.now(UTC) which produces sub-second precision. If this was captured by a real CI run, the timestamp should look like "2026-06-17T21:03:47Z". If the values were manually authored, they may not reflect actual post-cache performance on ubuntu-latest.

10. benchmarks/README.md — Misleading phrasing

"Do not capture baselines on Windows/macOS for commit — ubuntu runners are slower and the gate will fail."

"ubuntu runners are slower" contradicts the intent. The actual reason is that baselines must match the gated runner OS (ubuntu-latest). Consider:

"Baselines must be captured on ubuntu-latest to match the CI runner. Cross-OS variance will cause spurious failures."

11. Makefile — No clean target

After running make update-baselines or make check-benchmarks, benchmarks/_raw.json and benchmark-results.json are left on disk (both in .gitignore). A clean target would be ergonomic:

clean:
	rm -f benchmarks/_raw.json benchmark-results.json

Summary table

# Severity File Issue
1 🔴 ci.yml Artifact upload skipped when regression gate fails
2 🔴 reduce_baselines.py No error handling on malformed input
3 🔴 check_benchmark_regression.py BenchmarkDataError uncaught in main()
4 🟡 check_benchmark_regression.py load_baseline_means fallback silently discards data on missing groups key
5 🟡 tests/ No tests for reduce_baselines.py
6 🟡 tests/ No test for exact 1.20× boundary
7 🟡 tests/ No test for missing-current-result WARN path
8 🟡 check_benchmark_regression.py --threshold not exposed as CLI argument
9 🟢 baselines.json Timestamp looks hand-crafted, not from real CI run
10 🟢 benchmarks/README.md Misleading "ubuntu runners are slower" phrasing
11 🟢 Makefile No clean target

The most impactful fix to make immediately is #1 (if: always() on the artifact upload step) — losing the benchmark JSON on a regression failure defeats the purpose of the gate.

Upload benchmark-results.json even when the gate fails; add BenchmarkDataError
handling to reduce_baselines; require explicit groups key; expose --threshold
CLI flag; add reduce_baselines and boundary tests; clarify README.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
scripts/check_benchmark_regression.py (1)

34-41: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap stats.mean numeric conversion in BenchmarkDataError.

At Line 41, float(mean) can raise ValueError/TypeError and escape uncaught, causing a traceback instead of the CLI’s structured error path (ERROR: + exit code 2).

Suggested fix
         try:
             name = entry["name"]
             mean = entry["stats"]["mean"]
         except (KeyError, TypeError) as exc:
             raise BenchmarkDataError(
                 f"{path} benchmarks[{index}] missing 'name' or 'stats.mean'"
             ) from exc
-        results[str(name)] = float(mean)
+        try:
+            results[str(name)] = float(mean)
+        except (TypeError, ValueError) as exc:
+            raise BenchmarkDataError(
+                f"{path} benchmarks[{index}] has non-numeric 'stats.mean'"
+            ) from exc
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/check_benchmark_regression.py` around lines 34 - 41, The conversion
of mean to float using float(mean) on line 41 can raise ValueError or TypeError
exceptions that are not caught by the existing try-except block, causing
unhandled exceptions instead of properly formatted CLI errors. Move the results
assignment statement containing float(mean) inside the try block (before the
except clause) so that any conversion errors are caught and wrapped in
BenchmarkDataError along with the other data validation errors.
scripts/reduce_baselines.py (1)

58-63: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate machine_info shape before calling .get().

If machine_info is present but not an object, machine_info.get("system") raises AttributeError and bypasses your structured error handling.

Suggested fix
-    machine_info = raw.get("machine_info", {})
+    machine_info = raw.get("machine_info", {})
+    if machine_info is not None and not isinstance(machine_info, dict):
+        raise BenchmarkDataError(f"{path} 'machine_info' must be an object when present")
@@
-        "machine": machine_info.get("system"),
+        "machine": machine_info.get("system") if isinstance(machine_info, dict) else None,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/reduce_baselines.py` around lines 58 - 63, The code calls
machine_info.get("system") without validating that machine_info is actually a
dictionary, which will raise an AttributeError if machine_info is not a dict
object. Before calling .get() on machine_info, add a type check to ensure it is
a dictionary instance, and provide a safe fallback value (such as None) if the
validation fails. This will ensure the structured error handling is not bypassed
by unexpected data types in the machine_info field.
🧹 Nitpick comments (1)
tests/test_reduce_baselines.py (1)

77-78: ⚡ Quick win

Assert the SystemExit code for stricter CLI validation.

Line 77 currently accepts any SystemExit, which can hide unrelated parse failures. Assert exc.value.code == 2 so the test specifically verifies argparse rejection for non-positive --slack.

Proposed test tightening
-    with pytest.raises(SystemExit):
-        main([str(raw), str(tmp_path / "out.json"), "--slack", "0"])
+    with pytest.raises(SystemExit) as exc:
+        main([str(raw), str(tmp_path / "out.json"), "--slack", "0"])
+    assert exc.value.code == 2
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_reduce_baselines.py` around lines 77 - 78, The
pytest.raises(SystemExit) on the main function call with the `--slack` argument
currently accepts any SystemExit without validating the specific exit code.
Modify the context manager to capture the exception info using `as exc_info` and
then assert that `exc_info.value.code == 2` after the main function call
completes, ensuring the test strictly validates that argparse is rejecting the
non-positive slack value (exit code 2) rather than failing for unrelated
reasons.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/reduce_baselines.py`:
- Around line 29-33: The code catches json.JSONDecodeError but does not handle
OSError that can be raised by path.read_text() and path.write_text() methods,
allowing raw tracebacks to leak for common CLI failures like missing files or
permission issues. Wrap path.read_text() calls in a try-except block that
catches both json.JSONDecodeError and OSError, converting OSError to
BenchmarkDataError with the error details included in the message. Apply the
same pattern to all path.write_text() calls in the script (at lines 65-66 and
81-85) to ensure consistent error handling throughout and prevent
filesystem-related exceptions from propagating as raw tracebacks.

---

Outside diff comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 34-41: The conversion of mean to float using float(mean) on line
41 can raise ValueError or TypeError exceptions that are not caught by the
existing try-except block, causing unhandled exceptions instead of properly
formatted CLI errors. Move the results assignment statement containing
float(mean) inside the try block (before the except clause) so that any
conversion errors are caught and wrapped in BenchmarkDataError along with the
other data validation errors.

In `@scripts/reduce_baselines.py`:
- Around line 58-63: The code calls machine_info.get("system") without
validating that machine_info is actually a dictionary, which will raise an
AttributeError if machine_info is not a dict object. Before calling .get() on
machine_info, add a type check to ensure it is a dictionary instance, and
provide a safe fallback value (such as None) if the validation fails. This will
ensure the structured error handling is not bypassed by unexpected data types in
the machine_info field.

---

Nitpick comments:
In `@tests/test_reduce_baselines.py`:
- Around line 77-78: The pytest.raises(SystemExit) on the main function call
with the `--slack` argument currently accepts any SystemExit without validating
the specific exit code. Modify the context manager to capture the exception info
using `as exc_info` and then assert that `exc_info.value.code == 2` after the
main function call completes, ensuring the test strictly validates that argparse
is rejecting the non-positive slack value (exit code 2) rather than failing for
unrelated reasons.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e6f0d19f-683d-41e8-a1eb-2f97136d0bc1

📥 Commits

Reviewing files that changed from the base of the PR and between 2482951 and b4bf963.

📒 Files selected for processing (7)
  • .github/workflows/ci.yml
  • Makefile
  • benchmarks/README.md
  • scripts/check_benchmark_regression.py
  • scripts/reduce_baselines.py
  • tests/test_check_benchmark_regression.py
  • tests/test_reduce_baselines.py
✅ Files skipped from review due to trivial changes (1)
  • benchmarks/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • Makefile
  • .github/workflows/ci.yml

Comment thread scripts/reduce_baselines.py
Wrap reduce_baselines I/O in BenchmarkDataError; guard machine_info type;
move float(mean) into try block in load_results; assert argparse exit code 2.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/reduce_baselines.py (1)

11-11: ⚠️ Potential issue | 🟠 Major

Fix import mode mismatch at Line 11 to prevent startup crashes.

from scripts.check_benchmark_regression import BenchmarkDataError fails with ModuleNotFoundError: No module named 'scripts' when this file is run as python scripts/reduce_baselines.py. The scripts directory lacks an __init__.py file and the absolute import does not work in direct script execution mode, despite this being the documented invocation pattern in Makefile and benchmarks/README.md.

Suggested fix
-from scripts.check_benchmark_regression import BenchmarkDataError
+if __package__:
+    from .check_benchmark_regression import BenchmarkDataError
+else:
+    from check_benchmark_regression import BenchmarkDataError
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/reduce_baselines.py` at line 11, The absolute import of
BenchmarkDataError from scripts.check_benchmark_regression fails when
reduce_baselines.py is run directly as a script because the scripts directory is
not a Python package. Change the import statement from the absolute import `from
scripts.check_benchmark_regression import BenchmarkDataError` to a relative
import `from .check_benchmark_regression import BenchmarkDataError` since both
files are in the same scripts directory. This will allow the module to be
correctly resolved when the file is executed as a script.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@scripts/reduce_baselines.py`:
- Line 11: The absolute import of BenchmarkDataError from
scripts.check_benchmark_regression fails when reduce_baselines.py is run
directly as a script because the scripts directory is not a Python package.
Change the import statement from the absolute import `from
scripts.check_benchmark_regression import BenchmarkDataError` to a relative
import `from .check_benchmark_regression import BenchmarkDataError` since both
files are in the same scripts directory. This will allow the module to be
correctly resolved when the file is executed as a script.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f630845d-4f13-4730-8970-fe8278b98d3c

📥 Commits

Reviewing files that changed from the base of the PR and between b4bf963 and 6a6cb6e.

📒 Files selected for processing (3)
  • scripts/check_benchmark_regression.py
  • scripts/reduce_baselines.py
  • tests/test_reduce_baselines.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_reduce_baselines.py
  • scripts/check_benchmark_regression.py

Comment thread benchmarks/baselines.json Outdated
Comment thread benchmarks/baselines.json Outdated
Comment thread scripts/reduce_baselines.py
Comment thread scripts/check_benchmark_regression.py Outdated
Comment thread scripts/check_benchmark_regression.py
Comment thread .github/workflows/ci.yml
Drop test_parse_session_small and test_search_full_corpus from gated
baselines; clarify ubuntu CI provenance in _note; detect duplicate
benchmark names; expand reduce_baselines tests.
@clean6378-max-it clean6378-max-it force-pushed the ci/benchmark-regression-gate branch from d8ce6cf to 001c5f0 Compare June 17, 2026 23:30
@clean6378-max-it

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit, can you verify if @timon0305 's all comments are addressed?

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== .github/workflows/ci.yml ==="
cat .github/workflows/ci.yml

echo ""
echo "=== scripts/check_benchmark_regression.py ==="
cat scripts/check_benchmark_regression.py

echo ""
echo "=== scripts/reduce_baselines.py ==="
cat scripts/reduce_baselines.py

echo ""
echo "=== tests/test_check_benchmark_regression.py ==="
cat tests/test_check_benchmark_regression.py

echo ""
echo "=== tests/test_reduce_baselines.py ==="
cat tests/test_reduce_baselines.py

echo ""
echo "=== Makefile ==="
cat Makefile

echo ""
echo "=== benchmarks/README.md ==="
cat benchmarks/README.md

echo ""
echo "=== benchmarks/baselines.json ==="
cat benchmarks/baselines.json

Length of output: 30061



@clean6378-max-it Here is a full status check against all 11 issues from the earlier review:

# Severity Issue Status
1 🔴 ci.ymlif: always() on artifact upload ✅ Fixed
2 🔴 reduce_baselines.py — unhandled KeyError on malformed input ✅ Fixed
3 🔴 check_benchmark_regression.pyBenchmarkDataError uncaught in main() ✅ Fixed
4 🟡 load_baseline_means — fragile fallback when "groups" key absent ✅ Fixed
5 🟡 No tests for reduce_baselines.py ✅ Fixed (5 tests added in tests/test_reduce_baselines.py)
6 🟡 No test for exact 1.20× boundary ✅ Fixed (test_exactly_at_threshold_passes)
7 🟡 No test for "no current result for baseline" WARN path ✅ Fixed (test_missing_current_result_warns_without_failing)
8 🟡 --threshold not exposed as CLI argument ✅ Fixed (added to argparse in main())
9 🟢 baselines.json — timestamp "2026-06-17T21:00:00Z" appears hand-crafted ⚠️ Still present
10 🟢 benchmarks/README.md — misleading "ubuntu runners are slower" phrasing ✅ Fixed
11 🟢 Makefile — no clean target ✅ Fixed (clean-benchmark-artifacts target added)

Summary: All 3 critical 🔴 issues and all 5 medium 🟡 issues are fully addressed. 2 of 3 low/informational 🟢 issues are also resolved.

The only remaining item is Issue #9baselines.json still has "updated": "2026-06-17T21:00:00Z", which is a suspiciously round timestamp suggesting the file may not have been regenerated from a real CI run via make update-baselines. This is low severity and purely informational, but worth noting if you want the file to reflect a genuine post-cache capture.

@clean6378-max-it clean6378-max-it requested a review from wpak-ai June 18, 2026 06:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

claude-code-chat-browser: Benchmark regression gate in CI

2 participants