claude-code-chat-browser: Benchmark regression gate in CI #83

@clean6378-max-it

Description

Calendar Day

Wednesday, June 17, 2026 (PR 2 of 2)

Planned Effort

3 story points — sprint item #6 (Low)

Depends on: Wednesday PR 1 (session cache #4) merged — baselines must reflect cached-path performance.

Builds on: Week 2 PR #76 benchmark harness and benchmarks/baselines.json schema.

Problem

The CI benchmarks job runs pytest tests/benchmarks/ --benchmark-only and uploads artifacts, but it is labeled "informational" and enforces no threshold. Because benchmarks/baselines.json has empty groups, regressions in parse/export/search pass silently.

Goal

One merged PR that populates baselines from a post-cache run, adds a +20% regression gate, documents baseline updates, and renames the CI job to signal it is gated.

Scope

Touch points

  • benchmarks/baselines.json — populate means (parse small/medium/large, export, search)
  • scripts/check_benchmark_regression.py (new) — compare current vs baseline, exit non-zero if >20%
  • .github/workflows/ci.yml — regression step after benchmark run; rename job
  • Makefile or docs — make update-baselines command
  • Unit test for missing-baseline graceful handling (warn, don't fail)
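A possible shape for the logic behind make update-baselines, as a minimal sketch only. It assumes the run was saved with pytest-benchmark's JSON report (a top-level "benchmarks" list whose entries carry "name" and "stats.mean"), and it assumes a flat name-to-mean layout for benchmarks/baselines.json. The real schema from PR #76 uses groups and may differ, and the function names here are hypothetical.

```python
import json


def baselines_from_report(report):
    """Map benchmark name -> mean seconds from a pytest-benchmark report dict."""
    return {b["name"]: b["stats"]["mean"] for b in report["benchmarks"]}


def update_baselines(results_path, baselines_path):
    """Overwrite the baselines file with means from a fresh benchmark run.

    ASSUMPTION: baselines are stored as a flat {name: mean} JSON object.
    """
    with open(results_path, encoding="utf-8") as fh:
        report = json.load(fh)
    means = baselines_from_report(report)
    with open(baselines_path, "w", encoding="utf-8") as fh:
        # Sorted keys keep diffs of the baselines file small and reviewable.
        json.dump(means, fh, indent=2, sort_keys=True)
        fh.write("\n")
    return means
```

A Makefile target could call update_baselines("benchmark-results.json", "benchmarks/baselines.json") after a post-cache run on ubuntu-latest, so the committed baselines come from the same environment the gate runs in.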

Gate behavior

  • Fail if current_mean / baseline_mean > 1.20
  • Missing baselines for new benchmark names: warn, return 0
  • Gate on ubuntu-latest job only (avoid cross-OS variance)
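The gate rules above could be sketched roughly as follows. This is an illustration under assumptions, not the final scripts/check_benchmark_regression.py: it assumes current results come from a pytest-benchmark JSON report and that baselines are a flat name-to-mean mapping (the actual baselines.json schema may be grouped). All function names are hypothetical.

```python
import json
import sys

THRESHOLD = 1.20  # fail when current_mean / baseline_mean exceeds this


def find_regressions(current, baselines, threshold=THRESHOLD):
    """Compare {name: mean} mappings; return (regressions, missing)."""
    regressions, missing = [], []
    for name, mean in sorted(current.items()):
        base = baselines.get(name)
        if base is None or base <= 0:
            # New benchmark with no usable baseline: warn only, never fail.
            missing.append(name)
            continue
        ratio = mean / base
        if ratio > threshold:
            regressions.append((name, ratio))
    return regressions, missing


def load_current_means(path):
    """Read per-benchmark means from a pytest-benchmark JSON report."""
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)
    return {b["name"]: b["stats"]["mean"] for b in report["benchmarks"]}


def main(argv):
    """Return the process exit code: 1 on any regression, else 0."""
    results_path, baselines_path = argv[1], argv[2]
    current = load_current_means(results_path)
    with open(baselines_path, encoding="utf-8") as fh:
        baselines = json.load(fh)  # ASSUMED flat {name: mean} layout
    regressions, missing = find_regressions(current, baselines)
    for name in missing:
        print(f"WARNING: no baseline for {name}; skipping", file=sys.stderr)
    for name, ratio in regressions:
        print(f"FAIL: {name} is {ratio:.2f}x baseline", file=sys.stderr)
    return 1 if regressions else 0
```

With these rules, a benchmark at exactly 1.20x its baseline passes, since the check is strictly greater than. The script entry point would pass the result of main(sys.argv) to sys.exit so the CI step fails on a non-zero code.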

Acceptance Criteria

  • benchmarks/baselines.json populated from post-cache ubuntu run
  • CI fails on an injected >20% regression; passes when every benchmark is within the threshold
  • Missing-baseline case warns without failing (tested)
  • make update-baselines (or documented equivalent) regenerates baselines
  • Job renamed from "(informational)" to gated label
  • PR approved by at least 1 reviewer

Verification

cd C:\Users\Jasen\CppAliance\claude-code-chat-browser
.\.venv\Scripts\Activate.ps1
pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmark-results.json
python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json

Out of Scope

  • Session cache implementation (Wednesday PR 1 — #4)
  • New benchmark scenarios beyond existing three bench files
