[v0.5.0] benchmark — public token/correctness/latency harness vs docs MCPs (human-led)#70

Merged
ayhammouda merged 2 commits into main from agent/63-public-benchmark-methodology on Jun 8, 2026

ayhammouda (Owner) commented Jun 8, 2026

Refs #63

What changed

  • Added docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md as the human-led methodology foundation for the v0.5.0 public benchmark.
  • Locked the benchmark shape before any harness code: eligible systems, 50-question corpus distribution, truth sources, prompt rules, correctness scoring, Claude-token measurement after client rewrap, latency reporting, reproducibility metadata, and honesty rules.
  • Split the future harness implementation into smaller Refs #63 work packages so agents can later implement plumbing without owning methodology judgment.
  • Refreshed the locked pyjwt transitive dependency from 2.12.1 to 2.13.0 after GitHub's dependency audit flagged PYSEC-2026-175, PYSEC-2026-177, PYSEC-2026-178, and PYSEC-2026-179 on the initial PR run.
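
The measurement dimensions listed above (correctness, tokens, latency) could be captured per question and summarized per system roughly as follows. This is only a sketch: `QuestionResult`, its field names, and the choice of medians are hypothetical and do not come from the PR; the methodology document, not this example, decides what is actually recorded and reported.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class QuestionResult:
    """One system's answer to one corpus question (hypothetical field names)."""
    system: str
    question_id: str
    correct: bool
    tokens: int          # assumed: tokens counted after the client rewraps tool output
    latency_ms: float


def summarize(results: list[QuestionResult]) -> dict[str, dict[str, float]]:
    """Group results by system and report accuracy plus median tokens and latency."""
    by_system: dict[str, list[QuestionResult]] = {}
    for r in results:
        by_system.setdefault(r.system, []).append(r)
    return {
        system: {
            "accuracy": sum(r.correct for r in rows) / len(rows),
            "median_tokens": median(r.tokens for r in rows),
            "median_latency_ms": median(r.latency_ms for r in rows),
        }
        for system, rows in by_system.items()
    }
```

Keeping one immutable record per (system, question) pair makes it easy to store raw rows alongside the summary, so a reader could recompute any reported number.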

Acceptance notes

Validation

  • uv run ruff check src/ tests/ -> passed
  • uv run pyright src/ -> passed; the only warning was the upstream pyright update notice
  • uv run pytest --tb=short -q -> 307 passed
  • uv run python-docs-mcp-server doctor -> all checks passed
  • uv lock --check -> passed
  • uv export --locked --format requirements-txt --all-groups --all-extras --no-emit-project --no-hashes --output-file /tmp/requirements-audit-63.txt && uvx pip-audit --requirement /tmp/requirements-audit-63.txt --no-deps --disable-pip --progress-spinner off -> no known vulnerabilities found
  • uv pip compile --quiet pyproject.toml -o /tmp/requirements-check-63.txt -> passed
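
To re-run the shorter checks from the list above in one go, a small wrapper along these lines might help. It is a sketch, not part of the PR: `CHECKS`, `run_checks`, and the injectable `runner` parameter are hypothetical names, and the two long `pip-audit` and `uv pip compile` commands are left out for brevity.

```python
import shlex
import subprocess
from typing import Callable, List, Optional, Sequence

# Commands copied verbatim from the Validation list (long audit commands omitted).
CHECKS = [
    "uv run ruff check src/ tests/",
    "uv run pyright src/",
    "uv run pytest --tb=short -q",
    "uv run python-docs-mcp-server doctor",
    "uv lock --check",
]


def run_checks(
    commands: Sequence[str],
    runner: Optional[Callable[[List[str]], int]] = None,
) -> List[str]:
    """Run each command in order and return the ones that exited non-zero."""
    if runner is None:
        # Default: execute the command for real and use its exit code.
        def runner(argv: List[str]) -> int:
            return subprocess.run(argv).returncode

    failed = []
    for command in commands:
        if runner(shlex.split(command)) != 0:
            failed.append(command)
    return failed
```

Calling `run_checks(CHECKS)` from the repository root returns an empty list when every command succeeds. The `runner` argument exists only so the loop can be exercised without invoking `uv`.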

Why this approach

Issue #63 explicitly says the benchmark is human-led because methodology and corpus selection are maintainer judgment calls. This PR handles that judgment first and leaves the runnable harness as follow-up implementation work.

coderabbitai (Bot) commented Jun 8, 2026

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (2)
  • docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md is excluded by none and included by none
  • uv.lock is excluded by !**/*.lock and included by none

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bd33ce33-be43-436b-8b35-7682dc5da107

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

ayhammouda merged commit 80a093c into main on Jun 8, 2026 (8 checks passed)
ayhammouda deleted the agent/63-public-benchmark-methodology branch on June 8, 2026 21:36