[v0.5.0] benchmark — public token/correctness/latency harness vs docs MCPs (human-led) by ayhammouda · Pull Request #70 · ayhammouda/python-docs-mcp-server

ayhammouda · 2026-06-08T21:32:38Z

Refs #63

What changed

Added docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md as the human-led methodology foundation for the v0.5.0 public benchmark.
Locked the benchmark shape before any harness code: eligible systems, 50-question corpus distribution, truth sources, prompt rules, correctness scoring, Claude-token measurement after client rewrap, latency reporting, reproducibility metadata, and honesty rules.
Split the future harness implementation into smaller Refs #63 work packages so agents can later implement plumbing without owning methodology judgment.
Refreshed the locked pyjwt transitive dependency from 2.12.1 to 2.13.0 after GitHub's dependency audit flagged PYSEC-2026-175, PYSEC-2026-177, PYSEC-2026-178, and PYSEC-2026-179 on the initial PR run.
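As a rough illustration of the three measurements named above (correctness, tokens after client rewrap, latency), a per-question result record and a per-system summary might look like the sketch below. The methodology document itself is not reproduced in this PR description, so every name here (`QuestionResult`, `summarize`, the field names) is hypothetical, and the use of medians is an assumption rather than the locked reporting rule.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class QuestionResult:
    """One system's result on one corpus question (hypothetical shape)."""

    system: str
    question_id: str
    correct: bool
    tokens: int          # assumed: counted after the client rewraps tool output
    latency_ms: float


def summarize(results: list[QuestionResult]) -> dict[str, float]:
    """Aggregate accuracy, median tokens, and median latency for one system."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "accuracy": sum(r.correct for r in results) / len(results),
        "median_tokens": median(r.tokens for r in results),
        "median_latency_ms": median(r.latency_ms for r in results),
    }
```

Keeping the record immutable and the summary a pure function would let the later harness work packages add plumbing without touching the scoring shape.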

Acceptance notes

Defines the public benchmark methodology before publishing any comparative claim.
Keeps competitor results and claims out of README/PyPI/launch copy until data exists.
Preserves issue #63 as human-led; this PR is foundational and does not close the issue.
Keeps the security audit green by updating the lockfile-only pyjwt transitive dependency.

Validation

uv run ruff check src/ tests/ -> passed
uv run pyright src/ -> passed, with upstream pyright update warning only
uv run pytest --tb=short -q -> 307 passed
uv run python-docs-mcp-server doctor -> all checks passed
uv lock --check -> passed
uv export --locked --format requirements-txt --all-groups --all-extras --no-emit-project --no-hashes --output-file /tmp/requirements-audit-63.txt && uvx pip-audit --requirement /tmp/requirements-audit-63.txt --no-deps --disable-pip --progress-spinner off -> no known vulnerabilities found
uv pip compile --quiet pyproject.toml -o /tmp/requirements-check-63.txt -> passed
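The shorter checks above could be replayed in one pass with a small runner like the following sketch. The command lines are copied from the list above; the `CHECKS`, `run_checks`, and `main` names are hypothetical, and the longer export and `pip-audit` pipeline is left out because it depends on a temporary file.

```python
import subprocess

# Command lines taken from the Validation list; run from the repository root.
CHECKS: list[list[str]] = [
    ["uv", "run", "ruff", "check", "src/", "tests/"],
    ["uv", "run", "pyright", "src/"],
    ["uv", "run", "pytest", "--tb=short", "-q"],
    ["uv", "run", "python-docs-mcp-server", "doctor"],
    ["uv", "lock", "--check"],
]


def run_checks(commands: list[list[str]]) -> list[tuple[str, int]]:
    """Run each command in order and return (command line, exit code) pairs."""
    outcomes: list[tuple[str, int]] = []
    for command in commands:
        completed = subprocess.run(command, check=False)
        outcomes.append((" ".join(command), completed.returncode))
    return outcomes


def main() -> int:
    """Print one status line per check; return 0 only if every check passed."""
    outcomes = run_checks(CHECKS)
    for line, code in outcomes:
        print(f"{'ok' if code == 0 else 'FAIL'}  {line}")
    return 0 if all(code == 0 for _, code in outcomes) else 1
```

Calling `raise SystemExit(main())` from a script would make the combined result usable as a local pre-push gate. Running every check instead of stopping at the first failure is a design choice here, so one pass reports all problems.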

Why this approach

Issue #63 explicitly says the benchmark is human-led because methodology and corpus selection are maintainer judgment calls. This PR handles that judgment first and leaves the runnable harness as follow-up implementation work.

coderabbitai · 2026-06-08T21:32:44Z

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (2)

docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md is excluded by none and included by none
uv.lock is excluded by !**/*.lock and included by none

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bd33ce33-be43-436b-8b35-7682dc5da107

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


agent: docs: define public benchmark methodology

0e20ae5

agent: deps: refresh pyjwt lock

50c40b2

ayhammouda merged commit 80a093c into main Jun 8, 2026
8 checks passed

ayhammouda deleted the agent/63-public-benchmark-methodology branch June 8, 2026 21:36

ayhammouda mentioned this pull request Jun 8, 2026

[v0.5.0] benchmark — public token/correctness/latency harness vs docs MCPs (human-led) #63

Open

ayhammouda merged 2 commits into main from agent/63-public-benchmark-methodology
