feat(eval): pack --variance-probe — measure the variance an OCH pack removes (Move 2) #269
Merged
Move 2 spec, built on the Move 6 ruling (contract pivots byte-identity to decision-equivalence). Defines "task" as a fixed (repo@commit, instruction, success_oracle) triple; three oracle types (output-hash / assertion / judge) with a precise per-arm dispersion metric each; the with/without experimental design with token-overhead as a first-class output; and an AgentRunner interface with an omnigent-backed default (grounded: alpha, 5.5k stars, Apache-2.0, drives Claude Code + Codex from one harness) plus a dependency-light direct-CLI fallback. Lands in the packages/eval stub. 7 EARS reqs + 5 open questions for review. NO implementation — review gate first.
Implements spec 010 (Move 2): the empirical instrument behind the
decision-equivalence contract (Move 6). Given a task triple
(repo @ commit, instruction, success_oracle), the probe runs a coding
agent N times (default 10) per arm — with vs. without the OCH pack in
context — and reports the run-to-run dispersion delta plus token
overhead.
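
A minimal sketch of the two-arm loop described above. The `Task`, `RunOutcome`, and `AgentRunner` shapes here are hypothetical (the real interface in `@opencodehub/eval` may differ), the success oracle is omitted for brevity, and interleaving the arms is an assumption of this sketch rather than something the spec is known to require.

```typescript
// Hypothetical shapes for illustration only.
interface Task {
  repo: string;
  commit: string;
  instruction: string;
}

interface RunOutcome {
  output: string; // the agent's final answer
  tokens: number; // total tokens consumed by the run
}

interface AgentRunner {
  // packContext is null in the without-pack arm.
  run(task: Task, packContext: string | null): Promise<RunOutcome>;
}

interface ArmOutcomes {
  withPack: RunOutcome[];
  withoutPack: RunOutcome[];
}

// Run the same task `runs` times per arm; only the pack context differs.
async function runArms(
  runner: AgentRunner,
  task: Task,
  packContext: string,
  runs = 10,
): Promise<ArmOutcomes> {
  const withPack: RunOutcome[] = [];
  const withoutPack: RunOutcome[] = [];
  for (let i = 0; i < runs; i++) {
    withPack.push(await runner.run(task, packContext));
    withoutPack.push(await runner.run(task, null));
  }
  return { withPack, withoutPack };
}

// Relative token overhead of the with-pack arm (0.1 means 10% more tokens).
// Returns NaN if either arm is empty.
function tokenOverhead(arms: ArmOutcomes): number {
  const mean = (xs: RunOutcome[]) =>
    xs.reduce((sum, x) => sum + x.tokens, 0) / xs.length;
  return mean(arms.withPack) / mean(arms.withoutPack) - 1;
}
```

Keeping the runner behind an interface is what lets a different backend be swapped in later without touching the loop.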
@opencodehub/eval (new, private, dep-light — pure JS + node:child_process,
honoring the package's "ships free of test-time deps" intent; force-bundled
into the CLI tarball via tsup noExternal):
- task loader (YAML/JSON + Zod, strict, fail-fast)
- dispersion stats (distinct-output ratio / Bernoulli pass-rate stddev /
judge-score stddev) — pure, exhaustively unit-covered
- oracle scoring (output_hash | assertion | judge)
- AgentRunner interface + the v1 direct-CLI runner
- deterministic report (canonicalJson, no clock/run-id — R6)
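
The three dispersion metrics named in the list above can be sketched as pure functions. The exact definitions in spec 010 are not reproduced here, so these are assumptions: distinct outputs divided by run count, the Bernoulli standard deviation sqrt(p(1-p)) of the pass rate, and the population standard deviation of judge scores.

```typescript
// Fraction of runs that produced a distinct output (1/N = fully stable, 1 = all different).
function distinctOutputRatio(outputs: string[]): number {
  if (outputs.length === 0) throw new Error("no outputs");
  return new Set(outputs).size / outputs.length;
}

// Standard deviation of a Bernoulli variable with the observed pass rate p.
function bernoulliStddev(passes: boolean[]): number {
  if (passes.length === 0) throw new Error("no results");
  const p = passes.filter((passed) => passed).length / passes.length;
  return Math.sqrt(p * (1 - p));
}

// Population standard deviation of judge scores (an assumption; the spec may use the sample form).
function scoreStddev(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores");
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length;
  return Math.sqrt(variance);
}
```

For example, four runs that yield three distinct outputs give a ratio of 0.75, and an arm that passes half its runs has the maximum Bernoulli standard deviation of 0.5.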
Direct-CLI runner routes BOTH agents' inference through Amazon Bedrock
(spec 010 §4a, grounded against current docs, not recalled):
- Claude Code: CLAUDE_CODE_USE_BEDROCK=1 + us.-prefixed ANTHROPIC_MODEL
inference profile; claude -p ... --output-format json
- Codex: codex exec --json -c model_provider=amazon-bedrock -m ...
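
As an illustration of the two invocations above, the sketch below assembles the command, arguments, and environment for each harness without spawning anything. `buildInvocation` and `Invocation` are hypothetical names; the flags and environment variables are the ones this PR names, and the result could be handed to `spawn` from `node:child_process`.

```typescript
type Harness = "claude" | "codex";

interface Invocation {
  command: string;
  args: string[];
  env: Record<string, string>; // extra variables to merge into the child environment
}

// Hypothetical helper: build the CLI invocation for one agent run.
function buildInvocation(
  harness: Harness,
  prompt: string,
  model: string,
  awsRegion: string,
): Invocation {
  if (harness === "claude") {
    return {
      command: "claude",
      args: ["-p", prompt, "--output-format", "json", "--model", model],
      env: {
        CLAUDE_CODE_USE_BEDROCK: "1",
        AWS_REGION: awsRegion,
        ANTHROPIC_MODEL: model, // expected to be a us.-prefixed inference profile
      },
    };
  }
  return {
    command: "codex",
    args: [
      "exec",
      "--json",
      "-c",
      "model_provider=amazon-bedrock",
      "-m",
      model,
      "--skip-git-repo-check",
      prompt,
    ],
    // The PR names no Codex-specific environment variables, so none are set here.
    env: {},
  };
}
```

Passing the prompt as a separate argv element avoids shell quoting issues that a single command string would have.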
CLI: codehub code-pack --variance-probe <task-file> [--runs N]
[--harness claude|codex] [--aws-region R] [--model ID] [--json]. The
command generates the pack once, assembles it into packContext, and runs
the with/without experiment. On-demand only — never a CI gate (§8).
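
A strict, fail-fast check of the parsed `<task-file>` might look like the sketch below. The package itself uses Zod; this version is dependency-free so it stands alone. The field names (`repo`, `commit`, `instruction`, `success_oracle.type`) are assumptions based on the task triple and the three oracle types, not the actual schema, and oracle-specific fields are left unvalidated because the PR does not list them.

```typescript
type OracleType = "output_hash" | "assertion" | "judge";

// Hypothetical task shape; the real schema may differ.
interface VarianceTask {
  repo: string;
  commit: string;
  instruction: string;
  success_oracle: { type: OracleType };
}

const TASK_KEYS = ["repo", "commit", "instruction", "success_oracle"];
const ORACLE_TYPES = ["output_hash", "assertion", "judge"];

function requireString(obj: Record<string, unknown>, key: string): string {
  const value = obj[key];
  if (typeof value !== "string" || value === "") {
    throw new Error(`task.${key} must be a non-empty string`);
  }
  return value;
}

// Validate an already-parsed YAML/JSON document; throws on the first problem.
function parseTask(raw: unknown): VarianceTask {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("task must be an object");
  }
  const obj = raw as Record<string, unknown>;
  // Strict mode: unknown top-level keys are an error, not ignored.
  for (const key of Object.keys(obj)) {
    if (TASK_KEYS.indexOf(key) === -1) throw new Error(`unknown key: ${key}`);
  }
  const oracle = obj["success_oracle"];
  if (typeof oracle !== "object" || oracle === null) {
    throw new Error("task.success_oracle must be an object");
  }
  const type = (oracle as Record<string, unknown>)["type"];
  if (typeof type !== "string" || ORACLE_TYPES.indexOf(type) === -1) {
    throw new Error("task.success_oracle.type must be output_hash, assertion, or judge");
  }
  return {
    repo: requireString(obj, "repo"),
    commit: requireString(obj, "commit"),
    instruction: requireString(obj, "instruction"),
    success_oracle: { type: type as OracleType },
  };
}
```

Rejecting unknown keys up front means a typo in a task file fails before any agent run is paid for.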
omnigent-backed multi-agent runner deferred to v2 behind the same
interface (CLI-first, per the approved spec).
Adds `eval` to the commitlint scope-enum (new workspace package).
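
The deterministic-report requirement (R6) depends on a serializer whose output does not vary with key insertion order. The following is a minimal sketch of that idea, not the repository's actual `canonicalJson` helper, whose behavior may differ in details such as number formatting.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every level so equal data yields equal bytes.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    // Array order is meaningful and is preserved.
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const members = keys.map(
    (key) => JSON.stringify(key) + ":" + canonicalJson(value[key]),
  );
  return "{" + members.join(",") + "}";
}
```

Leaving timestamps and run ids out of the report object is the other half of the requirement: sorted keys only help if the data itself is the same between runs.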
theagenticguy pushed a commit that referenced this pull request on Jun 30, 2026:
🤖 Automated release via release-please

---

<details><summary>root: 0.10.5</summary>

## [0.10.5](root-v0.10.4...root-v0.10.5) (2026-06-30)

### Features

* **eval:** pack --variance-probe — measure the variance an OCH pack removes (Move 2) ([#269](#269)) ([278702a](278702a))
* **frameworks:** wire stage-5 import/SCIP detection into the profile phase ([#267](#267)) ([6b4d122](6b4d122))
* **pack:** codehub replay — decision-equivalence structural check (Move 6) ([#270](#270)) ([f97b417](f97b417))

</details>

<details><summary>cli: 0.10.5</summary>

## [0.10.5](cli-v0.10.4...cli-v0.10.5) (2026-06-30)

### Features

* **eval:** pack --variance-probe — measure the variance an OCH pack removes (Move 2) ([#269](#269)) ([278702a](278702a))
* **pack:** codehub replay — decision-equivalence structural check (Move 6) ([#270](#270)) ([f97b417](f97b417))

</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
What

Implements spec 010 (Move 2): `codehub code-pack --variance-probe <task-file>`. The probe is the empirical instrument behind the decision-equivalence contract (Move 6): if the OCH pack genuinely pins a coding agent's retrieval decision, the agent's answer wanders less run-to-run. The probe turns that claim into a number.

Given a task, defined as a fixed triple `(repo @ commit, instruction, success_oracle)`, it runs an agent N times (default 10) in each of two arms (with-pack / without-pack), holding commit/instruction/agent/model fixed, and reports the dispersion delta + token overhead. Grounded against arXiv:2606.26979 ("deterministic anchoring halves run-to-run variance at ~10% more tokens").

New package: `@opencodehub/eval`

The package was a stub (Python harness extracted to `opencodehub-testbed` so the published set ships free of test-time deps). This is its first TS content, and it honors that intent: `private`, dep-light (pure JS + `node:child_process` + zod/yaml, no Python, no heavy deps), and force-bundled into the CLI tarball via tsup `noExternal` (verified: 0 surviving external imports in `dist/index.js`).

- Oracle scoring: `output_hash` (zero-config) | `assertion` (default, objective) | `judge` (LLM-panel, interface-ready)
- Deterministic report: `canonicalJson`, no clock/run-id (R6): two probe runs over the same captured outcomes serialize byte-identically

Bedrock inference (hard constraint, spec 010 §4a)
Both agents route inference through Amazon Bedrock, grounded against current docs (not recalled):

- Claude Code: `CLAUDE_CODE_USE_BEDROCK=1` + `AWS_REGION` + `us.`-prefixed `ANTHROPIC_MODEL` inference profile; `claude -p "<prompt>" --output-format json --model <id>`; reads `.result` / `.usage` / `.total_cost_usd`.
- Codex: `amazon-bedrock` provider; `codex exec --json -c model_provider=amazon-bedrock -m <model> --skip-git-repo-check "<prompt>"`; final answer = last `agent_message`, tokens = last `turn.completed.usage`.

Notably, neither CLI exposes temperature or seed. That is consistent with the design: within-arm sampling nondeterminism is the variance being measured.
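
The output fields named above could be extracted roughly as follows. This is a sketch under assumptions: the Claude result is taken to be a single JSON document with top-level `result`, `usage`, and `total_cost_usd`, and the Codex stream is taken to be JSON Lines in which an answer event carries `item.type === "agent_message"` with an `item.text` string and a turn-end event has `type === "turn.completed"` with a `usage` object. The event envelope in particular is not confirmed by this PR, and `ParsedRun` is a hypothetical type.

```typescript
interface ParsedRun {
  answer: string | null;
  usage: Record<string, unknown> | null;
  costUsd: number | null;
}

// Claude Code with --output-format json: one JSON document on stdout.
function parseClaudeOutput(stdout: string): ParsedRun {
  const doc = JSON.parse(stdout);
  return {
    answer: typeof doc.result === "string" ? doc.result : null,
    usage: doc.usage ?? null,
    costUsd: typeof doc.total_cost_usd === "number" ? doc.total_cost_usd : null,
  };
}

// Codex with --json: one event per line; keep the last answer and the last usage.
function parseCodexOutput(stdout: string): ParsedRun {
  let answer: string | null = null;
  let usage: Record<string, unknown> | null = null;
  for (const line of stdout.split("\n")) {
    if (line.trim() === "") continue;
    const event = JSON.parse(line);
    if (event.type === "turn.completed" && event.usage) {
      usage = event.usage;
    }
    if (event.item?.type === "agent_message" && typeof event.item.text === "string") {
      answer = event.item.text;
    }
  }
  // No cost field is read for Codex; the PR mentions only token usage.
  return { answer, usage, costUsd: null };
}
```

Normalizing both harnesses to one shape lets the oracle and dispersion code stay agnostic about which agent produced a run.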
CLI

`codehub code-pack --variance-probe <task-file> [--runs N] [--harness claude|codex] [--aws-region R] [--model ID] [--json]`. The command generates the pack once, assembles it into `packContext`, and runs the experiment. On-demand only, never a CI gate (§8). Example task at `packages/eval/examples/variance-task.yaml`.

Scope / deferrals

- omnigent-backed multi-agent runner deferred to v2 behind the same `AgentRunner` interface (CLI-first, per the approved spec).
- `replay` / decision-equivalence structural check is the other half of Move 6 (spec 011, separate).

Validation

- `biome ci .` ✓ (709 files, 0 errors)
- `tsc -b` full workspace ✓
- `@opencodehub/eval` inlined into the CLI bundle (0 external imports)
- fail 0; +69 new eval tests, +3 new CLI command tests

Adds `eval` to the commitlint scope-enum (new workspace package).

🤖 Generated with Claude Code