feat(eval): pack --variance-probe — measure the variance an OCH pack removes (Move 2) by theagenticguy · Pull Request #269 · theagenticguy/opencodehub

theagenticguy · 2026-06-30T04:28:49Z

What

Implements spec 010 (Move 2) — codehub code-pack --variance-probe <task-file>. The probe is the empirical instrument behind the decision-equivalence contract (Move 6): if the OCH pack genuinely pins a coding agent's retrieval decision, the agent's answer wanders less run-to-run. The probe turns that claim into a number.

Given a task — a fixed triple (repo @ commit, instruction, success_oracle) — it runs an agent N times (default 10) in each of two arms (with-pack / without-pack), holding commit/instruction/agent/model fixed, and reports the dispersion delta + token overhead. Grounded against arXiv:2606.26979 ("deterministic anchoring halves run-to-run variance at ~10% more tokens").
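The two-arm design described above can be sketched roughly as follows. This is a minimal illustration, not the package's code: `RunOutcome`, `runArm`, `varianceProbe`, and the shape of `AgentRunner` are all hypothetical. The runner is synchronous here for brevity, whereas a runner that shells out to an agent CLI would more likely be asynchronous. The sketch uses the distinct-output ratio as the dispersion metric. The sign of the delta (without-pack minus with-pack) and the definition of token overhead as a relative increase in mean tokens are assumptions.

```typescript
// All names below are illustrative; they are not the package's real exports.
interface RunOutcome {
  output: string; // the agent's final answer
  tokens: number; // total tokens consumed by the run
}

// Synchronous stand-in for the AgentRunner interface (signature assumed).
// packContext is null in the without-pack arm.
type AgentRunner = (instruction: string, packContext: string | null) => RunOutcome;

interface ArmSummary {
  distinctOutputRatio: number;
  meanTokens: number;
}

interface ProbeResult {
  withPack: ArmSummary;
  withoutPack: ArmSummary;
  dispersionDelta: number; // assumed sign: positive means the pack reduced dispersion
  tokenOverhead: number; // assumed definition: relative increase in mean tokens
}

// Run one arm N times with everything except the pack held fixed.
function runArm(
  runner: AgentRunner,
  instruction: string,
  packContext: string | null,
  runs: number,
): ArmSummary {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new Error("runs must be a positive integer");
  }
  const outcomes: RunOutcome[] = [];
  for (let i = 0; i < runs; i++) {
    outcomes.push(runner(instruction, packContext));
  }
  const distinct = new Set(outcomes.map((o) => o.output)).size;
  const totalTokens = outcomes.reduce((sum, o) => sum + o.tokens, 0);
  return { distinctOutputRatio: distinct / runs, meanTokens: totalTokens / runs };
}

// Compare the with-pack arm against the without-pack arm.
function varianceProbe(
  runner: AgentRunner,
  instruction: string,
  packContext: string,
  runs: number = 10,
): ProbeResult {
  const withPack = runArm(runner, instruction, packContext, runs);
  const withoutPack = runArm(runner, instruction, null, runs);
  return {
    withPack,
    withoutPack,
    dispersionDelta: withoutPack.distinctOutputRatio - withPack.distinctOutputRatio,
    tokenOverhead: withPack.meanTokens / withoutPack.meanTokens - 1,
  };
}
```

Under these assumptions, a pack that pins the agent's answer shows up as a positive `dispersionDelta`, and its cost shows up as a positive `tokenOverhead`.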

New package: `@opencodehub/eval`

The package was a stub (Python harness extracted to opencodehub-testbed so the published set ships free of test-time deps). This is its first TS content, and it honors that intent: private, dep-light (pure JS + node:child_process + zod/yaml, no Python, no heavy deps), and force-bundled into the CLI tarball via tsup noExternal (verified: 0 surviving external imports in dist/index.js).

- task loader: YAML/JSON + Zod, strict, fail-fast (an expensive experiment never starts on a bad task)
- dispersion stats: distinct-output ratio / Bernoulli pass-rate stddev / judge-score stddev; pure, exhaustively unit-covered
- oracle scoring: output_hash (zero-config) | assertion (default, objective) | judge (LLM-panel, interface-ready)
- AgentRunner interface + the v1 direct-CLI runner
- deterministic report: canonicalJson, no clock/run-id (R6): two probe runs over the same captured outcomes serialize byte-identically
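As a rough illustration of the pure helpers listed above, the three dispersion metrics and a key-sorted JSON serializer might look like the sketch below. The function names are hypothetical. Using the population standard deviation for judge scores, and sorting object keys as the canonicalization rule, are assumptions; the PR does not spell out either choice.

```typescript
// Distinct-output ratio: distinct outputs divided by the number of runs.
function distinctOutputRatio(outputs: string[]): number {
  if (outputs.length === 0) throw new Error("need at least one run");
  return new Set(outputs).size / outputs.length;
}

// Bernoulli pass-rate stddev: sqrt(p * (1 - p)) for an observed pass rate p.
function passRateStddev(passes: boolean[]): number {
  if (passes.length === 0) throw new Error("need at least one run");
  const p = passes.filter((x) => x).length / passes.length;
  return Math.sqrt(p * (1 - p));
}

// Judge-score stddev. Assumption: population stddev (divide by n, not n - 1).
function judgeScoreStddev(scores: number[]): number {
  if (scores.length === 0) throw new Error("need at least one run");
  const mean = scores.reduce((s, x) => s + x, 0) / scores.length;
  const variance =
    scores.reduce((s, x) => s + (x - mean) * (x - mean), 0) / scores.length;
  return Math.sqrt(variance);
}

// Canonical JSON sketch: object keys are sorted recursively, so two reports
// with the same content serialize to the same bytes regardless of insertion
// order. Assumes JSON-safe input (no undefined, functions, or cycles).
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k]))
      .join(",");
    return "{" + body + "}";
  }
  return JSON.stringify(value);
}
```

Leaving timestamps and run ids out of the serialized object, as the report does per R6, is what lets a serializer like this produce byte-identical output across probe runs over the same captured outcomes.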

Bedrock inference (hard constraint, spec 010 §4a)

Both agents route inference through Amazon Bedrock, grounded against current docs (not recalled):

- Claude Code → CLAUDE_CODE_USE_BEDROCK=1 + AWS_REGION + us.-prefixed ANTHROPIC_MODEL inference profile; claude -p "<prompt>" --output-format json --model <id>; reads .result / .usage / .total_cost_usd.
- Codex → first-party amazon-bedrock provider; codex exec --json -c model_provider=amazon-bedrock -m <model> --skip-git-repo-check "<prompt>"; final answer = last agent_message, tokens = last turn.completed.usage.
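The two invocations above could be assembled along these lines. This is a sketch under stated assumptions: `buildInvocation`, `Invocation`, and `parseClaudeResult` are hypothetical names, the flags and environment variables are copied from the description above, and passing `AWS_REGION` to Codex is an assumption (the description only lists it for Claude Code). The internal shape of `.usage` is not described in this PR, so it stays `unknown`.

```typescript
type Harness = "claude" | "codex";

interface Invocation {
  command: string;
  args: string[];
  env: Record<string, string>; // extra variables to merge into the child env
}

// Build argv and env for one agent run; flags mirror the PR description.
function buildInvocation(
  harness: Harness,
  prompt: string,
  model: string,
  awsRegion: string,
): Invocation {
  if (harness === "claude") {
    return {
      command: "claude",
      args: ["-p", prompt, "--output-format", "json", "--model", model],
      env: {
        CLAUDE_CODE_USE_BEDROCK: "1",
        AWS_REGION: awsRegion,
        ANTHROPIC_MODEL: model, // expected to be a us.-prefixed inference profile
      },
    };
  }
  return {
    command: "codex",
    args: [
      "exec", "--json",
      "-c", "model_provider=amazon-bedrock",
      "-m", model,
      "--skip-git-repo-check",
      prompt,
    ],
    env: { AWS_REGION: awsRegion }, // assumption: not stated for Codex in the PR
  };
}

interface ClaudeResult {
  result: string;
  usage: unknown; // internal shape not described in this PR
  totalCostUsd: number;
}

// Pull the three fields the PR names out of Claude Code's JSON output.
function parseClaudeResult(stdout: string): ClaudeResult {
  const parsed = JSON.parse(stdout) as Record<string, unknown>;
  if (typeof parsed.result !== "string") {
    throw new Error("missing .result in Claude Code output");
  }
  if (typeof parsed.total_cost_usd !== "number") {
    throw new Error("missing .total_cost_usd in Claude Code output");
  }
  return {
    result: parsed.result,
    usage: parsed.usage,
    totalCostUsd: parsed.total_cost_usd,
  };
}
```

Passing the prompt as a single argv element, rather than interpolating it into a shell string, avoids quoting problems when the instruction contains spaces or quotes.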

Notably, neither CLI exposes temperature/seed — which is consistent with the design: within-arm sampling nondeterminism is the variance being measured.

CLI

codehub code-pack --variance-probe <task-file> [--runs N] [--harness claude|codex] [--aws-region R] [--model ID] [--json]. The command generates the pack once, assembles it into packContext, and runs the experiment. On-demand only — never a CI gate (§8). Example task at packages/eval/examples/variance-task.yaml.
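For illustration only, the flag surface above maps onto an options object like the one below. `parseProbeArgs` and `ProbeOptions` are hypothetical, the hand-rolled loop stands in for whatever argument parser the CLI actually uses, and only the default of 10 runs comes from the PR; no defaults are assumed for the other flags.

```typescript
interface ProbeOptions {
  taskFile: string;
  runs: number;
  harness?: "claude" | "codex";
  awsRegion?: string;
  model?: string;
  json: boolean;
}

// Parse the arguments that follow `codehub code-pack`.
function parseProbeArgs(argv: string[]): ProbeOptions {
  let taskFile: string | undefined;
  let runs = 10; // default stated in the PR
  let harness: "claude" | "codex" | undefined;
  let awsRegion: string | undefined;
  let model: string | undefined;
  let json = false;

  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === "--json") {
      json = true;
      continue;
    }
    // Every other flag takes exactly one value.
    const value = argv[i + 1];
    if (value === undefined) throw new Error("missing value for " + flag);
    i++;
    switch (flag) {
      case "--variance-probe":
        taskFile = value;
        break;
      case "--runs":
        runs = Number(value);
        if (!Number.isInteger(runs) || runs < 1) {
          throw new Error("--runs must be a positive integer");
        }
        break;
      case "--harness":
        if (value !== "claude" && value !== "codex") {
          throw new Error("--harness must be claude or codex");
        }
        harness = value;
        break;
      case "--aws-region":
        awsRegion = value;
        break;
      case "--model":
        model = value;
        break;
      default:
        throw new Error("unknown flag " + flag);
    }
  }
  if (taskFile === undefined) throw new Error("--variance-probe <task-file> is required");
  return { taskFile, runs, harness, awsRegion, model, json };
}
```

Rejecting unknown flags and malformed values up front follows the same fail-fast reasoning as the task loader: the experiment is expensive, so a bad invocation should stop before any agent run starts.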

Scope / deferrals

- omnigent-backed multi-agent runner → v2, behind the same AgentRunner interface (CLI-first, per the approved spec).
- The replay / decision-equivalence structural check is the other half of Move 6 (spec 011, separate).

Validation

- biome ci . ✓ (709 files, 0 errors)
- tsc -b full workspace ✓
- full build ✓ (@opencodehub/eval inlined into the CLI bundle, 0 external imports)
- full test suite ✓ (0 failures across all 18 packages; +69 new eval tests, +3 new CLI command tests)
- pre-commit (biome, commitlint, banned-strings) + pre-push (verdict, typecheck, test) hooks ✓

Adds eval to the commitlint scope-enum (new workspace package).

🤖 Generated with Claude Code

Move 2 spec, built on the Move 6 ruling (contract pivots byte-identity to decision-equivalence). Defines "task" as a fixed (repo@commit, instruction, success_oracle) triple; three oracle types (output-hash / assertion / judge) with a precise per-arm dispersion metric each; the with/without experimental design with token-overhead as a first-class output; and an AgentRunner interface with an omnigent-backed default (grounded: alpha, 5.5k stars, Apache-2.0, drives Claude Code + Codex from one harness) plus a dependency-light direct-CLI fallback. Lands in the packages/eval stub. 7 EARS reqs + 5 open questions for review. NO implementation — review gate first.

…removes

Implements spec 010 (Move 2): the empirical instrument behind the decision-equivalence contract (Move 6). Given a task triple (repo @ commit, instruction, success_oracle), the probe runs a coding agent N times (default 10) per arm, with vs. without the OCH pack in context, and reports the run-to-run dispersion delta plus token overhead.

@opencodehub/eval (new, private, dep-light: pure JS + node:child_process, honoring the package's "ships free of test-time deps" intent; force-bundled into the CLI tarball via tsup noExternal):
- task loader (YAML/JSON + Zod, strict, fail-fast)
- dispersion stats (distinct-output ratio / Bernoulli pass-rate stddev / judge-score stddev), pure, exhaustively unit-covered
- oracle scoring (output_hash | assertion | judge)
- AgentRunner interface + the v1 direct-CLI runner
- deterministic report (canonicalJson, no clock/run-id, per R6)

Direct-CLI runner routes BOTH agents' inference through Amazon Bedrock (spec 010 §4a, grounded against current docs, not recalled):
- Claude Code: CLAUDE_CODE_USE_BEDROCK=1 + us.-prefixed ANTHROPIC_MODEL inference profile; claude -p ... --output-format json
- Codex: codex exec --json -c model_provider=amazon-bedrock -m ...

CLI: codehub code-pack --variance-probe <task-file> [--runs N] [--harness claude|codex] [--aws-region R] [--model ID] [--json]. The command generates the pack once, assembles it into packContext, and runs the with/without experiment. On-demand only, never a CI gate (§8).

omnigent-backed multi-agent runner deferred to v2 behind the same interface (CLI-first, per the approved spec).

Adds `eval` to the commitlint scope-enum (new workspace package).

🤖 Automated release via release-please

---

<details><summary>root: 0.10.5</summary>

## [0.10.5](root-v0.10.4...root-v0.10.5) (2026-06-30)

### Features

* **eval:** pack --variance-probe — measure the variance an OCH pack removes (Move 2) ([#269](#269)) ([278702a](278702a))
* **frameworks:** wire stage-5 import/SCIP detection into the profile phase ([#267](#267)) ([6b4d122](6b4d122))
* **pack:** codehub replay — decision-equivalence structural check (Move 6) ([#270](#270)) ([f97b417](f97b417))

</details>

<details><summary>cli: 0.10.5</summary>

## [0.10.5](cli-v0.10.4...cli-v0.10.5) (2026-06-30)

### Features

* **eval:** pack --variance-probe — measure the variance an OCH pack removes (Move 2) ([#269](#269)) ([278702a](278702a))
* **pack:** codehub replay — decision-equivalence structural check (Move 6) ([#270](#270)) ([f97b417](f97b417))

</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

theagenticguy added 2 commits June 30, 2026 03:49

theagenticguy merged commit 278702a into main Jun 30, 2026
38 checks passed

theagenticguy deleted the feat/variance-probe branch June 30, 2026 04:33

github-actions Bot mentioned this pull request Jun 30, 2026

chore: release main #268

Merged
