feat(eval): pack --variance-probe — measure the variance an OCH pack removes (Move 2)#269

Merged: theagenticguy merged 2 commits into main from feat/variance-probe on Jun 30, 2026
@theagenticguy

Owner

What

Implements spec 010 (Move 2): codehub code-pack --variance-probe <task-file>. The probe is the empirical instrument behind the decision-equivalence contract (Move 6): if the OCH pack genuinely pins a coding agent's retrieval decision, the agent's answer wanders less run-to-run. The probe turns that claim into a number.

Given a task — a fixed triple (repo @ commit, instruction, success_oracle) — it runs an agent N times (default 10) in each of two arms (with-pack / without-pack), holding commit/instruction/agent/model fixed, and reports the dispersion delta + token overhead. Grounded against arXiv:2606.26979 ("deterministic anchoring halves run-to-run variance at ~10% more tokens").
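As a rough illustration of what "dispersion" could mean per arm, the sketch below computes one metric for each oracle type named in this PR, plus the delta between arms. The function names and exact formulas (for example, population rather than sample standard deviation for judge scores) are assumptions for illustration, not the package's actual API.

```typescript
// Hypothetical helpers: names and formulas are assumptions, not @opencodehub/eval's API.

/** output_hash oracle: distinct outputs divided by runs (1/n = fully stable, 1 = every run differs). */
function distinctOutputRatio(outputs: string[]): number {
  if (outputs.length === 0) throw new Error("need at least one run");
  return new Set(outputs).size / outputs.length;
}

/** assertion oracle: standard deviation of a Bernoulli pass/fail outcome, sqrt(p * (1 - p)). */
function passRateStddev(passed: boolean[]): number {
  if (passed.length === 0) throw new Error("need at least one run");
  const p = passed.filter((ok) => ok).length / passed.length;
  return Math.sqrt(p * (1 - p));
}

/** judge oracle: population standard deviation of the judge scores (assumed; could be sample stddev). */
function scoreStddev(scores: number[]): number {
  if (scores.length === 0) throw new Error("need at least one run");
  const mean = scores.reduce((acc, s) => acc + s, 0) / scores.length;
  const variance =
    scores.reduce((acc, s) => acc + (s - mean) * (s - mean), 0) / scores.length;
  return Math.sqrt(variance);
}

/** Dispersion delta between arms: positive means the with-pack arm wandered less. */
function dispersionDelta(withoutPack: number, withPack: number): number {
  return withoutPack - withPack;
}
```

Under these definitions, a with-pack arm whose ten runs all hash identically has a distinct-output ratio of 0.1, the floor for N = 10.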

New package: @opencodehub/eval

The package was a stub (Python harness extracted to opencodehub-testbed so the published set ships free of test-time deps). This is its first TS content, and it honors that intent: private, dep-light (pure JS + node:child_process + zod/yaml, no Python, no heavy deps), and force-bundled into the CLI tarball via tsup noExternal (verified: 0 surviving external imports in dist/index.js).

  • task loader: YAML/JSON + Zod, strict, fail-fast (an expensive experiment never starts on a bad task)
  • dispersion stats: distinct-output ratio / Bernoulli pass-rate stddev / judge-score stddev; pure, exhaustively unit-covered
  • oracle scoring: output_hash (zero-config) | assertion (default, objective) | judge (LLM-panel, interface-ready)
  • AgentRunner interface + the v1 direct-CLI runner
  • deterministic report: canonicalJson, no clock/run-id (R6); two probe runs over the same captured outcomes serialize byte-identically
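The byte-identical property in the last bullet usually comes down to serializing with a fixed key order and keeping timestamps and run ids out of the payload. The sketch below shows one way that could look. It reuses the name canonicalJson from this PR, but the implementation is an assumption and not the package's code.

```typescript
// Sketch only: the real canonicalJson in @opencodehub/eval may differ.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

/** Serialize with object keys sorted recursively, so insertion order never reaches the bytes. */
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    // Array order is meaningful, so it is preserved as-is.
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const members = keys.map(
    (key) => JSON.stringify(key) + ":" + canonicalJson(value[key] as Json),
  );
  return "{" + members.join(",") + "}";
}
```

Two reports built from the same captured outcomes then compare equal as strings regardless of the order in which their fields were assembled.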

Bedrock inference (hard constraint, spec 010 §4a)

Both agents route inference through Amazon Bedrock, grounded against current docs (not recalled):

  • Claude Code → CLAUDE_CODE_USE_BEDROCK=1 + AWS_REGION + us.-prefixed ANTHROPIC_MODEL inference profile; claude -p "<prompt>" --output-format json --model <id>; reads .result / .usage / .total_cost_usd.
  • Codex → first-party amazon-bedrock provider; codex exec --json -c model_provider=amazon-bedrock -m <model> --skip-git-repo-check "<prompt>"; final answer = last agent_message, tokens = last turn.completed.usage.
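The two invocations above can be pictured as one function that returns the command, argument vector, and environment for a single run. The sketch uses only the flags and variables quoted in this PR. The buildInvocation function, its types, and the choice to pass the region to Codex via AWS_REGION are hypothetical.

```typescript
// Hypothetical builder: flags come from this PR, the function shape does not.
type Harness = "claude" | "codex";

interface Invocation {
  command: string;
  args: string[];
  env: Record<string, string>;
}

function buildInvocation(
  harness: Harness,
  prompt: string,
  model: string,
  awsRegion: string,
): Invocation {
  if (harness === "claude") {
    return {
      command: "claude",
      args: ["-p", prompt, "--output-format", "json", "--model", model],
      env: {
        CLAUDE_CODE_USE_BEDROCK: "1",
        AWS_REGION: awsRegion,
        ANTHROPIC_MODEL: model, // expected to be a us.-prefixed inference profile
      },
    };
  }
  return {
    command: "codex",
    args: [
      "exec", "--json",
      "-c", "model_provider=amazon-bedrock",
      "-m", model,
      "--skip-git-repo-check",
      prompt,
    ],
    // Assumption: the region reaches Codex through AWS_REGION as well.
    env: { AWS_REGION: awsRegion },
  };
}
```

Keeping the prompt as a single array element means it needs no shell quoting when the invocation is handed to a process spawner such as node:child_process.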

Notably, neither CLI exposes temperature/seed — which is consistent with the design: within-arm sampling nondeterminism is the variance being measured.

CLI

codehub code-pack --variance-probe <task-file> [--runs N] [--harness claude|codex] [--aws-region R] [--model ID] [--json]. The command generates the pack once, assembles it into packContext, and runs the experiment. On-demand only — never a CI gate (§8). Example task at packages/eval/examples/variance-task.yaml.
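The flow described here (build the pack context once, then run both arms with everything else held fixed) can be sketched as below. The Runner type stands in for the AgentRunner interface, whose real shape is not shown in this PR. It is synchronous for brevity, and prepending packContext to the instruction is an assumption about how the pack enters the agent's context.

```typescript
// Hypothetical sketch of the with/without experiment; not the package's implementation.
interface RunOutcome {
  output: string;
  tokens: number;
}

/** Stand-in for AgentRunner; a real runner would be asynchronous. */
type Runner = (prompt: string) => RunOutcome;

interface ArmSummary {
  distinctOutputRatio: number;
  meanTokens: number;
}

function runArm(runner: Runner, prompt: string, runs: number): ArmSummary {
  const outcomes: RunOutcome[] = [];
  for (let i = 0; i < runs; i++) {
    outcomes.push(runner(prompt)); // same prompt every time; only sampling varies
  }
  const distinct = new Set(outcomes.map((o) => o.output)).size;
  const totalTokens = outcomes.reduce((acc, o) => acc + o.tokens, 0);
  return { distinctOutputRatio: distinct / runs, meanTokens: totalTokens / runs };
}

function varianceProbe(
  runner: Runner,
  instruction: string,
  packContext: string,
  runs: number = 10,
) {
  const withPack = runArm(runner, packContext + "\n\n" + instruction, runs);
  const withoutPack = runArm(runner, instruction, runs);
  return {
    withPack,
    withoutPack,
    // Positive delta: the pack reduced run-to-run dispersion.
    dispersionDelta: withoutPack.distinctOutputRatio - withPack.distinctOutputRatio,
    // Relative extra tokens spent in the with-pack arm (0.1 = 10% more).
    tokenOverhead: withPack.meanTokens / withoutPack.meanTokens - 1,
  };
}
```

Injecting the runner keeps the loop testable with a fake agent, which fits the on-demand framing: the expensive real runs happen only when someone invokes the probe.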

Scope / deferrals

  • omnigent-backed multi-agent runner → v2, behind the same AgentRunner interface (CLI-first, per the approved spec).
  • The replay / decision-equivalence structural check is the other half of Move 6 (spec 011, separate).

Validation

  • biome ci . ✓ (709 files, 0 errors)
  • tsc -b full workspace ✓
  • full build ✓ — @opencodehub/eval inlined into the CLI bundle (0 external imports)
  • full test suite ✓ (all 18 packages, 0 failures; +69 new eval tests, +3 new CLI command tests)
  • pre-commit (biome, commitlint, banned-strings) + pre-push (verdict, typecheck, test) hooks ✓

Adds eval to the commitlint scope-enum (new workspace package).

🤖 Generated with Claude Code

Move 2 spec, built on the Move 6 ruling (contract pivots byte-identity to
decision-equivalence). Defines "task" as a fixed (repo@commit, instruction,
success_oracle) triple; three oracle types (output-hash / assertion / judge)
with a precise per-arm dispersion metric each; the with/without experimental
design with token-overhead as a first-class output; and an AgentRunner
interface with an omnigent-backed default (grounded: alpha, 5.5k stars,
Apache-2.0, drives Claude Code + Codex from one harness) plus a
dependency-light direct-CLI fallback. Lands in the packages/eval stub. 7 EARS
reqs + 5 open questions for review. NO implementation — review gate first.
…removes

Implements spec 010 (Move 2): the empirical instrument behind the
decision-equivalence contract (Move 6). Given a task triple
(repo @ commit, instruction, success_oracle), the probe runs a coding
agent N times (default 10) per arm — with vs. without the OCH pack in
context — and reports the run-to-run dispersion delta plus token
overhead.

@opencodehub/eval (new, private, dep-light — pure JS + node:child_process,
honoring the package's "ships free of test-time deps" intent; force-bundled
into the CLI tarball via tsup noExternal):
  - task loader (YAML/JSON + Zod, strict, fail-fast)
  - dispersion stats (distinct-output ratio / Bernoulli pass-rate stddev /
    judge-score stddev) — pure, exhaustively unit-covered
  - oracle scoring (output_hash | assertion | judge)
  - AgentRunner interface + the v1 direct-CLI runner
  - deterministic report (canonicalJson, no clock/run-id — R6)

Direct-CLI runner routes BOTH agents' inference through Amazon Bedrock
(spec 010 §4a, grounded against current docs, not recalled):
  - Claude Code: CLAUDE_CODE_USE_BEDROCK=1 + us.-prefixed ANTHROPIC_MODEL
    inference profile; claude -p ... --output-format json
  - Codex: codex exec --json -c model_provider=amazon-bedrock -m ...

CLI: codehub code-pack --variance-probe <task-file> [--runs N]
[--harness claude|codex] [--aws-region R] [--model ID] [--json]. The
command generates the pack once, assembles it into packContext, and runs
the with/without experiment. On-demand only — never a CI gate (§8).

omnigent-backed multi-agent runner deferred to v2 behind the same
interface (CLI-first, per the approved spec).

Adds `eval` to the commitlint scope-enum (new workspace package).
@theagenticguy theagenticguy merged commit 278702a into main Jun 30, 2026
38 checks passed
@theagenticguy theagenticguy deleted the feat/variance-probe branch June 30, 2026 04:33
@github-actions github-actions Bot mentioned this pull request Jun 30, 2026
theagenticguy pushed a commit that referenced this pull request Jun 30, 2026
🤖 Automated release via release-please
---


<details><summary>root: 0.10.5</summary>

## [0.10.5](root-v0.10.4...root-v0.10.5) (2026-06-30)


### Features

* **eval:** pack --variance-probe — measure the variance an OCH pack
removes (Move 2)
([#269](#269))
([278702a](278702a))
* **frameworks:** wire stage-5 import/SCIP detection into the profile
phase ([#267](#267))
([6b4d122](6b4d122))
* **pack:** codehub replay — decision-equivalence structural check (Move
6) ([#270](#270))
([f97b417](f97b417))
</details>

<details><summary>cli: 0.10.5</summary>

## [0.10.5](cli-v0.10.4...cli-v0.10.5) (2026-06-30)


### Features

* **eval:** pack --variance-probe — measure the variance an OCH pack
removes (Move 2)
([#269](#269))
([278702a](278702a))
* **pack:** codehub replay — decision-equivalence structural check (Move
6) ([#270](#270))
([f97b417](f97b417))
</details>

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>