From cd61e5e02fb91431b9941b26f835ba833fb7006a Mon Sep 17 00:00:00 2001
From: Laith Al-Saadoon <alsaadoonlaith@gmail.com>
Date: Tue, 30 Jun 2026 03:49:39 +0000
Subject: [PATCH 1/2] =?UTF-8?q?docs:=20spec=20010=20=E2=80=94=20pack=20--v?=
 =?UTF-8?q?ariance-probe=20(draft=20for=20review)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Move 2 spec, built on the Move 6 ruling (contract pivots byte-identity to
decision-equivalence). Defines "task" as a fixed (repo@commit, instruction,
success_oracle) triple; three oracle types (output-hash / assertion / judge)
with a precise per-arm dispersion metric each; the with/without experimental
design with token-overhead as a first-class output; and an AgentRunner
interface with an omnigent-backed default (grounded: alpha, 5.5k stars,
Apache-2.0, drives Claude Code + Codex from one harness) plus a
dependency-light direct-CLI fallback. Lands in the packages/eval stub. 7 EARS
reqs + 5 open questions for review. NO implementation — review gate first.
---
 .erpaval/specs/010-variance-probe/spec.md | 114 ++++++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100644 .erpaval/specs/010-variance-probe/spec.md

diff --git a/.erpaval/specs/010-variance-probe/spec.md b/.erpaval/specs/010-variance-probe/spec.md
new file mode 100644
index 0000000..68c03c0
--- /dev/null
+++ b/.erpaval/specs/010-variance-probe/spec.md
@@ -0,0 +1,114 @@
+# Spec 010 — `pack --variance-probe`: measure the variance an OCH pack removes
+
+**Status:** Draft for review (NO code yet — Laith wants to understand deeply first)
+**Author:** Bonk + Laith · **Date:** 2026-06-30
+**Branch:** `spec/010-variance-probe` (off `main` @ `6b4d122`)
+**Roadmap origin:** M-W-F run 2026-06-29, Move 2 (HIGH). Depends on the Move 6 ruling (below).
+**Grounding validator:** arXiv:2606.26979 — deterministic anchoring "roughly halves run-to-run variance at ~10% more tokens."
+
+---
+
+## 0. The Move 6 ruling this spec is built on (decision-equivalence)
+
+Laith ruled (2026-06-30): **pivot the contract from byte-identity to decision-equivalence.** This reframes what `--variance-probe` measures, so it's stated up front.
+
+- **Old contract (struck):** "same `(commit, tokenizer, budget, pins)` ⇒ byte-identical pack." Brittle — the #252 embedder swap (F2LLM-v2-80M, 320-dim) breaks byte-identity while the retrieval decision is unchanged. An auditor doesn't care about bytes; they care whether the agent saw the right thing.
+- **New contract:** "same inputs ⇒ provably the **same retrieval decision set** (same files + byte ranges selected under the same budget)." Byte-identity becomes one cheap *witness* of decision-equivalence, not the contract itself.
+- **Why this matters for Move 2:** the variance probe is *how you measure the new contract holding*. If the pack genuinely pins the agent's decision, run-to-run answer variance drops. The probe turns "decision-equivalence" from an assertion into a number. So Move 6 (the contract) and Move 2 (the measurement) are one story: **Move 2 is the instrument that proves Move 6's claim.**
+- Move 6 also implies a sibling `codehub replay` that asserts decision-equivalence structurally (same selection, not same bytes) — specced separately (011); 010 is the empirical/behavioral half.
+
+---
+
+## 1. Diagnosis — why this move, why now
+
+OCH's pitch has leaned on the adjective "deterministic." That word is now contested vocabulary (LeanCTX "token receipt", Rel(AI)Build, the receipt-tool cluster all claim it). A **measured variance delta** is a number rivals can't claim by relabeling. arXiv:2606.26979 (Jun 2026) gives the citable result — deterministic anchoring halves agent run-to-run variance — and `--variance-probe` is how OCH demonstrates *its own* pack does this on a real repo, not just cites a paper.
+
+The headline this earns: **"an OCH pack halves how much a coding agent's answer wanders run-to-run, at ~10% token cost"** — backed by a reproducible measurement the user can run.
+
+## 2. The hard part: what is a "task"? (Laith's first question)
+
+A "task" must be precise enough to run repeatedly and score. Definition:
+
+> A **task** is a fixed triple `(repo @ commit, instruction, success_oracle)` run by a coding agent, where the agent's only variable input across the experiment is **whether the OCH pack is in its context**.
+
+- **`repo @ commit`** — a pinned checkout. Frozen so the *only* variable is the pack.
+- **`instruction`** — a natural-language ask given verbatim to the agent every run (e.g. "Add a `--json` flag to `codehub status` that prints the staleness record"). Stored as a string in the task file; never paraphrased between runs.
+- **`success_oracle`** — how a run is scored. Three oracle types, in increasing cost:
+  1. **`output_hash`** (cheapest, no scoring agent) — variance = how often the agent's *final answer text / produced diff* differs across N runs. This measures raw output dispersion. Good for "did the agent converge".
+  2. **`assertion`** — a deterministic check the run either passes or fails (a test command exit code, a grep for a required symbol, a file-exists check). Variance = pass-rate dispersion (e.g. 6/10 with pack vs 3/10 without). This is the most defensible "variance" because it's objective.
+  3. **`judge`** (most expensive) — an LLM-judge panel scores each run's answer 0–1 on a rubric; variance = stddev of the scores. Reserved for tasks with no mechanical oracle.
+
+**What "variance" means precisely:** for a task run N times in each arm (with-pack / without-pack), variance is a per-arm dispersion statistic over the N outcomes:
+- `output_hash` oracle → **distinct-output ratio** = `(# distinct outputs) / N` (1.0 = every run different, 1/N = perfectly stable).
+- `assertion` oracle → **failure-rate stddev** across N, or simply pass-rate (a stabler pass-rate IS lower behavioral variance).
+- `judge` oracle → **stddev of rubric scores** across N.
+
+The probe reports each arm's dispersion and the **delta** (without − with). The Move-2 claim holds when the with-pack arm's dispersion is materially lower.
+
+## 3. The experiment design (with / without)
+
+```
+for arm in [without_pack, with_pack]:
+  for i in 1..N:                      # N = --runs, default 10
+    fresh agent session (no carryover)
+    context = instruction
+            + (arm == with_pack ? the OCH code-pack for repo@commit : nothing)
+    run agent → capture (final_text, diff, oracle_result, tokens)
+  dispersion[arm] = oracleDispersion(results[arm])
+report { without: dispersion[without_pack], with: dispersion[with_pack],
+         delta, tokenOverhead: tokens[with]/tokens[without] }
+```
+
+Controls that make the number honest:
+- **Same instruction, same commit, same agent, same model** across both arms — the pack is the only manipulated variable.
+- **Fresh session per run** — no conversational carryover contaminating variance.
+- **Token overhead reported alongside** — the paper's claim is "halves variance at ~10% more tokens"; a probe that halves variance at 3× tokens is a different (worse) story, so cost is a first-class output, not a footnote.
+- **Determinism of the probe itself** — temperature/seed are pinned where the harness allows; where it doesn't (most agents are nondeterministic by design), that nondeterminism IS the variance being measured, so it's left free *within* an arm but identical *between* arms.
+
+## 4. Harness: how do we actually run the agent? (Laith's omnigent suggestion)
+
+Grounded against the real repo (`omnigent-ai/omnigent`, 5,494★, Apache-2.0, Python 3.12, pushed 2026-06-30, **status: alpha**):
+
+Omnigent is a **meta-harness** — one orchestration layer over Claude Code, Codex, Cursor, OpenCode, Kimi, and custom agents. An "agent" is a YAML (`executor.harness` + `prompt`); you run it with `omnigent run <agent.yaml> --harness <claude-sdk|codex|cursor|…>`, and a `sdks/python-client` drives it programmatically. **This is exactly the shape the probe needs**: one task definition, swap the harness flag, run headless, capture output.
+
+**Why omnigent is the right call for this probe:**
+- The variance story is far stronger if it holds **across agents** (Claude Code AND Codex), not just one — "the pack halves variance regardless of which agent reads it" is the defensible, agent-neutral claim. Omnigent is the only thing that drives both from one interface without per-agent glue.
+- It already solves the headless-run + output-capture + sandbox problem the probe would otherwise hand-roll.
+
+**The tradeoffs I'm not hiding:**
+- **Alpha.** Pinning a specific commit/release is mandatory; its API may move. The probe must tolerate that (thin adapter, see §5).
+- **Heavy dependency.** It pulls a server + sandbox providers. The probe should treat omnigent as an **optional, pluggable runner behind an interface**, NOT a hard dependency of `@opencodehub/eval`. Default to it; allow a simpler direct-CLI runner.
+- **Credentials live in each harness's own login** (`claude` / `codex` CLI auth), not OCH config — fine for a local probe, a documented prerequisite.
+
+**Recommendation:** define a small `AgentRunner` interface (`run(task, withPack) → {finalText, diff, tokens}`); ship an **omnigent-backed runner** as the default multi-agent implementation and a **direct-CLI runner** (shell out to `claude -p` / `codex exec`) as the dependency-light fallback. The probe logic is harness-agnostic; the runner is swappable. This keeps an alpha dep from being load-bearing while getting the cross-agent story omnigent uniquely enables.
+
+## 5. Where it lives + shape
+
+- **`packages/eval`** — currently a stub (README only). This is its first real content: `@opencodehub/eval` gains the variance-probe core (task loading, the experiment loop, dispersion stats, the `AgentRunner` interface + the two runner impls).
+- **CLI surface:** `codehub pack --variance-probe <task-file> [--runs N] [--harness <h>] [--runner omnigent|cli]` → prints the per-arm dispersion + delta + token overhead; `--json` emits the full record.
+- **Task file:** a small YAML/JSON: `{ repo, commit, instruction, oracle: {type, ...}, harness? }`.
+- **Determinism of the probe's own output:** the *report* is a pure function of the captured run outcomes; no clock/run-id in the emitted record (same discipline as the context-bom), so two probe runs over the same captured outcomes serialize identically. (The agent runs themselves are nondeterministic by nature — that's the point.)
+
+## 6. EARS requirements (draft — for review, not yet final)
+
+- **R1** WHEN given a task file `(repo, commit, instruction, oracle)`, the probe SHALL run the agent N times (default 10) in each of the with-pack and without-pack arms, holding commit/instruction/agent/model fixed.
+- **R2** The probe SHALL compute a per-arm dispersion statistic appropriate to the oracle type (distinct-output ratio / pass-rate stddev / judge-score stddev) and report the without−with delta.
+- **R3** The probe SHALL report token overhead (with-pack tokens / without-pack tokens) alongside the variance delta — variance reduction is only meaningful against its cost.
+- **R4** The agent runner SHALL be an interface with at least two implementations: an omnigent-backed multi-agent runner (default) and a direct-CLI runner (no omnigent dependency).
+- **R5** WHERE omnigent (alpha) is unavailable or its API has drifted, the probe SHALL fall back to / be usable via the direct-CLI runner without code change to the probe core.
+- **R6** The emitted `--json` report SHALL be a pure function of the captured run outcomes (no wall-clock/run-id), so the report serialization is reproducible given the same outcomes.
+- **R7** The probe SHALL pin the omnigent version it was validated against and surface a clear error if the installed version mismatches.
+
+## 7. Open questions for Laith (review these — don't want to build past them)
+
+1. **Default oracle.** I lean `assertion` (objective, defensible) as the documented default, with `output_hash` as the zero-config quick look and `judge` opt-in. Agree, or do you want `judge` front-and-center for the marketing number?
+2. **N (runs per arm).** 10 is the smallest N that gives a believable dispersion; 20+ is more defensible but doubles agent cost (real $ on Claude/Codex). Default 10, configurable?
+3. **Which agents for the headline.** Claude Code + Codex both (the agent-neutral claim) — or is one enough for v1 and the second a follow-up?
+4. **Token-overhead guardrail.** Should the probe *flag* (not fail) when variance drops but token overhead exceeds, say, 1.3× — i.e. "you bought stability expensively"? I think yes; it keeps the claim honest.
+5. **Omnigent now, or CLI-runner first?** Given omnigent is alpha, a defensible v1 is: ship the probe core + the direct-CLI runner first (fast, dependency-light, proves the method), add the omnigent runner second (unlocks the cross-agent story). Or do you want omnigent in v1 because the multi-agent claim IS the point?
+
+## 8. What this is NOT (scope guard)
+
+- Not a SWE-bench-style correctness benchmark — it measures *dispersion*, not capability. (SWE-Explore / SWE-bench remain the publish-against targets, separate.)
+- Not the `replay`/decision-equivalence structural check (spec 011, the other half of Move 6).
+- Not a CI gate — it's an on-demand measurement (agent runs cost real money + minutes).

From 38820812eb4060f9f0aa14bf5ed4baaaa76e68cf Mon Sep 17 00:00:00 2001
From: Laith Al-Saadoon <alsaadoonlaith@gmail.com>
Date: Tue, 30 Jun 2026 04:26:48 +0000
Subject: [PATCH 2/2] =?UTF-8?q?feat(eval):=20pack=20--variance-probe=20?=
 =?UTF-8?q?=E2=80=94=20measure=20the=20variance=20an=20OCH=20pack=20remove?=
 =?UTF-8?q?s?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implements spec 010 (Move 2): the empirical instrument behind the
decision-equivalence contract (Move 6). Given a task triple
(repo @ commit, instruction, success_oracle), the probe runs a coding
agent N times (default 10) per arm — with vs. without the OCH pack in
context — and reports the run-to-run dispersion delta plus token
overhead.

@opencodehub/eval (new, private, dep-light — pure JS + node:child_process,
honoring the package's "ships free of test-time deps" intent; force-bundled
into the CLI tarball via tsup noExternal):
  - task loader (YAML/JSON + Zod, strict, fail-fast)
  - dispersion stats (distinct-output ratio / Bernoulli pass-rate stddev /
    judge-score stddev) — pure, exhaustively unit-covered
  - oracle scoring (output_hash | assertion | judge)
  - AgentRunner interface + the v1 direct-CLI runner
  - deterministic report (canonicalJson, no clock/run-id — R6)

Direct-CLI runner routes BOTH agents' inference through Amazon Bedrock
(spec 010 §4a, grounded against current docs, not recalled):
  - Claude Code: CLAUDE_CODE_USE_BEDROCK=1 + us.-prefixed ANTHROPIC_MODEL
    inference profile; claude -p ... --output-format json
  - Codex: codex exec --json -c model_provider=amazon-bedrock -m ...

CLI: codehub code-pack --variance-probe <task-file> [--runs N]
[--harness claude|codex] [--aws-region R] [--model ID] [--json]. The
command generates the pack once, assembles it into packContext, and runs
the with/without experiment. On-demand only — never a CI gate (§8).

omnigent-backed multi-agent runner deferred to v2 behind the same
interface (CLI-first, per the approved spec).

Adds `eval` to the commitlint scope-enum (new workspace package).
---
 .erpaval/specs/010-variance-probe/spec.md     |  42 ++-
 commitlint.config.mjs                         |   1 +
 packages/cli/package.json                     |   3 +-
 .../cli/src/commands/variance-probe.test.ts   | 130 ++++++++
 packages/cli/src/commands/variance-probe.ts   | 134 ++++++++
 packages/cli/src/index.ts                     |  56 +++-
 packages/cli/tsconfig.json                    |   1 +
 packages/eval/README.md                       |  54 +++-
 packages/eval/examples/variance-task.yaml     |  34 ++
 packages/eval/package.json                    |  67 ++++
 packages/eval/src/cli-runner.test.ts          | 231 ++++++++++++++
 packages/eval/src/cli-runner.ts               | 292 ++++++++++++++++++
 packages/eval/src/dispersion.test.ts          | 120 +++++++
 packages/eval/src/dispersion.ts               |  99 ++++++
 packages/eval/src/index.ts                    |  66 ++++
 packages/eval/src/oracle.test.ts              | 150 +++++++++
 packages/eval/src/oracle.ts                   | 139 +++++++++
 packages/eval/src/probe.test.ts               | 127 ++++++++
 packages/eval/src/probe.ts                    | 153 +++++++++
 packages/eval/src/report.test.ts              | 121 ++++++++
 packages/eval/src/report.ts                   | 145 +++++++++
 packages/eval/src/runner.ts                   |  86 ++++++
 packages/eval/src/task.test.ts                | 116 +++++++
 packages/eval/src/task.ts                     | 160 ++++++++++
 packages/eval/tsconfig.json                   |  10 +
 pnpm-lock.yaml                                |  30 +-
 tsconfig.json                                 |   1 +
 27 files changed, 2551 insertions(+), 17 deletions(-)
 create mode 100644 packages/cli/src/commands/variance-probe.test.ts
 create mode 100644 packages/cli/src/commands/variance-probe.ts
 create mode 100644 packages/eval/examples/variance-task.yaml
 create mode 100644 packages/eval/package.json
 create mode 100644 packages/eval/src/cli-runner.test.ts
 create mode 100644 packages/eval/src/cli-runner.ts
 create mode 100644 packages/eval/src/dispersion.test.ts
 create mode 100644 packages/eval/src/dispersion.ts
 create mode 100644 packages/eval/src/index.ts
 create mode 100644 packages/eval/src/oracle.test.ts
 create mode 100644 packages/eval/src/oracle.ts
 create mode 100644 packages/eval/src/probe.test.ts
 create mode 100644 packages/eval/src/probe.ts
 create mode 100644 packages/eval/src/report.test.ts
 create mode 100644 packages/eval/src/report.ts
 create mode 100644 packages/eval/src/runner.ts
 create mode 100644 packages/eval/src/task.test.ts
 create mode 100644 packages/eval/src/task.ts
 create mode 100644 packages/eval/tsconfig.json

diff --git a/.erpaval/specs/010-variance-probe/spec.md b/.erpaval/specs/010-variance-probe/spec.md
index 68c03c0..8bed539 100644
--- a/.erpaval/specs/010-variance-probe/spec.md
+++ b/.erpaval/specs/010-variance-probe/spec.md
@@ -1,11 +1,15 @@
 # Spec 010 — `pack --variance-probe`: measure the variance an OCH pack removes
 
-**Status:** Draft for review (NO code yet — Laith wants to understand deeply first)
+**Status:** Approved — building (CLI-runner-first; see §7 locked answers). 
 **Author:** Bonk + Laith · **Date:** 2026-06-30
 **Branch:** `spec/010-variance-probe` (off `main` @ `6b4d122`)
 **Roadmap origin:** M-W-F run 2026-06-29, Move 2 (HIGH). Depends on the Move 6 ruling (below).
 **Grounding validator:** arXiv:2606.26979 — deterministic anchoring "roughly halves run-to-run variance at ~10% more tokens."
 
+**Decision log (2026-06-30, Laith):** "all looks good. agree on 5. cli first. just remember for claude code and codex to use bedrock for inference. go forth." → §7 answers locked; v1 ships the direct-CLI runner; omnigent runner deferred to v2; **Claude Code + Codex inference MUST route through Amazon Bedrock** (new §4a hard constraint).
+
+**CLI-surface note:** the OCH code-pack command in this repo is `codehub code-pack` (the bare `codehub pack` is the legacy repomix snapshot). The probe attaches as `codehub code-pack --variance-probe <task-file>`. The spec text says `pack --variance-probe` as shorthand; the implementation wires onto `code-pack`.
+
 ---
 
 ## 0. The Move 6 ruling this spec is built on (decision-equivalence)
@@ -82,6 +86,27 @@ Omnigent is a **meta-harness** — one orchestration layer over Claude Code, Cod
 
 **Recommendation:** define a small `AgentRunner` interface (`run(task, withPack) → {finalText, diff, tokens}`); ship an **omnigent-backed runner** as the default multi-agent implementation and a **direct-CLI runner** (shell out to `claude -p` / `codex exec`) as the dependency-light fallback. The probe logic is harness-agnostic; the runner is swappable. This keeps an alpha dep from being load-bearing while getting the cross-agent story omnigent uniquely enables.
 
+**v1 decision (Laith, 2026-06-30):** ship the **direct-CLI runner first**. omnigent (alpha) is deferred to v2. The `AgentRunner` interface is defined now so the omnigent runner drops in later without touching the probe core.
+
+## 4a. Bedrock inference constraint (hard requirement, Laith 2026-06-30)
+
+Both agents the probe drives **must run inference on Amazon Bedrock**, not the Anthropic/OpenAI first-party APIs. This is grounded against current docs (`code.claude.com/docs`, `developers.openai.com/codex`), not recalled.
+
+**Claude Code → Bedrock** — the runner sets these in the child process env before `claude -p`:
+- `CLAUDE_CODE_USE_BEDROCK=1` — route Claude Code through Bedrock.
+- `AWS_REGION` — e.g. `us-east-1`; credentials resolve via the default AWS SDK chain (env keys / `AWS_PROFILE` / SSO / `AWS_BEARER_TOKEN_BEDROCK`). The probe inherits the operator's AWS env; it does not manage creds.
+- `ANTHROPIC_MODEL` — a **cross-region inference-profile ID** (the `us.` prefix is required for on-demand throughput), e.g. `us.anthropic.claude-sonnet-4-6`. Configurable per task/run; defaults to a sonnet inference profile.
+- Headless invocation: `claude -p "<prompt>" --output-format json --model <inference-profile-id>`.
+- Token usage read from the single JSON result object: `.usage.input_tokens`, `.usage.output_tokens`, `.total_cost_usd`; final answer at `.result`.
+- **No temperature/seed flag exists** in the Claude Code CLI (verified). This is *consistent with* the probe design (§3): within-arm sampling nondeterminism is the variance being measured, held free within an arm and identical between arms.
+
+**Codex → Bedrock** — Codex ships a first-party `amazon-bedrock` provider (OpenAI-compatible "Bedrock Mantle" surface, AWS-native SigV4 / bearer-token auth). The runner invokes:
+- `codex exec --json -c model_provider=amazon-bedrock -m <model> --skip-git-repo-check "<prompt>"` (model e.g. `openai.gpt-5.5`).
+- Auth: `AWS_BEARER_TOKEN_BEDROCK` + `AWS_REGION`, or the AWS SDK credential chain (`AWS_PROFILE` / env keys). Commercial AWS regions only (GovCloud unsupported).
+- Output is JSONL: final answer = last `item.completed` of `type:"agent_message"` (its `.text`); token usage = last `turn.completed.usage` = `{input_tokens, cached_input_tokens, output_tokens, reasoning_output_tokens}` (no `total` field — the runner sums them).
+
+The runner centralizes this Bedrock env/flag wiring per agent so the probe core stays inference-backend-agnostic; a future omnigent runner sets the same env on whichever harness it spawns.
+
 ## 5. Where it lives + shape
 
 - **`packages/eval`** — currently a stub (README only). This is its first real content: `@opencodehub/eval` gains the variance-probe core (task loading, the experiment loop, dispersion stats, the `AgentRunner` interface + the two runner impls).
@@ -97,15 +122,16 @@ Omnigent is a **meta-harness** — one orchestration layer over Claude Code, Cod
 - **R4** The agent runner SHALL be an interface with at least two implementations: an omnigent-backed multi-agent runner (default) and a direct-CLI runner (no omnigent dependency).
 - **R5** WHERE omnigent (alpha) is unavailable or its API has drifted, the probe SHALL fall back to / be usable via the direct-CLI runner without code change to the probe core.
 - **R6** The emitted `--json` report SHALL be a pure function of the captured run outcomes (no wall-clock/run-id), so the report serialization is reproducible given the same outcomes.
-- **R7** The probe SHALL pin the omnigent version it was validated against and surface a clear error if the installed version mismatches.
+- **R7** The probe SHALL pin the omnigent version it was validated against and surface a clear error if the installed version mismatches. *(Deferred with the omnigent runner to v2; the version-pin guard ships when that runner does.)*
+- **R8** The direct-CLI runner SHALL route both agents' inference through Amazon Bedrock (§4a): for Claude Code it SHALL set `CLAUDE_CODE_USE_BEDROCK=1`, `AWS_REGION`, and a `us.`-prefixed inference-profile `ANTHROPIC_MODEL`; for Codex it SHALL pass `-c model_provider=amazon-bedrock`. The runner SHALL surface a clear, actionable error when the agent binary is absent or Bedrock auth/region is unconfigured, rather than silently falling back to a first-party API.
 
-## 7. Open questions for Laith (review these — don't want to build past them)
+## 7. Decisions (locked 2026-06-30 — answers to the prior open questions)
 
-1. **Default oracle.** I lean `assertion` (objective, defensible) as the documented default, with `output_hash` as the zero-config quick look and `judge` opt-in. Agree, or do you want `judge` front-and-center for the marketing number?
-2. **N (runs per arm).** 10 is the smallest N that gives a believable dispersion; 20+ is more defensible but doubles agent cost (real $ on Claude/Codex). Default 10, configurable?
-3. **Which agents for the headline.** Claude Code + Codex both (the agent-neutral claim) — or is one enough for v1 and the second a follow-up?
-4. **Token-overhead guardrail.** Should the probe *flag* (not fail) when variance drops but token overhead exceeds, say, 1.3× — i.e. "you bought stability expensively"? I think yes; it keeps the claim honest.
-5. **Omnigent now, or CLI-runner first?** Given omnigent is alpha, a defensible v1 is: ship the probe core + the direct-CLI runner first (fast, dependency-light, proves the method), add the omnigent runner second (unlocks the cross-agent story). Or do you want omnigent in v1 because the multi-agent claim IS the point?
+1. **Default oracle → `assertion`.** Objective and defensible. `output_hash` is the zero-config quick look; `judge` is opt-in for tasks with no mechanical oracle. The headline number is the assertion pass-rate dispersion delta.
+2. **N (runs per arm) → 10, configurable** via `--runs`. Smallest N that gives a believable dispersion; raise for a more defensible publish at linear agent-cost.
+3. **Agents → Claude Code AND Codex.** The agent-neutral claim ("the pack halves variance regardless of which agent reads it") is the point. v1 ships both via the direct-CLI runner; `--harness` selects one or the default runs the configured set.
+4. **Token-overhead guardrail → yes, flag (never fail).** When variance drops but token overhead exceeds ~1.3×, the report flags "stability bought expensively." Keeps the claim honest; the threshold is a reported constant, not a gate.
+5. **Omnigent vs CLI-first → CLI-runner first.** v1 = probe core + direct-CLI runner (fast, dependency-light, proves the method, Bedrock-wired per §4a). omnigent runner is v2, dropping in behind the same `AgentRunner` interface.
 
 ## 8. What this is NOT (scope guard)
 
diff --git a/commitlint.config.mjs b/commitlint.config.mjs
index e43175c..29dbda1 100644
--- a/commitlint.config.mjs
+++ b/commitlint.config.mjs
@@ -42,6 +42,7 @@ export default {
         "cobol-proleap",
         "core-types",
         "embedder",
+        "eval",
         "frameworks",
         "ingestion",
         "mcp",
diff --git a/packages/cli/package.json b/packages/cli/package.json
index a2fdaed..c08c503 100644
--- a/packages/cli/package.json
+++ b/packages/cli/package.json
@@ -39,7 +39,7 @@
     "test": "pnpm run build:test && node --test \"./dist-test/**/*.test.js\"",
     "clean": "rm -rf dist dist-test *.tsbuildinfo"
   },
-  "//deps": "The 14 @opencodehub/* workspace libs are INLINED into the bundle at build time (tsup noExternal) — they are devDependencies, not runtime deps. `dependencies` below is exactly the third-party set the bundle imports at runtime (kept `external`), plus the two @sourcegraph/scip-* indexers the parse pipeline spawns as subprocesses. onnxruntime-web (prebuilt WASM, no native binding) is optional (lazy-loaded only when embeddings are enabled).",
+  "//deps": "The 15 @opencodehub/* workspace libs are INLINED into the bundle at build time (tsup noExternal) — they are devDependencies, not runtime deps. `dependencies` below is exactly the third-party set the bundle imports at runtime (kept `external`), plus the two @sourcegraph/scip-* indexers the parse pipeline spawns as subprocesses. onnxruntime-web (prebuilt WASM, no native binding) is optional (lazy-loaded only when embeddings are enabled).",
   "dependencies": {
     "@apidevtools/swagger-parser": "12.1.0",
     "@aws-sdk/client-bedrock-runtime": "3.1075.0",
@@ -70,6 +70,7 @@
     "@opencodehub/analysis": "workspace:*",
     "@opencodehub/core-types": "workspace:*",
     "@opencodehub/embedder": "workspace:*",
+    "@opencodehub/eval": "workspace:*",
     "@opencodehub/ingestion": "workspace:*",
     "@opencodehub/mcp": "workspace:*",
     "@opencodehub/pack": "workspace:*",
diff --git a/packages/cli/src/commands/variance-probe.test.ts b/packages/cli/src/commands/variance-probe.test.ts
new file mode 100644
index 0000000..2cbf9ed
--- /dev/null
+++ b/packages/cli/src/commands/variance-probe.test.ts
@@ -0,0 +1,130 @@
+/**
+ * Tests for `runVarianceProbe` (the `codehub code-pack --variance-probe`
+ * handler) and `assemblePackContext`.
+ *
+ * Strategy: inject the `_assemblePackContext` + `_runnerFor` test seams so the
+ * probe runs against a fake agent and a stub pack context — no real analyzed
+ * repo, no `claude`/`codex` spawn, no Bedrock. `assemblePackContext` is tested
+ * against a real on-disk pack directory.
+ */
+
+import { strict as assert } from "node:assert";
+import { mkdtemp, rm, writeFile } from "node:fs/promises";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { after, before, describe, it } from "node:test";
+import type { AgentRunner, Harness, RunOutcome, RunRequest } from "@opencodehub/eval";
+import { assemblePackContext, runVarianceProbe } from "./variance-probe.js";
+
+/** A fake runner: stable answer with-pack, distinct answer per run without. */
+class FakeRunner implements AgentRunner {
+  readonly name: string;
+  private i = 0;
+  constructor(harness: Harness) {
+    this.name = `fake:${harness}`;
+  }
+  run(request: RunRequest): Promise<RunOutcome> {
+    this.i += 1;
+    return Promise.resolve({
+      finalText: request.withPack ? "stable" : `wander-${this.i}`,
+      diff: "",
+      tokens: { inputTokens: request.withPack ? 110 : 100, outputTokens: 10, costUsd: null },
+      errored: false,
+    });
+  }
+}
+
+describe("runVarianceProbe (seamed)", () => {
+  let dir: string;
+  let taskFile: string;
+  before(async () => {
+    dir = await mkdtemp(join(tmpdir(), "och-vp-cmd-"));
+    taskFile = join(dir, "task.yaml");
+    await writeFile(
+      taskFile,
+      [
+        "id: cmd-task",
+        `repo: ${dir}`,
+        "commit: abc",
+        "instruction: Add a flag.",
+        "oracle:",
+        "  type: output_hash",
+        "",
+      ].join("\n"),
+      "utf8",
+    );
+  });
+  after(async () => {
+    await rm(dir, { recursive: true, force: true });
+  });
+
+  it("loads the task, injects the stub pack, and reports a positive delta", async () => {
+    const report = await runVarianceProbe({
+      taskFile,
+      runs: 3,
+      harness: "claude",
+      _assemblePackContext: async () => "STUB PACK CONTEXT",
+      _runnerFor: (h) => new FakeRunner(h),
+    });
+    assert.equal(report.taskId, "cmd-task");
+    assert.equal(report.harnesses.length, 1);
+    const h = report.harnesses[0];
+    assert.ok(h !== undefined);
+    assert.equal(h.harness, "claude");
+    assert.ok(h.dispersionDelta > 0, "the stabilizing pack drives a positive delta");
+  });
+
+  it("passes the assembled pack context into the with-pack arm", async () => {
+    let withPackPrompts = 0;
+    const capturing: AgentRunner = {
+      name: "capture",
+      run(req: RunRequest): Promise<RunOutcome> {
+        if (req.withPack) {
+          assert.equal(req.packContext, "ASSEMBLED-PACK");
+          withPackPrompts += 1;
+        } else {
+          assert.equal(req.packContext, undefined, "no pack in the without arm");
+        }
+        return Promise.resolve({
+          finalText: req.withPack ? "s" : `${Math.random()}`,
+          diff: "",
+          tokens: { inputTokens: 1, outputTokens: 1, costUsd: null },
+          errored: false,
+        });
+      },
+    };
+    await runVarianceProbe({
+      taskFile,
+      runs: 2,
+      harness: "codex",
+      _assemblePackContext: async () => "ASSEMBLED-PACK",
+      _runnerFor: () => capturing,
+    });
+    assert.equal(withPackPrompts, 2, "both with-pack runs saw the assembled context");
+  });
+});
+
+describe("assemblePackContext", () => {
+  let packDir: string;
+  before(async () => {
+    packDir = await mkdtemp(join(tmpdir(), "och-vp-pack-"));
+    await writeFile(join(packDir, "readme.md"), "# Pack readme\nhello", "utf8");
+    await writeFile(join(packDir, "skeleton.jsonl"), '{"sym":"foo"}', "utf8");
+    await writeFile(join(packDir, "manifest.json"), '{"packHash":"x"}', "utf8");
+    await writeFile(join(packDir, "context-bom.json"), '{"components":[]}', "utf8");
+  });
+  after(async () => {
+    await rm(packDir, { recursive: true, force: true });
+  });
+
+  it("includes the body files in sorted order and excludes manifest + context-bom", async () => {
+    const ctx = await assemblePackContext(packDir);
+    assert.ok(ctx.includes("### readme.md"));
+    assert.ok(ctx.includes("### skeleton.jsonl"));
+    assert.ok(ctx.includes("# Pack readme"));
+    assert.ok(!ctx.includes("manifest.json"), "manifest excluded (provenance, not content)");
+    assert.ok(!ctx.includes("context-bom.json"), "context-bom excluded");
+    // sorted: readme.md before skeleton.jsonl
+    assert.ok(ctx.indexOf("### readme.md") < ctx.indexOf("### skeleton.jsonl"));
+  });
+});
diff --git a/packages/cli/src/commands/variance-probe.ts b/packages/cli/src/commands/variance-probe.ts
new file mode 100644
index 0000000..e439e78
--- /dev/null
+++ b/packages/cli/src/commands/variance-probe.ts
@@ -0,0 +1,134 @@
+/**
+ * `codehub code-pack --variance-probe <task-file>` — measure the run-to-run
+ * answer variance an OCH pack removes from a coding agent (spec 010 / Move 2).
+ *
+ * Flow:
+ *   1. Load + validate the task file (`@opencodehub/eval`'s `loadTask`).
+ *   2. Generate the OCH code-pack for the task's repo (reuses `runCodePack`),
+ *      then assemble its on-disk artifacts into a single `packContext` string —
+ *      what the with-pack arm injects into the agent's context.
+ *   3. Run the with/without experiment via the Bedrock-wired direct-CLI runner
+ *      (`@opencodehub/eval`'s `CliAgentRunner`), N times per arm per harness.
+ *   4. Emit the report — human summary to stderr, `--json` to stdout.
+ *
+ * `console.log` to stdout is sanctioned in command modules (biome override);
+ * the JSON report goes to stdout so it pipes cleanly, the human summary to
+ * stderr so it never pollutes a piped stdout (the context-bom discipline).
+ *
+ * The probe is on-demand and costs real agent minutes + Bedrock spend — it is
+ * never a CI gate (spec 010 §8).
+ */
+
+import { readdir, readFile } from "node:fs/promises";
+import { join } from "node:path";
+import {
+  type AgentRunner,
+  CliAgentRunner,
+  formatReport,
+  type Harness,
+  loadTask,
+  type ProbeOptions,
+  runProbe,
+  serializeReport,
+  type VarianceReport,
+} from "@opencodehub/eval";
+import { runCodePack } from "./code-pack.js";
+
+export interface VarianceProbeArgs {
+  /** Path to the task file (YAML or JSON). */
+  readonly taskFile: string;
+  /** Runs per arm. Defaults to the probe's DEFAULT_RUNS (10). */
+  readonly runs?: number;
+  /** Restrict to one harness; omitted runs the task's set (default both). */
+  readonly harness?: Harness;
+  /** AWS region for Bedrock inference; falls back to the inherited env. */
+  readonly awsRegion?: string;
+  /** Bedrock model / inference profile override (applies to every harness). */
+  readonly model?: string;
+  /**
+   * Test seam — inject a fake pack-context assembler so unit tests don't need a
+   * real analyzed repo + pack on disk.
+   */
+  readonly _assemblePackContext?: (repo: string) => Promise<string>;
+  /**
+   * Test seam — inject a runner factory so unit tests drive a fake agent
+   * instead of spawning the real `claude` / `codex` CLIs.
+   */
+  readonly _runnerFor?: (harness: Harness) => AgentRunner;
+}
+
+/**
+ * Assemble the on-disk pack directory into a single context string. Reads the
+ * consumer-facing `readme.md` plus the BOM body files (`*.jsonl`, `*.md`) in
+ * sorted order, so the injected context is deterministic. The manifest +
+ * context-bom.json are provenance records, not agent-facing content, so they
+ * are skipped.
+ */
+export async function assemblePackContext(packOutDir: string): Promise<string> {
+  const entries = await readdir(packOutDir, { withFileTypes: true });
+  const files = entries
+    .filter((e) => e.isFile())
+    .map((e) => e.name)
+    .filter((n) => n !== "manifest.json" && n !== "context-bom.json")
+    .sort();
+
+  const parts: string[] = [];
+  for (const name of files) {
+    const body = await readFile(join(packOutDir, name), "utf8");
+    parts.push(`### ${name}\n\n${body}`);
+  }
+  return parts.join("\n\n");
+}
+
+/**
+ * Run the variance probe. Returns the report so callers (and tests) can assert
+ * on it; the CLI action prints it.
+ */
+export async function runVarianceProbe(args: VarianceProbeArgs): Promise<VarianceReport> {
+  const task = await loadTask(args.taskFile);
+
+  // 1. Generate the OCH pack for the task's repo (requires it to be analyzed).
+  //    Then assemble its artifacts into the context the with-pack arm injects.
+  const assemble = args._assemblePackContext ?? defaultAssemble;
+  const packContext = await assemble(task.repo);
+
+  // 2. The runner factory: a Bedrock-wired direct-CLI runner per harness
+  //    (spec 010 §4a). Tests inject a fake via `_runnerFor`.
+  const runnerFor =
+    args._runnerFor ??
+    ((harness: Harness) =>
+      new CliAgentRunner({
+        harness,
+        ...(args.model !== undefined ? { model: args.model } : {}),
+        ...(args.awsRegion !== undefined ? { awsRegion: args.awsRegion } : {}),
+      }));
+
+  const options: ProbeOptions = {
+    packContext,
+    ...(args.runs !== undefined ? { runs: args.runs } : {}),
+    ...(args.harness !== undefined ? { harnesses: [args.harness] } : {}),
+  };
+
+  return runProbe(task, runnerFor, options);
+}
+
+/**
+ * Production pack-context assembler: generate the pack for `repo` via the same
+ * `runCodePack` the bare `code-pack` command uses, then read its artifacts.
+ */
+async function defaultAssemble(repo: string): Promise<string> {
+  const result = await runCodePack({ repo, engine: "pack" });
+  return assemblePackContext(result.outDir);
+}
+
+/**
+ * Print a {@link VarianceReport}. JSON → stdout (machine consumers / `--json`);
+ * the human summary → stderr so it never pollutes a piped stdout.
+ */
+export function printVarianceReport(report: VarianceReport, asJson: boolean): void {
+  if (asJson) {
+    console.log(serializeReport(report));
+  } else {
+    console.warn(formatReport(report));
+  }
+}
diff --git a/packages/cli/src/index.ts b/packages/cli/src/index.ts
index b762ea7..3b4d6f1 100644
--- a/packages/cli/src/index.ts
+++ b/packages/cli/src/index.ts
@@ -375,8 +375,49 @@ program
     "After packing, print a summary of the context read-receipt (files indexed, lines, " +
       "hash coverage, per-language breakdown) read from the pack's context-bom.json",
   )
-  .option("--json", "With --explain-context, emit the receipt summary as JSON on stdout")
+  .option("--json", "With --explain-context or --variance-probe, emit the result as JSON on stdout")
+  .option(
+    "--variance-probe <task-file>",
+    "Measure the run-to-run answer variance an OCH pack removes from a coding agent " +
+      "(spec 010 / Move 2). Loads the task file, generates the pack, runs the agent N times " +
+      "with vs. without the pack, and reports the dispersion delta + token overhead. Agents run " +
+      "on Amazon Bedrock. On-demand only — costs real agent minutes + Bedrock spend, never a CI gate.",
+  )
+  .option("--runs <n>", "With --variance-probe: runs per arm (default 10)", (v) =>
+    Number.parseInt(v, 10),
+  )
+  .option(
+    "--harness <harness>",
+    "With --variance-probe: restrict to one agent — claude or codex (default: both)",
+  )
+  .option(
+    "--aws-region <region>",
+    "With --variance-probe: AWS region for Bedrock inference (default: inherited AWS_REGION)",
+  )
+  .option(
+    "--model <id>",
+    "With --variance-probe: Bedrock model / inference-profile id (default per harness)",
+  )
   .action(async (path: string | undefined, opts: Record<string, unknown>) => {
+    // --variance-probe short-circuits the normal pack path: it loads a task,
+    // generates the pack itself, and runs the with/without experiment.
+    if (typeof opts["varianceProbe"] === "string") {
+      const probeMod = await import("./commands/variance-probe.js");
+      const harness = parseHarness(opts["harness"]);
+      const runs =
+        typeof opts["runs"] === "number" && Number.isFinite(opts["runs"])
+          ? opts["runs"]
+          : undefined;
+      const report = await probeMod.runVarianceProbe({
+        taskFile: opts["varianceProbe"],
+        ...(runs !== undefined ? { runs } : {}),
+        ...(harness !== undefined ? { harness } : {}),
+        ...(typeof opts["awsRegion"] === "string" ? { awsRegion: opts["awsRegion"] } : {}),
+        ...(typeof opts["model"] === "string" ? { model: opts["model"] } : {}),
+      });
+      probeMod.printVarianceReport(report, opts["json"] === true);
+      return;
+    }
     const mod = await import("./commands/code-pack.js");
     const rawEngine = typeof opts["engine"] === "string" ? opts["engine"] : "pack";
     const engine: "pack" | "repomix" =
@@ -1032,6 +1073,19 @@ function parseQueryGranularity(raw: unknown): "symbol" | "file" | "community" |
   );
 }
 
+/**
+ * Parse a `--harness` value into the narrow agent set the variance probe
+ * accepts. Returns `undefined` when the flag was not supplied (the probe then
+ * runs the task's configured set — both agents by default). An unknown token
+ * throws so the user sees the typo rather than a silent fallback.
+ */
+function parseHarness(raw: unknown): "claude" | "codex" | undefined {
+  if (typeof raw !== "string" || raw.trim() === "") return undefined;
+  const trimmed = raw.trim();
+  if (trimmed === "claude" || trimmed === "codex") return trimmed;
+  throw new Error(`Unknown --harness value: "${trimmed}". Expected one of: claude, codex`);
+}
+
 /**
  * Parse a comma-separated `--granularity` value into the narrow set of
  * hierarchical embedding tiers the ingestion phase accepts. Returns
diff --git a/packages/cli/tsconfig.json b/packages/cli/tsconfig.json
index 44e105a..fb355d3 100644
--- a/packages/cli/tsconfig.json
+++ b/packages/cli/tsconfig.json
@@ -10,6 +10,7 @@
     { "path": "../analysis" },
     { "path": "../core-types" },
     { "path": "../embedder" },
+    { "path": "../eval" },
     { "path": "../storage" },
     { "path": "../search" },
     { "path": "../ingestion" },
diff --git a/packages/eval/README.md b/packages/eval/README.md
index f31fa03..4c23e33 100644
--- a/packages/eval/README.md
+++ b/packages/eval/README.md
@@ -1,7 +1,49 @@
-# @opencodehub/eval — extracted
+# @opencodehub/eval — variance probe
 
-The Python retrieval / graph-quality evaluation harness that used to live
-here was extracted to the sibling `opencodehub-testbed` repository so the
-production package set ships free of test-time dependencies. Any local
-`.venv/`, `.pytest_cache/`, `.ruff_cache/`, or `src/` left in this folder
-is untracked and gitignored — see `opencodehub-testbed` for the harness.
+`@opencodehub/eval` measures the **run-to-run answer variance** a coding agent
+shows with vs. without an OpenCodeHub code-pack in its context. It is the
+empirical instrument behind the decision-equivalence contract (Move 6): if the
+pack genuinely pins the agent's retrieval decision, the agent's answer wanders
+less across repeated runs. The probe turns that claim into a number.
+
+> The Python retrieval / graph-quality harness that *used* to live here was
+> extracted to the sibling `opencodehub-testbed` repo so the published package
+> set ships free of test-time dependencies. This TypeScript probe honors that
+> intent: it is pure JS + `node:child_process` (no heavy runtime deps, no
+> Python), and the package is `private`. It is force-bundled into the
+> `@opencodehub/cli` tarball (tsup `noExternal`), so it never adds an
+> independently published runtime package.
+
+## What it does
+
+Given a **task** — a fixed triple `(repo @ commit, instruction, success_oracle)` —
+the probe runs a coding agent N times in each of two arms (with-pack /
+without-pack), holding commit, instruction, agent, and model fixed. The only
+manipulated variable is whether the OCH pack is in context. It then computes a
+per-arm dispersion statistic appropriate to the oracle and reports the
+`without − with` delta alongside the token overhead.
+
+| Oracle | Dispersion statistic | Use |
+|---|---|---|
+| `output_hash` | distinct-output ratio `(# distinct outputs)/N` | zero-config quick look |
+| `assertion` (default) | pass-rate + failure-rate stddev across N | objective, defensible headline |
+| `judge` | stddev of LLM-panel rubric scores | tasks with no mechanical oracle |
+
+## Inference backend
+
+The direct-CLI runner drives `claude -p` (Claude Code) and `codex exec`
+(Codex), and **both route inference through Amazon Bedrock** (spec 010 §4a):
+Claude Code via `CLAUDE_CODE_USE_BEDROCK=1` + a `us.`-prefixed
+`ANTHROPIC_MODEL` inference profile; Codex via its first-party
+`amazon-bedrock` provider. AWS credentials and region are inherited from the
+operator's environment.
+
+## Determinism of the probe's own output
+
+The agent runs are nondeterministic by nature — that nondeterminism is exactly
+what's being measured. The probe's *report*, by contrast, is a pure function of
+the captured run outcomes: no wall-clock, no run-id. Two probe runs over the
+same captured outcomes serialize byte-identically (same discipline as the
+context-bom).
+
+See `.erpaval/specs/010-variance-probe/spec.md` for the full design.
diff --git a/packages/eval/examples/variance-task.yaml b/packages/eval/examples/variance-task.yaml
new file mode 100644
index 0000000..20b26d4
--- /dev/null
+++ b/packages/eval/examples/variance-task.yaml
@@ -0,0 +1,34 @@
+# Example variance-probe task (spec 010 / Move 2).
+#
+# Run with:
+#   codehub code-pack --variance-probe packages/eval/examples/variance-task.yaml
+#
+# The probe generates the OCH pack for `repo`, then runs the agent N times
+# (default 10) per arm — with vs. without the pack in context — and reports the
+# dispersion delta + token overhead. Agents run on Amazon Bedrock; the probe
+# inherits your AWS region + credentials from the environment.
+
+# Human-facing id, used to label the report.
+id: add-json-flag-to-status
+
+# A pinned checkout. Frozen so the pack is the only variable across runs.
+repo: /abs/path/to/your/repo
+commit: 0000000000000000000000000000000000000000
+
+# The natural-language ask, given verbatim to the agent every run.
+instruction: >
+  Add a --json flag to `codehub status` that prints the staleness record as
+  JSON on stdout. Keep the existing human-readable output as the default.
+
+# How a run is scored. The default oracle is `assertion` — objective and
+# defensible. Exit code 0 = pass; the probe reports pass-rate dispersion across
+# the N runs in each arm.
+oracle:
+  type: assertion
+  command: pnpm --filter @opencodehub/cli build:test && node --test "./packages/cli/dist-test/commands/status.test.js"
+  timeoutMs: 300000
+
+# Optional: pin one agent. Omit to run the default set (claude + codex), which
+# gives the agent-neutral claim ("the pack halves variance regardless of which
+# agent reads it").
+# harness: claude
diff --git a/packages/eval/package.json b/packages/eval/package.json
new file mode 100644
index 0000000..0ab2989
--- /dev/null
+++ b/packages/eval/package.json
@@ -0,0 +1,67 @@
+{
+  "name": "@opencodehub/eval",
+  "version": "0.1.0",
+  "private": true,
+  "description": "OpenCodeHub — variance probe: measure the run-to-run answer variance an OCH pack removes from a coding agent",
+  "license": "Apache-2.0",
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/theagenticguy/opencodehub.git",
+    "directory": "packages/eval"
+  },
+  "homepage": "https://github.com/theagenticguy/opencodehub#readme",
+  "bugs": {
+    "url": "https://github.com/theagenticguy/opencodehub/issues"
+  },
+  "type": "module",
+  "main": "./dist/index.js",
+  "types": "./dist/index.d.ts",
+  "exports": {
+    ".": {
+      "types": "./dist/index.d.ts",
+      "import": "./dist/index.js"
+    }
+  },
+  "files": [
+    "dist/**/*.js",
+    "!dist/**/*.test.js",
+    "dist/**/*.d.ts",
+    "!dist/**/*.test.d.ts",
+    "dist/**/*.js.map",
+    "!dist/**/*.test.js.map",
+    "dist/**/*.d.ts.map",
+    "!dist/**/*.test.d.ts.map"
+  ],
+  "scripts": {
+    "build": "tsc -b",
+    "test": "node --test \"./dist/**/*.test.js\"",
+    "clean": "rm -rf dist *.tsbuildinfo"
+  },
+  "dependencies": {
+    "@opencodehub/core-types": "workspace:*",
+    "yaml": "2.9.0",
+    "zod": "4.4.3"
+  },
+  "devDependencies": {
+    "@types/node": "26.0.1",
+    "typescript": "6.0.3"
+  },
+  "publishConfig": {
+    "access": "public"
+  },
+  "keywords": [
+    "opencodehub",
+    "code-intelligence",
+    "mcp",
+    "model-context-protocol",
+    "ai",
+    "code-graph",
+    "variance",
+    "determinism",
+    "eval",
+    "agent-eval"
+  ],
+  "engines": {
+    "node": ">=24.15.0"
+  }
+}
diff --git a/packages/eval/src/cli-runner.test.ts b/packages/eval/src/cli-runner.test.ts
new file mode 100644
index 0000000..08e744e
--- /dev/null
+++ b/packages/eval/src/cli-runner.test.ts
@@ -0,0 +1,231 @@
+import { strict as assert } from "node:assert";
+import { describe, it } from "node:test";
+import {
+  buildAgentEnv,
+  buildArgv,
+  CliAgentRunner,
+  composePrompt,
+  DEFAULT_CLAUDE_MODEL,
+  DEFAULT_CODEX_MODEL,
+  parseClaudeOutput,
+  parseCodexOutput,
+  type SpawnFn,
+} from "./cli-runner.js";
+import type { RunRequest } from "./runner.js";
+import type { Task } from "./task.js";
+
+const TASK: Task = {
+  id: "t",
+  repo: "/tmp/repo",
+  commit: "deadbeef",
+  instruction: "Add a --json flag.",
+  oracle: { type: "output_hash", field: "final_text" },
+};
+
+describe("buildAgentEnv — Bedrock wiring (§4a)", () => {
+  it("sets CLAUDE_CODE_USE_BEDROCK and a us.-profile model for claude", () => {
+    const env = buildAgentEnv({ harness: "claude", awsRegion: "us-east-1", _baseEnv: {} });
+    assert.equal(env["CLAUDE_CODE_USE_BEDROCK"], "1");
+    assert.equal(env["ANTHROPIC_MODEL"], DEFAULT_CLAUDE_MODEL);
+    assert.ok(env["ANTHROPIC_MODEL"]?.startsWith("us."), "inference profile is us.-prefixed");
+    assert.equal(env["AWS_REGION"], "us-east-1");
+  });
+
+  it("honors an explicit model override for claude", () => {
+    const env = buildAgentEnv({
+      harness: "claude",
+      model: "us.anthropic.claude-opus-4-8",
+      _baseEnv: {},
+    });
+    assert.equal(env["ANTHROPIC_MODEL"], "us.anthropic.claude-opus-4-8");
+  });
+
+  it("does NOT set Claude Bedrock vars for the codex harness (codex uses argv -c)", () => {
+    const env = buildAgentEnv({ harness: "codex", awsRegion: "us-west-2", _baseEnv: {} });
+    assert.equal(env["CLAUDE_CODE_USE_BEDROCK"], undefined);
+    assert.equal(env["ANTHROPIC_MODEL"], undefined);
+    assert.equal(env["AWS_REGION"], "us-west-2");
+  });
+
+  it("falls back to the inherited AWS_REGION / AWS_DEFAULT_REGION", () => {
+    const env = buildAgentEnv({ harness: "claude", _baseEnv: { AWS_DEFAULT_REGION: "eu-west-1" } });
+    assert.equal(env["AWS_REGION"], "eu-west-1");
+  });
+
+  it("preserves inherited credentials (does not strip AWS_PROFILE / keys)", () => {
+    const env = buildAgentEnv({
+      harness: "claude",
+      _baseEnv: { AWS_PROFILE: "bedrock-dev", AWS_REGION: "us-east-1" },
+    });
+    assert.equal(env["AWS_PROFILE"], "bedrock-dev");
+  });
+});
+
+describe("buildArgv", () => {
+  it("builds the headless claude command with JSON output + model", () => {
+    const { command, argv } = buildArgv({ harness: "claude" }, "PROMPT");
+    assert.equal(command, "claude");
+    assert.deepEqual(argv, [
+      "-p",
+      "PROMPT",
+      "--output-format",
+      "json",
+      "--model",
+      DEFAULT_CLAUDE_MODEL,
+    ]);
+  });
+
+  it("builds the codex exec command routed to the amazon-bedrock provider", () => {
+    const { command, argv } = buildArgv({ harness: "codex" }, "PROMPT");
+    assert.equal(command, "codex");
+    assert.deepEqual(argv, [
+      "exec",
+      "--json",
+      "-c",
+      "model_provider=amazon-bedrock",
+      "-m",
+      DEFAULT_CODEX_MODEL,
+      "--skip-git-repo-check",
+      "PROMPT",
+    ]);
+  });
+});
+
+describe("composePrompt", () => {
+  it("returns the bare instruction in the without-pack arm", () => {
+    const req: RunRequest = { task: TASK, harness: "claude", withPack: false };
+    assert.equal(composePrompt(req), TASK.instruction);
+  });
+  it("prepends the pack context in the with-pack arm", () => {
+    const req: RunRequest = {
+      task: TASK,
+      harness: "claude",
+      withPack: true,
+      packContext: "PACK BODY",
+    };
+    const prompt = composePrompt(req);
+    assert.ok(prompt.startsWith("PACK BODY"));
+    assert.ok(prompt.endsWith(TASK.instruction));
+  });
+  it("ignores an empty pack context (falls back to bare instruction)", () => {
+    const req: RunRequest = { task: TASK, harness: "claude", withPack: true, packContext: "" };
+    assert.equal(composePrompt(req), TASK.instruction);
+  });
+});
+
+describe("parseClaudeOutput", () => {
+  it("extracts result text + usage + cost from the JSON result object", () => {
+    const stdout = JSON.stringify({
+      type: "result",
+      subtype: "success",
+      result: "Done — added the flag.",
+      usage: { input_tokens: 1234, output_tokens: 56 },
+      total_cost_usd: 0.0123,
+    });
+    const { finalText, tokens } = parseClaudeOutput(stdout);
+    assert.equal(finalText, "Done — added the flag.");
+    assert.equal(tokens.inputTokens, 1234);
+    assert.equal(tokens.outputTokens, 56);
+    assert.equal(tokens.costUsd, 0.0123);
+  });
+  it("tolerates a missing usage block (zeros, null cost)", () => {
+    const { tokens, finalText } = parseClaudeOutput(JSON.stringify({ result: "x" }));
+    assert.equal(finalText, "x");
+    assert.equal(tokens.inputTokens, 0);
+    assert.equal(tokens.costUsd, null);
+  });
+  it("throws on unparseable stdout", () => {
+    assert.throws(() => parseClaudeOutput("not json"));
+  });
+});
+
+describe("parseCodexOutput", () => {
+  const jsonl = [
+    JSON.stringify({ type: "thread.started", thread_id: "x" }),
+    JSON.stringify({ type: "turn.started" }),
+    "some progress noise that is not json",
+    JSON.stringify({
+      type: "item.completed",
+      item: { id: "i1", type: "command_execution", status: "completed" },
+    }),
+    JSON.stringify({
+      type: "item.completed",
+      item: { id: "i2", type: "agent_message", text: "Repo summary here." },
+    }),
+    JSON.stringify({
+      type: "turn.completed",
+      usage: {
+        input_tokens: 24763,
+        cached_input_tokens: 24448,
+        output_tokens: 122,
+        reasoning_output_tokens: 0,
+      },
+    }),
+  ].join("\n");
+
+  it("extracts the final agent_message text and the last turn.completed usage", () => {
+    const { finalText, tokens } = parseCodexOutput(jsonl);
+    assert.equal(finalText, "Repo summary here.");
+    assert.equal(tokens.inputTokens, 24763);
+    assert.equal(tokens.outputTokens, 122);
+    assert.equal(tokens.costUsd, null, "codex exposes no per-invocation USD cost");
+  });
+
+  it("returns empty/zeros when no agent_message is present", () => {
+    const { finalText, tokens } = parseCodexOutput(JSON.stringify({ type: "turn.started" }));
+    assert.equal(finalText, "");
+    assert.equal(tokens.inputTokens, 0);
+  });
+});
+
+describe("CliAgentRunner.run (stubbed spawn)", () => {
+  it("returns a successful outcome from a claude run", async () => {
+    const spawnFn: SpawnFn = async ({ command, argv }) => {
+      assert.equal(command, "claude");
+      assert.ok(argv.includes("--output-format"));
+      return {
+        stdout: JSON.stringify({ result: "ok", usage: { input_tokens: 10, output_tokens: 2 } }),
+        stderr: "",
+        code: 0,
+      };
+    };
+    const runner = new CliAgentRunner({ harness: "claude", _spawn: spawnFn, _baseEnv: {} });
+    const outcome = await runner.run({ task: TASK, harness: "claude", withPack: false });
+    assert.equal(outcome.errored, false);
+    assert.equal(outcome.finalText, "ok");
+    assert.equal(outcome.tokens.inputTokens, 10);
+    assert.equal(outcome.checkoutPath, TASK.repo);
+    assert.equal(runner.name, "cli:claude");
+  });
+
+  it("marks the outcome errored on a non-zero exit (no throw)", async () => {
+    const spawnFn: SpawnFn = async () => ({ stdout: "", stderr: "boom", code: 1 });
+    const runner = new CliAgentRunner({ harness: "claude", _spawn: spawnFn, _baseEnv: {} });
+    const outcome = await runner.run({ task: TASK, harness: "claude", withPack: false });
+    assert.equal(outcome.errored, true);
+    assert.equal(outcome.tokens.inputTokens, 0);
+  });
+
+  it("marks the outcome errored when stdout is unparseable", async () => {
+    const spawnFn: SpawnFn = async () => ({ stdout: "garbage", stderr: "", code: 0 });
+    const runner = new CliAgentRunner({ harness: "claude", _spawn: spawnFn, _baseEnv: {} });
+    const outcome = await runner.run({ task: TASK, harness: "claude", withPack: false });
+    assert.equal(outcome.errored, true);
+  });
+
+  it("injects the pack context into the prompt on the with-pack arm", async () => {
+    let seenPrompt = "";
+    const spawnFn: SpawnFn = async ({ argv }) => {
+      // claude argv: ["-p", PROMPT, ...]
+      seenPrompt = argv[1] ?? "";
+      return {
+        stdout: JSON.stringify({ result: "ok", usage: { input_tokens: 1, output_tokens: 1 } }),
+        stderr: "",
+        code: 0,
+      };
+    };
+    const runner = new CliAgentRunner({ harness: "claude", _spawn: spawnFn, _baseEnv: {} });
+    await runner.run({ task: TASK, harness: "claude", withPack: true, packContext: "THE PACK" });
+    assert.ok(seenPrompt.startsWith("THE PACK"));
+  });
+});
diff --git a/packages/eval/src/cli-runner.ts b/packages/eval/src/cli-runner.ts
new file mode 100644
index 0000000..fbbf0cf
--- /dev/null
+++ b/packages/eval/src/cli-runner.ts
@@ -0,0 +1,292 @@
+/**
+ * Direct-CLI runner (spec 010 §4, §4a) — v1's {@link AgentRunner}.
+ *
+ * Shells out to `claude -p` (Claude Code) and `codex exec` (Codex) in headless
+ * mode, with **inference routed through Amazon Bedrock** for both. The Bedrock
+ * env/flag wiring is grounded against current docs (code.claude.com,
+ * developers.openai.com/codex), not recalled:
+ *
+ *   Claude Code → Bedrock:
+ *     env  CLAUDE_CODE_USE_BEDROCK=1, AWS_REGION, ANTHROPIC_MODEL=<us.-profile>
+ *     cmd  claude -p "<prompt>" --output-format json --model <profile>
+ *     out  one JSON object: .result, .usage.{input,output}_tokens, .total_cost_usd
+ *
+ *   Codex → Bedrock (first-party `amazon-bedrock` provider):
+ *     env  AWS_REGION (+ AWS_BEARER_TOKEN_BEDROCK or AWS SDK creds)
+ *     cmd  codex exec --json -c model_provider=amazon-bedrock -m <model>
+ *            --skip-git-repo-check "<prompt>"
+ *     out  JSONL; final = last item.completed type "agent_message"; tokens =
+ *          last turn.completed.usage (no total — sum input+output)
+ *
+ * The pure pieces (env construction, prompt composition, output parsing) are
+ * exported and unit-tested; the spawn itself is integration-only (needs the
+ * CLIs + AWS creds, which CI sandboxes lack) and is exercised via the
+ * `_spawn` seam in tests.
+ */
+
+import { spawn } from "node:child_process";
+import type { AgentRunner, Harness, RunOutcome, RunRequest, RunTokens } from "./runner.js";
+
+/** Default Claude Code Bedrock inference profile (us.-prefixed, §4a). */
+export const DEFAULT_CLAUDE_MODEL = "us.anthropic.claude-sonnet-4-6";
+/** Default Codex Bedrock model id (§4a). */
+export const DEFAULT_CODEX_MODEL = "openai.gpt-5.5";
+
+export interface CliRunnerConfig {
+  readonly harness: Harness;
+  /**
+   * Bedrock model id / inference profile. Defaults per harness
+   * ({@link DEFAULT_CLAUDE_MODEL} / {@link DEFAULT_CODEX_MODEL}).
+   */
+  readonly model?: string;
+  /**
+   * AWS region for Bedrock. Falls back to the inherited `AWS_REGION` /
+   * `AWS_DEFAULT_REGION`; an explicit value here overrides it.
+   */
+  readonly awsRegion?: string;
+  /**
+   * Test seam — inject a spawn function so unit tests don't shell out to the
+   * real CLIs. Production leaves this unset.
+   */
+  readonly _spawn?: SpawnFn;
+  /**
+   * Test seam — inject the base environment (defaults to `process.env`) so
+   * tests can assert the Bedrock vars are layered on without depending on the
+   * host env.
+   */
+  readonly _baseEnv?: NodeJS.ProcessEnv;
+}
+
+/** Minimal spawn result the runner consumes. */
+export interface SpawnResult {
+  readonly stdout: string;
+  readonly stderr: string;
+  /** Process exit code; null on signal termination. */
+  readonly code: number | null;
+}
+
+/** Spawn function shape (the `_spawn` seam). */
+export type SpawnFn = (args: {
+  readonly command: string;
+  readonly argv: readonly string[];
+  readonly env: NodeJS.ProcessEnv;
+  readonly cwd: string;
+  readonly stdin: string;
+}) => Promise<SpawnResult>;
+
+/**
+ * Build the child-process environment for a harness, layering the Bedrock
+ * wiring (§4a) over the inherited base env. Pure + exported for tests.
+ */
+export function buildAgentEnv(config: CliRunnerConfig): NodeJS.ProcessEnv {
+  const base = config._baseEnv ?? process.env;
+  const region = config.awsRegion ?? base["AWS_REGION"] ?? base["AWS_DEFAULT_REGION"];
+  const env: NodeJS.ProcessEnv = { ...base };
+  if (region !== undefined) env["AWS_REGION"] = region;
+
+  if (config.harness === "claude") {
+    // Claude Code → Bedrock. Credentials resolve via the default AWS SDK chain
+    // already present in `base` (env keys / AWS_PROFILE / SSO / bearer token);
+    // we only assert the toggle + model.
+    env["CLAUDE_CODE_USE_BEDROCK"] = "1";
+    env["ANTHROPIC_MODEL"] = config.model ?? DEFAULT_CLAUDE_MODEL;
+  }
+  // Codex selects Bedrock via `-c model_provider=amazon-bedrock` on the
+  // argv (see buildArgv), not via env; its AWS auth is inherited from base.
+  return env;
+}
+
+/**
+ * Compose the prompt handed to the agent. The with-pack arm prepends the OCH
+ * pack context; the without-pack arm is the bare instruction. Pure + exported.
+ */
+export function composePrompt(request: RunRequest): string {
+  if (request.withPack && request.packContext !== undefined && request.packContext.length > 0) {
+    return `${request.packContext}\n\n---\n\n${request.task.instruction}`;
+  }
+  return request.task.instruction;
+}
+
+/** Build the argv for a harness invocation. Pure + exported for tests. */
+export function buildArgv(
+  config: CliRunnerConfig,
+  prompt: string,
+): { command: string; argv: string[] } {
+  if (config.harness === "claude") {
+    return {
+      command: "claude",
+      argv: [
+        "-p",
+        prompt,
+        "--output-format",
+        "json",
+        "--model",
+        config.model ?? DEFAULT_CLAUDE_MODEL,
+      ],
+    };
+  }
+  // Codex: route to the first-party Bedrock provider via -c, pin the model,
+  // emit JSONL with --json, and skip the git-repo guard so the probe can run
+  // the agent against an arbitrary checkout.
+  return {
+    command: "codex",
+    argv: [
+      "exec",
+      "--json",
+      "-c",
+      "model_provider=amazon-bedrock",
+      "-m",
+      config.model ?? DEFAULT_CODEX_MODEL,
+      "--skip-git-repo-check",
+      prompt,
+    ],
+  };
+}
+
+/**
+ * Parse Claude Code's `--output-format json` single result object into a
+ * {@link RunOutcome} (minus checkout/errored, filled by the runner). Pure +
+ * exported. Throws on unparseable output so a malformed run is a hard error,
+ * not a silent zero-token success.
+ */
+export function parseClaudeOutput(stdout: string): { finalText: string; tokens: RunTokens } {
+  const doc = JSON.parse(stdout) as {
+    result?: unknown;
+    usage?: { input_tokens?: unknown; output_tokens?: unknown };
+    total_cost_usd?: unknown;
+  };
+  const finalText = typeof doc.result === "string" ? doc.result : "";
+  const inputTokens = num(doc.usage?.input_tokens);
+  const outputTokens = num(doc.usage?.output_tokens);
+  const costUsd = typeof doc.total_cost_usd === "number" ? doc.total_cost_usd : null;
+  return { finalText, tokens: { inputTokens, outputTokens, costUsd } };
+}
+
+/**
+ * Parse Codex's `--json` JSONL stream. Final answer = the last
+ * `item.completed` whose item type is `agent_message`; tokens = the last
+ * `turn.completed.usage` (Codex reports no total — we sum input+output). Pure +
+ * exported. Tolerates interleaved non-JSON lines (progress noise) by skipping
+ * them.
+ */
+export function parseCodexOutput(stdout: string): { finalText: string; tokens: RunTokens } {
+  let finalText = "";
+  let inputTokens = 0;
+  let outputTokens = 0;
+  for (const line of stdout.split("\n")) {
+    const trimmed = line.trim();
+    if (trimmed.length === 0) continue;
+    let evt: unknown;
+    try {
+      evt = JSON.parse(trimmed);
+    } catch {
+      continue; // progress / non-JSON line — skip
+    }
+    if (typeof evt !== "object" || evt === null) continue;
+    const e = evt as {
+      type?: unknown;
+      item?: { type?: unknown; text?: unknown };
+      usage?: { input_tokens?: unknown; output_tokens?: unknown };
+    };
+    if (e.type === "item.completed" && e.item?.type === "agent_message") {
+      if (typeof e.item.text === "string") finalText = e.item.text;
+    } else if (e.type === "turn.completed" && e.usage !== undefined) {
+      inputTokens = num(e.usage.input_tokens);
+      outputTokens = num(e.usage.output_tokens);
+    }
+  }
+  // Codex does not surface a per-invocation USD cost on the public event
+  // schema, so cost is null (the report tolerates a null-cost arm).
+  return { finalText, tokens: { inputTokens, outputTokens, costUsd: null } };
+}
+
+function num(v: unknown): number {
+  return typeof v === "number" && Number.isFinite(v) ? v : 0;
+}
+
+/** Default production spawn — real `node:child_process.spawn`, buffered. */
+const defaultSpawn: SpawnFn = ({ command, argv, env, cwd, stdin }) =>
+  new Promise<SpawnResult>((resolve) => {
+    const child = spawn(command, [...argv], { env, cwd, stdio: ["pipe", "pipe", "pipe"] });
+    let stdout = "";
+    let stderr = "";
+    child.stdout.on("data", (d: Buffer) => {
+      stdout += d.toString("utf8");
+    });
+    child.stderr.on("data", (d: Buffer) => {
+      stderr += d.toString("utf8");
+    });
+    child.on("error", (err) => {
+      // Binary missing / not executable — surface as a non-zero result rather
+      // than rejecting, so the runner can wrap it in an actionable error.
+      resolve({ stdout, stderr: `${stderr}${String(err)}`, code: 127 });
+    });
+    child.on("close", (code) => resolve({ stdout, stderr, code }));
+    if (stdin.length > 0) child.stdin.write(stdin);
+    child.stdin.end();
+  });
+
+/**
+ * The direct-CLI runner. Each `run` spawns a fresh agent process (a fresh
+ * session, per §3), composes the arm-appropriate prompt, parses the harness's
+ * structured output, and returns a {@link RunOutcome}. A spawn/exit failure is
+ * captured as an `errored` outcome rather than throwing, so one crashed run
+ * doesn't abort the experiment — a pack that reduces crashes is lower variance.
+ */
+export class CliAgentRunner implements AgentRunner {
+  readonly name: string;
+  private readonly config: CliRunnerConfig;
+  private readonly spawnFn: SpawnFn;
+
+  constructor(config: CliRunnerConfig) {
+    this.config = config;
+    this.name = `cli:${config.harness}`;
+    this.spawnFn = config._spawn ?? defaultSpawn;
+  }
+
+  async run(request: RunRequest): Promise<RunOutcome> {
+    const prompt = composePrompt(request);
+    const { command, argv } = buildArgv(this.config, prompt);
+    const env = buildAgentEnv(this.config);
+    // The checkout the agent works in. v1 runs the agent against the task's
+    // repo path directly; a future iteration can clone per-run for isolation.
+    const cwd = request.task.repo;
+
+    let result: SpawnResult;
+    try {
+      result = await this.spawnFn({ command, argv, env, cwd, stdin: "" });
+    } catch (err) {
+      return erroredOutcome(cwd, `spawn failed: ${String(err)}`);
+    }
+
+    if (result.code !== 0) {
+      return erroredOutcome(cwd, result.stderr || `${command} exited ${String(result.code)}`);
+    }
+
+    try {
+      const parsed =
+        this.config.harness === "claude"
+          ? parseClaudeOutput(result.stdout)
+          : parseCodexOutput(result.stdout);
+      return {
+        finalText: parsed.finalText,
+        diff: "", // neither CLI surfaces a structured diff in headless JSON; left empty
+        tokens: parsed.tokens,
+        checkoutPath: cwd,
+        errored: false,
+      };
+    } catch (err) {
+      return erroredOutcome(cwd, `failed to parse ${command} output: ${String(err)}`);
+    }
+  }
+}
+
+function erroredOutcome(checkoutPath: string, finalText: string): RunOutcome {
+  return {
+    finalText,
+    diff: "",
+    tokens: { inputTokens: 0, outputTokens: 0, costUsd: null },
+    checkoutPath,
+    errored: true,
+  };
+}
diff --git a/packages/eval/src/dispersion.test.ts b/packages/eval/src/dispersion.test.ts
new file mode 100644
index 0000000..09557a4
--- /dev/null
+++ b/packages/eval/src/dispersion.test.ts
@@ -0,0 +1,120 @@
+import { strict as assert } from "node:assert";
+import { describe, it } from "node:test";
+import {
+  type ArmDispersion,
+  bernoulliDispersion,
+  dispersionScalar,
+  distinctOutputRatio,
+  mean,
+  populationStddev,
+} from "./dispersion.js";
+
+describe("populationStddev", () => {
+  it("returns 0 for fewer than two values", () => {
+    assert.equal(populationStddev([]), 0);
+    assert.equal(populationStddev([5]), 0);
+  });
+
+  it("is zero for a constant sample", () => {
+    assert.equal(populationStddev([3, 3, 3, 3]), 0);
+  });
+
+  it("computes the population (not sample) stddev", () => {
+    // values [0,1] → mean 0.5, variance ((.25)+(.25))/2 = .25, stddev .5
+    assert.equal(populationStddev([0, 1]), 0.5);
+    // [2,4,4,4,5,5,7,9] is the classic example: population stddev = 2
+    assert.equal(populationStddev([2, 4, 4, 4, 5, 5, 7, 9]), 2);
+  });
+});
+
+describe("mean", () => {
+  it("returns 0 for empty input", () => {
+    assert.equal(mean([]), 0);
+  });
+  it("averages", () => {
+    assert.equal(mean([1, 2, 3, 4]), 2.5);
+  });
+});
+
+describe("distinctOutputRatio", () => {
+  it("returns 0 for empty input", () => {
+    assert.equal(distinctOutputRatio([]), 0);
+  });
+  it("is 1/N when every output is identical (perfectly stable)", () => {
+    assert.equal(distinctOutputRatio(["a", "a", "a", "a"]), 0.25);
+  });
+  it("is 1.0 when every output differs (maximally unstable)", () => {
+    assert.equal(distinctOutputRatio(["a", "b", "c"]), 1);
+  });
+  it("counts distinct values", () => {
+    assert.equal(distinctOutputRatio(["a", "a", "b", "b"]), 0.5);
+  });
+});
+
+describe("bernoulliDispersion", () => {
+  it("returns zeros for empty input", () => {
+    assert.deepEqual(bernoulliDispersion([]), { passRate: 0, stddev: 0 });
+  });
+  it("is zero-dispersion when the agent is perfectly consistent (all pass)", () => {
+    const d = bernoulliDispersion([true, true, true, true]);
+    assert.equal(d.passRate, 1);
+    assert.equal(d.stddev, 0);
+  });
+  it("is zero-dispersion when perfectly consistent (all fail)", () => {
+    const d = bernoulliDispersion([false, false, false]);
+    assert.equal(d.passRate, 0);
+    assert.equal(d.stddev, 0);
+  });
+  it("is maximal at a 50/50 coin-flip agent", () => {
+    const d = bernoulliDispersion([true, false, true, false]);
+    assert.equal(d.passRate, 0.5);
+    assert.equal(d.stddev, 0.5); // sqrt(0.5*0.5)
+  });
+  it("captures the headline example: 6/10 with vs 3/10 without", () => {
+    const withPack = bernoulliDispersion([
+      true,
+      true,
+      true,
+      true,
+      true,
+      true,
+      false,
+      false,
+      false,
+      false,
+    ]);
+    const withoutPack = bernoulliDispersion([
+      true,
+      true,
+      true,
+      false,
+      false,
+      false,
+      false,
+      false,
+      false,
+      false,
+    ]);
+    assert.equal(withPack.passRate, 0.6);
+    assert.equal(withoutPack.passRate, 0.3);
+    // without-pack stddev (p=0.3) is higher than the toward-extreme... actually
+    // 0.6 is closer to 0.5 than 0.3, so with-pack stddev is HIGHER here — the
+    // pass-rate is the headline; stddev measures distance from a decided agent.
+    assert.ok(withPack.stddev > 0 && withoutPack.stddev > 0);
+  });
+});
+
+describe("dispersionScalar", () => {
+  it("uses the distinct ratio for output_hash", () => {
+    const d: ArmDispersion = { kind: "output_hash", distinctRatio: 0.7, runs: 10 };
+    assert.equal(dispersionScalar(d), 0.7);
+  });
+  it("uses the stddev for assertion", () => {
+    const d: ArmDispersion = { kind: "assertion", passRate: 0.6, stddev: 0.49, runs: 10 };
+    assert.equal(dispersionScalar(d), 0.49);
+  });
+  it("uses the stddev for judge", () => {
+    const d: ArmDispersion = { kind: "judge", meanScore: 0.8, stddev: 0.12, runs: 10 };
+    assert.equal(dispersionScalar(d), 0.12);
+  });
+});
diff --git a/packages/eval/src/dispersion.ts b/packages/eval/src/dispersion.ts
new file mode 100644
index 0000000..689b778
--- /dev/null
+++ b/packages/eval/src/dispersion.ts
@@ -0,0 +1,99 @@
+/**
+ * Dispersion statistics — the per-arm "how much did the answer wander" numbers
+ * (spec 010 §2). Pure functions over an arm's N outcomes; no I/O, no clock.
+ * These are the most safety-critical part of the probe, so they're isolated
+ * here and unit-covered exhaustively.
+ *
+ * Each oracle type maps to one dispersion statistic:
+ *   - `output_hash` → distinct-output ratio  = (# distinct outputs) / N
+ *   - `assertion`   → pass-rate + failure-rate stddev (Bernoulli) across N
+ *   - `judge`       → stddev of rubric scores across N
+ *
+ * Lower dispersion = the agent's behavior is more stable run-to-run. The
+ * Move-2 claim holds when the with-pack arm's dispersion is materially below
+ * the without-pack arm's.
+ */
+
+/** Population standard deviation of a sample. Returns 0 for <2 values. */
+export function populationStddev(values: readonly number[]): number {
+  const n = values.length;
+  if (n < 2) return 0;
+  let sum = 0;
+  for (const v of values) sum += v;
+  const mean = sum / n;
+  let acc = 0;
+  for (const v of values) {
+    const d = v - mean;
+    acc += d * d;
+  }
+  return Math.sqrt(acc / n);
+}
+
+/** Arithmetic mean. Returns 0 for an empty input. */
+export function mean(values: readonly number[]): number {
+  const n = values.length;
+  if (n === 0) return 0;
+  let sum = 0;
+  for (const v of values) sum += v;
+  return sum / n;
+}
+
+/**
+ * Distinct-output ratio for the `output_hash` oracle. `1.0` = every run
+ * produced a different output (maximally unstable); `1/N` = every run produced
+ * the same output (perfectly stable). Returns 0 for an empty input.
+ */
+export function distinctOutputRatio(outputs: readonly string[]): number {
+  const n = outputs.length;
+  if (n === 0) return 0;
+  return new Set(outputs).size / n;
+}
+
+/**
+ * Bernoulli (pass/fail) dispersion for the `assertion` oracle. `passes[i]` is
+ * true when run i passed. Returns the pass rate and the population stddev of
+ * the pass indicator (treating pass=1, fail=0). For a Bernoulli sample the
+ * stddev is `sqrt(p*(1-p))`, maximized at p=0.5 (a coin-flip agent) and zero
+ * when the agent is perfectly consistent (all-pass or all-fail).
+ */
+export function bernoulliDispersion(passes: readonly boolean[]): {
+  readonly passRate: number;
+  readonly stddev: number;
+} {
+  const n = passes.length;
+  if (n === 0) return { passRate: 0, stddev: 0 };
+  const indicators = passes.map((p) => (p ? 1 : 0));
+  return { passRate: mean(indicators), stddev: populationStddev(indicators) };
+}
+
+/** Discriminated dispersion result, tagged by the oracle that produced it. */
+export type ArmDispersion =
+  | { readonly kind: "output_hash"; readonly distinctRatio: number; readonly runs: number }
+  | {
+      readonly kind: "assertion";
+      readonly passRate: number;
+      readonly stddev: number;
+      readonly runs: number;
+    }
+  | {
+      readonly kind: "judge";
+      readonly meanScore: number;
+      readonly stddev: number;
+      readonly runs: number;
+    };
+
+/**
+ * The single scalar each `ArmDispersion` reduces to for the with/without delta.
+ * Lower = more stable. For `output_hash` it's the distinct ratio; for
+ * `assertion` and `judge` it's the stddev (pass-rate / score stability).
+ */
+export function dispersionScalar(d: ArmDispersion): number {
+  switch (d.kind) {
+    case "output_hash":
+      return d.distinctRatio;
+    case "assertion":
+      return d.stddev;
+    case "judge":
+      return d.stddev;
+  }
+}
diff --git a/packages/eval/src/index.ts b/packages/eval/src/index.ts
new file mode 100644
index 0000000..65efabd
--- /dev/null
+++ b/packages/eval/src/index.ts
@@ -0,0 +1,66 @@
+/**
+ * `@opencodehub/eval` — the variance probe (spec 010 / Move 2).
+ *
+ * Public surface: load a task, run the with/without experiment via a runner,
+ * score it with the task's oracle, and emit a deterministic report. v1 ships
+ * the Bedrock-wired direct-CLI runner; the probe core is runner-agnostic.
+ */
+
+export {
+  buildAgentEnv,
+  buildArgv,
+  CliAgentRunner,
+  type CliRunnerConfig,
+  composePrompt,
+  DEFAULT_CLAUDE_MODEL,
+  DEFAULT_CODEX_MODEL,
+  parseClaudeOutput,
+  parseCodexOutput,
+  type SpawnFn,
+  type SpawnResult,
+} from "./cli-runner.js";
+export {
+  type ArmDispersion,
+  bernoulliDispersion,
+  dispersionScalar,
+  distinctOutputRatio,
+  mean,
+  populationStddev,
+} from "./dispersion.js";
+export { type JudgeScorer, type ScoreOptions, scoreArm } from "./oracle.js";
+export {
+  DEFAULT_RUNS,
+  type ProbeOptions,
+  type ProbeRunEvent,
+  probeHarness,
+  resolveHarnesses,
+  runProbe,
+} from "./probe.js";
+export {
+  type ArmReport,
+  type ArmTokens,
+  buildHarnessReport,
+  formatReport,
+  type HarnessReport,
+  serializeReport,
+  TOKEN_OVERHEAD_FLAG,
+  type VarianceReport,
+} from "./report.js";
+export type {
+  AgentRunner,
+  Harness,
+  RunOutcome,
+  RunRequest,
+  RunTokens,
+} from "./runner.js";
+export {
+  type AssertionOracle,
+  type JudgeOracle,
+  loadTask,
+  type Oracle,
+  OracleSchema,
+  type OutputHashOracle,
+  type Task,
+  TaskSchema,
+  TaskValidationError,
+} from "./task.js";
diff --git a/packages/eval/src/oracle.test.ts b/packages/eval/src/oracle.test.ts
new file mode 100644
index 0000000..6565e4c
--- /dev/null
+++ b/packages/eval/src/oracle.test.ts
@@ -0,0 +1,150 @@
+import { strict as assert } from "node:assert";
+import { mkdtemp, rm } from "node:fs/promises";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { after, before, describe, it } from "node:test";
+import { scoreArm } from "./oracle.js";
+import type { RunOutcome } from "./runner.js";
+import type { AssertionOracle, JudgeOracle, OutputHashOracle } from "./task.js";
+
+const outcome = (over: Partial<RunOutcome>): RunOutcome => ({
+  finalText: "",
+  diff: "",
+  tokens: { inputTokens: 0, outputTokens: 0, costUsd: null },
+  errored: false,
+  ...over,
+});
+
+describe("scoreArm — output_hash", () => {
+  const oracle: OutputHashOracle = { type: "output_hash", field: "final_text" };
+
+  it("is 1/N (stable) when every final_text is identical", async () => {
+    const outcomes = [
+      outcome({ finalText: "same" }),
+      outcome({ finalText: "same" }),
+      outcome({ finalText: "same" }),
+    ];
+    const d = await scoreArm(oracle, outcomes);
+    assert.equal(d.kind, "output_hash");
+    if (d.kind === "output_hash") assert.ok(Math.abs(d.distinctRatio - 1 / 3) < 1e-9);
+  });
+
+  it("counts an errored run as its own distinct outcome", async () => {
+    const outcomes = [
+      outcome({ finalText: "x" }),
+      outcome({ finalText: "x" }),
+      outcome({ finalText: "x", errored: true }),
+    ];
+    const d = await scoreArm(oracle, outcomes);
+    if (d.kind === "output_hash") assert.ok(Math.abs(d.distinctRatio - 2 / 3) < 1e-9);
+  });
+
+  it("hashes the diff field when configured", async () => {
+    const diffOracle: OutputHashOracle = { type: "output_hash", field: "diff" };
+    const outcomes = [
+      outcome({ finalText: "differ", diff: "same-diff" }),
+      outcome({ finalText: "wildly-different", diff: "same-diff" }),
+    ];
+    const d = await scoreArm(diffOracle, outcomes);
+    if (d.kind === "output_hash") assert.equal(d.distinctRatio, 0.5, "1 distinct diff over 2 runs");
+  });
+});
+
+describe("scoreArm — assertion", () => {
+  let dir: string;
+  before(async () => {
+    dir = await mkdtemp(join(tmpdir(), "och-eval-oracle-"));
+  });
+  after(async () => {
+    await rm(dir, { recursive: true, force: true });
+  });
+
+  it("runs the shell command in each checkout: exit 0 = pass", async () => {
+    const oracle: AssertionOracle = { type: "assertion", command: "true", timeoutMs: 30_000 };
+    const outcomes = [
+      outcome({ checkoutPath: dir }),
+      outcome({ checkoutPath: dir }),
+      outcome({ checkoutPath: dir }),
+    ];
+    const d = await scoreArm(oracle, outcomes);
+    assert.equal(d.kind, "assertion");
+    if (d.kind === "assertion") {
+      assert.equal(d.passRate, 1);
+      assert.equal(d.stddev, 0);
+    }
+  });
+
+  it("scores exit-non-zero as a fail", async () => {
+    const oracle: AssertionOracle = { type: "assertion", command: "false", timeoutMs: 30_000 };
+    const d = await scoreArm(oracle, [outcome({ checkoutPath: dir })]);
+    if (d.kind === "assertion") assert.equal(d.passRate, 0);
+  });
+
+  it("scores an errored run (no checkout) as a fail without spawning", async () => {
+    const oracle: AssertionOracle = { type: "assertion", command: "true", timeoutMs: 30_000 };
+    const d = await scoreArm(oracle, [outcome({ errored: true })]);
+    if (d.kind === "assertion") assert.equal(d.passRate, 0);
+  });
+
+  it("produces mixed pass-rate dispersion", async () => {
+    // command passes only when the marker file exists; we toggle via cwd trick:
+    // here just alternate true/false commands to simulate a flaky agent.
+    const passOracle: AssertionOracle = { type: "assertion", command: "true", timeoutMs: 30_000 };
+    const failOracle: AssertionOracle = { type: "assertion", command: "false", timeoutMs: 30_000 };
+    const pass = await scoreArm(passOracle, [
+      outcome({ checkoutPath: dir }),
+      outcome({ checkoutPath: dir }),
+    ]);
+    const fail = await scoreArm(failOracle, [
+      outcome({ checkoutPath: dir }),
+      outcome({ checkoutPath: dir }),
+    ]);
+    if (pass.kind === "assertion" && fail.kind === "assertion") {
+      assert.equal(pass.passRate, 1);
+      assert.equal(fail.passRate, 0);
+    }
+  });
+});
+
+describe("scoreArm — judge", () => {
+  const oracle: JudgeOracle = { type: "judge", rubric: "score correctness", panel: 2 };
+
+  it("throws a clear error when no JudgeScorer is supplied", async () => {
+    await assert.rejects(
+      () => scoreArm(oracle, [outcome({ finalText: "x" })]),
+      /requires a JudgeScorer/,
+    );
+  });
+
+  it("averages the panel and computes score stddev", async () => {
+    // Deterministic judge: score = length-based fraction so two outcomes differ.
+    const judge = async (o: RunOutcome): Promise<number> => (o.finalText === "good" ? 0.9 : 0.1);
+    const d = await scoreArm(
+      oracle,
+      [outcome({ finalText: "good" }), outcome({ finalText: "bad" })],
+      { judge },
+    );
+    assert.equal(d.kind, "judge");
+    if (d.kind === "judge") {
+      assert.ok(Math.abs(d.meanScore - 0.5) < 1e-9);
+      assert.ok(d.stddev > 0);
+    }
+  });
+
+  it("scores an errored run as 0 without calling the judge", async () => {
+    let calls = 0;
+    const judge = async (): Promise<number> => {
+      calls += 1;
+      return 1;
+    };
+    const d = await scoreArm(oracle, [outcome({ errored: true })], { judge });
+    if (d.kind === "judge") assert.equal(d.meanScore, 0);
+    assert.equal(calls, 0, "judge not called for an errored run");
+  });
+
+  it("clamps out-of-range judge scores into [0,1]", async () => {
+    const judge = async (): Promise<number> => 5; // misbehaving judge
+    const d = await scoreArm(oracle, [outcome({ finalText: "x" })], { judge });
+    if (d.kind === "judge") assert.equal(d.meanScore, 1);
+  });
+});
diff --git a/packages/eval/src/oracle.ts b/packages/eval/src/oracle.ts
new file mode 100644
index 0000000..ced4b77
--- /dev/null
+++ b/packages/eval/src/oracle.ts
@@ -0,0 +1,139 @@
+/**
+ * Oracle scoring — reduce an arm's N {@link RunOutcome}s to an
+ * {@link ArmDispersion} (spec 010 §2).
+ *
+ *   - `output_hash` — pure: hash the configured field, count distinct values.
+ *   - `assertion`   — run a deterministic shell check against each run's
+ *     checkout; pass/fail → Bernoulli dispersion. A run that errored, or whose
+ *     command exceeds its timeout, scores as a fail (the worst outcome).
+ *   - `judge`       — LLM-panel rubric scoring. The panel call is injected as a
+ *     dependency so the probe core stays free of any model client; v1 ships the
+ *     interface and a guard that fails fast if no judge function is supplied.
+ *
+ * The `assertion` path is the only one that touches the filesystem / spawns a
+ * process; it is kept here (not in the pure dispersion module) so
+ * `dispersion.ts` remains a pure, exhaustively-testable leaf.
+ */
+
+import { spawn } from "node:child_process";
+import { sha256Hex } from "@opencodehub/core-types";
+import {
+  type ArmDispersion,
+  bernoulliDispersion,
+  distinctOutputRatio,
+  populationStddev,
+} from "./dispersion.js";
+import type { RunOutcome } from "./runner.js";
+import type { AssertionOracle, JudgeOracle, Oracle, OutputHashOracle } from "./task.js";
+
+/** A judge-panel scorer: maps one run's outcome to a 0..1 rubric score. */
+export type JudgeScorer = (outcome: RunOutcome, rubric: string) => Promise<number>;
+
+export interface ScoreOptions {
+  /** Required only when the task's oracle is `judge`. */
+  readonly judge?: JudgeScorer;
+}
+
+/** Score an arm's outcomes into a dispersion, dispatching on the oracle type. */
+export async function scoreArm(
+  oracle: Oracle,
+  outcomes: readonly RunOutcome[],
+  options: ScoreOptions = {},
+): Promise<ArmDispersion> {
+  switch (oracle.type) {
+    case "output_hash":
+      return scoreOutputHash(oracle, outcomes);
+    case "assertion":
+      return scoreAssertion(oracle, outcomes);
+    case "judge":
+      return scoreJudge(oracle, outcomes, options.judge);
+  }
+}
+
+function scoreOutputHash(oracle: OutputHashOracle, outcomes: readonly RunOutcome[]): ArmDispersion {
+  const hashes = outcomes.map((o) => {
+    const text = oracle.field === "diff" ? o.diff : o.finalText;
+    // An errored run hashes to a sentinel so a crash counts as its own distinct
+    // outcome rather than colliding with an empty-answer success.
+    return o.errored ? `__errored__:${sha256Hex(text)}` : sha256Hex(text);
+  });
+  return { kind: "output_hash", distinctRatio: distinctOutputRatio(hashes), runs: outcomes.length };
+}
+
+async function scoreAssertion(
+  oracle: AssertionOracle,
+  outcomes: readonly RunOutcome[],
+): Promise<ArmDispersion> {
+  const passes: boolean[] = [];
+  for (const outcome of outcomes) {
+    if (outcome.errored || outcome.checkoutPath === undefined) {
+      // No checkout to assert against, or the agent crashed → worst outcome.
+      passes.push(false);
+      continue;
+    }
+    passes.push(await runAssertionCommand(oracle, outcome.checkoutPath));
+  }
+  const { passRate, stddev } = bernoulliDispersion(passes);
+  return { kind: "assertion", passRate, stddev, runs: outcomes.length };
+}
+
+async function scoreJudge(
+  oracle: JudgeOracle,
+  outcomes: readonly RunOutcome[],
+  judge: JudgeScorer | undefined,
+): Promise<ArmDispersion> {
+  if (judge === undefined) {
+    throw new Error(
+      "eval: the `judge` oracle requires a JudgeScorer to be supplied to scoreArm; " +
+        "none was provided. Pass `options.judge` or use the `assertion`/`output_hash` oracle.",
+    );
+  }
+  const scores: number[] = [];
+  for (const outcome of outcomes) {
+    if (outcome.errored) {
+      scores.push(0); // a crash is the worst rubric score
+      continue;
+    }
+    // Average the panel: call the judge `panel` times and mean the results, so
+    // judge-side noise doesn't inflate the agent-side variance we're measuring.
+    const panelScores: number[] = [];
+    for (let i = 0; i < oracle.panel; i += 1) {
+      panelScores.push(clamp01(await judge(outcome, oracle.rubric)));
+    }
+    scores.push(panelScores.reduce((a, b) => a + b, 0) / panelScores.length);
+  }
+  const meanScore = scores.length === 0 ? 0 : scores.reduce((a, b) => a + b, 0) / scores.length;
+  return { kind: "judge", meanScore, stddev: populationStddev(scores), runs: outcomes.length };
+}
+
+function clamp01(n: number): number {
+  if (!Number.isFinite(n)) return 0;
+  if (n < 0) return 0;
+  if (n > 1) return 1;
+  return n;
+}
+
+/**
+ * Run the assertion command in the run's checkout. Resolves `true` on exit
+ * code 0, `false` otherwise (including spawn failure and timeout). Never
+ * rejects — a broken check is a failed assertion, not a probe crash.
+ */
+function runAssertionCommand(oracle: AssertionOracle, checkoutPath: string): Promise<boolean> {
+  return new Promise((resolve) => {
+    const cwd = oracle.cwd !== undefined ? `${checkoutPath}/${oracle.cwd}` : checkoutPath;
+    const child = spawn(oracle.command, {
+      cwd,
+      shell: true,
+      stdio: "ignore",
+      timeout: oracle.timeoutMs,
+    });
+    let settled = false;
+    const settle = (passed: boolean): void => {
+      if (settled) return;
+      settled = true;
+      resolve(passed);
+    };
+    child.on("error", () => settle(false));
+    child.on("close", (code) => settle(code === 0));
+  });
+}
diff --git a/packages/eval/src/probe.test.ts b/packages/eval/src/probe.test.ts
new file mode 100644
index 0000000..13c73d0
--- /dev/null
+++ b/packages/eval/src/probe.test.ts
@@ -0,0 +1,127 @@
+import { strict as assert } from "node:assert";
+import { describe, it } from "node:test";
+import { DEFAULT_RUNS, resolveHarnesses, runProbe } from "./probe.js";
+import { serializeReport } from "./report.js";
+import type { AgentRunner, Harness, RunOutcome, RunRequest } from "./runner.js";
+import type { Task } from "./task.js";
+
+const TASK: Task = {
+  id: "demo",
+  repo: "/tmp/repo",
+  commit: "c0ffee",
+  instruction: "Add a --json flag.",
+  oracle: { type: "output_hash", field: "final_text" },
+};
+
+/**
+ * A deterministic fake runner: in the with-pack arm it returns a constant
+ * answer (perfectly stable); in the without-pack arm it returns a per-run
+ * distinct answer (maximally unstable). This is the canonical Move-2 shape —
+ * the pack should drive the dispersion delta strongly positive.
+ */
+class FakeRunner implements AgentRunner {
+  readonly name: string;
+  private i = 0;
+  constructor(public readonly harness: Harness) {
+    this.name = `fake:${harness}`;
+  }
+  run(request: RunRequest): Promise<RunOutcome> {
+    this.i += 1;
+    const finalText = request.withPack ? "stable answer" : `wandering answer ${this.i}`;
+    return Promise.resolve({
+      finalText,
+      diff: "",
+      tokens: {
+        inputTokens: request.withPack ? 1100 : 1000,
+        outputTokens: 100,
+        costUsd: null,
+      },
+      errored: false,
+    });
+  }
+}
+
+describe("resolveHarnesses", () => {
+  it("defaults to both agents when the task pins none", () => {
+    assert.deepEqual(resolveHarnesses(TASK, { packContext: "" }), ["claude", "codex"]);
+  });
+  it("uses the task's pinned harness", () => {
+    assert.deepEqual(resolveHarnesses({ ...TASK, harness: "codex" }, { packContext: "" }), [
+      "codex",
+    ]);
+  });
+  it("honors an explicit options.harnesses override", () => {
+    assert.deepEqual(resolveHarnesses(TASK, { packContext: "", harnesses: ["claude"] }), [
+      "claude",
+    ]);
+  });
+});
+
+describe("runProbe (end-to-end with a fake runner)", () => {
+  it("measures a strong positive dispersion delta when the pack stabilizes answers", async () => {
+    const report = await runProbe(TASK, (h) => new FakeRunner(h), {
+      runs: 4,
+      packContext: "PACK BODY",
+      harnesses: ["claude"],
+    });
+    assert.equal(report.schema, 1);
+    assert.equal(report.taskId, "demo");
+    assert.equal(report.harnesses.length, 1);
+    const h = report.harnesses[0];
+    assert.ok(h !== undefined);
+    // without-pack: 4 distinct answers → distinctRatio 1.0
+    // with-pack:    1 distinct answer  → distinctRatio 0.25
+    if (h.without.dispersion.kind === "output_hash") {
+      assert.equal(h.without.dispersion.distinctRatio, 1);
+    }
+    if (h.with.dispersion.kind === "output_hash") {
+      assert.equal(h.with.dispersion.distinctRatio, 0.25);
+    }
+    assert.ok(Math.abs(h.dispersionDelta - 0.75) < 1e-9, "delta = 1.0 − 0.25");
+    // token overhead: with=1200/run, without=1100/run → 1.0909...
+    assert.ok(h.tokenOverhead > 1 && h.tokenOverhead < 1.3);
+    assert.equal(h.tokenOverheadFlagged, false);
+  });
+
+  it("runs both harnesses by default", async () => {
+    const report = await runProbe(TASK, (h) => new FakeRunner(h), {
+      runs: 2,
+      packContext: "PACK",
+    });
+    assert.deepEqual(
+      report.harnesses.map((h) => h.harness),
+      ["claude", "codex"],
+    );
+  });
+
+  it("emits a byte-identical report across two identical probe runs (R6)", async () => {
+    const opts = { runs: 3, packContext: "PACK", harnesses: ["codex"] as Harness[] };
+    const a = await runProbe(TASK, (h) => new FakeRunner(h), opts);
+    const b = await runProbe(TASK, (h) => new FakeRunner(h), opts);
+    assert.equal(serializeReport(a), serializeReport(b));
+  });
+
+  it("defaults to DEFAULT_RUNS when runs is omitted", async () => {
+    const report = await runProbe(TASK, (h) => new FakeRunner(h), {
+      packContext: "PACK",
+      harnesses: ["claude"],
+    });
+    assert.equal(report.harnesses[0]?.runs, DEFAULT_RUNS);
+  });
+
+  it("invokes the per-run progress callback for both arms", async () => {
+    const events: string[] = [];
+    await runProbe(TASK, (h) => new FakeRunner(h), {
+      runs: 2,
+      packContext: "PACK",
+      harnesses: ["claude"],
+      onRun: (e) => events.push(`${e.harness}:${e.arm}:${e.index}/${e.runs}`),
+    });
+    assert.deepEqual(events, [
+      "claude:without:1/2",
+      "claude:without:2/2",
+      "claude:with:1/2",
+      "claude:with:2/2",
+    ]);
+  });
+});
diff --git a/packages/eval/src/probe.ts b/packages/eval/src/probe.ts
new file mode 100644
index 0000000..4432638
--- /dev/null
+++ b/packages/eval/src/probe.ts
@@ -0,0 +1,153 @@
+/**
+ * The variance-probe experiment loop (spec 010 §3).
+ *
+ * For each harness, for each arm (without-pack, with-pack), run the agent N
+ * times via the injected {@link AgentRunner}, capturing each outcome. Score the
+ * arm's outcomes with the task's oracle, sum tokens, and assemble a
+ * {@link HarnessReport}. The loop is harness- and inference-backend-agnostic:
+ * everything agent-specific lives behind the runner; everything model-specific
+ * (Bedrock wiring) lives in the runner impl.
+ *
+ * Controls that keep the number honest (§3): same instruction / commit / agent
+ * / model across both arms — the runner enforces this by taking identical
+ * inputs and flipping only `withPack`. Fresh session per run is the runner's
+ * responsibility (the CLI runner spawns a new process each call).
+ */
+
+import { type ScoreOptions, scoreArm } from "./oracle.js";
+import {
+  type ArmReport,
+  buildHarnessReport,
+  type HarnessReport,
+  type VarianceReport,
+} from "./report.js";
+import type { AgentRunner, Harness, RunOutcome } from "./runner.js";
+import type { Task } from "./task.js";
+
+/** Default runs per arm (spec 010 §7.2). */
+export const DEFAULT_RUNS = 10;
+
+export interface ProbeOptions {
+  /** Runs per arm. Defaults to {@link DEFAULT_RUNS}. */
+  readonly runs?: number;
+  /**
+   * Which harnesses to run. Defaults to the task's `harness` (one) or the
+   * full set ["claude", "codex"] when the task pins none (§7.3).
+   */
+  readonly harnesses?: readonly Harness[];
+  /**
+   * The OCH pack context to inject in the with-pack arm. The CLI generates
+   * this once per task and passes it in, so the probe never imports
+   * `@opencodehub/pack` (keeps the package graph acyclic).
+   */
+  readonly packContext: string;
+  /** Required only when the task's oracle is `judge`. */
+  readonly score?: ScoreOptions;
+  /**
+   * Per-run progress callback. Pure-side-channel — never affects the report.
+   */
+  readonly onRun?: (event: ProbeRunEvent) => void;
+}
+
+export interface ProbeRunEvent {
+  readonly harness: Harness;
+  readonly arm: "without" | "with";
+  /** 1-based run index within the arm. */
+  readonly index: number;
+  readonly runs: number;
+}
+
+/** Resolve which harnesses to run from the task + options. */
+export function resolveHarnesses(task: Task, options: ProbeOptions): readonly Harness[] {
+  if (options.harnesses !== undefined && options.harnesses.length > 0) {
+    return options.harnesses;
+  }
+  if (task.harness !== undefined) return [task.harness];
+  return ["claude", "codex"];
+}
+
+/** Sum an arm's per-run token accounting into {@link ArmReport} totals. */
+function sumTokens(outcomes: readonly RunOutcome[]): ArmReport["tokens"] {
+  let inputTokens = 0;
+  let outputTokens = 0;
+  let costUsd = 0;
+  let everyRunHadCost = true;
+  for (const o of outcomes) {
+    inputTokens += o.tokens.inputTokens;
+    outputTokens += o.tokens.outputTokens;
+    if (o.tokens.costUsd === null) everyRunHadCost = false;
+    else costUsd += o.tokens.costUsd;
+  }
+  return { inputTokens, outputTokens, costUsd: everyRunHadCost ? costUsd : null };
+}
+
+/** Run one arm: N invocations of the agent, then score + total tokens. */
+async function runArm(
+  runner: AgentRunner,
+  task: Task,
+  harness: Harness,
+  withPack: boolean,
+  options: ProbeOptions,
+  runs: number,
+): Promise<ArmReport> {
+  const packContext = options.packContext;
+  const outcomes: RunOutcome[] = [];
+  for (let i = 0; i < runs; i += 1) {
+    options.onRun?.({ harness, arm: withPack ? "with" : "without", index: i + 1, runs });
+    const outcome = await runner.run({
+      task,
+      harness,
+      withPack,
+      ...(withPack ? { packContext } : {}),
+    });
+    outcomes.push(outcome);
+  }
+  const dispersion = await scoreArm(task.oracle, outcomes, options.score ?? {});
+  return { dispersion, tokens: sumTokens(outcomes) };
+}
+
+/**
+ * Run the full probe for one harness: both arms + report assembly.
+ */
+export async function probeHarness(
+  runner: AgentRunner,
+  task: Task,
+  harness: Harness,
+  options: ProbeOptions,
+): Promise<HarnessReport> {
+  const runs = options.runs ?? DEFAULT_RUNS;
+  // without-pack first, then with-pack — order is immaterial to the report
+  // (it's a pure function of the captured outcomes), but running the cheaper
+  // baseline arm first surfaces a misconfigured runner before the with-pack
+  // arm spends pack-inflated tokens.
+  const without = await runArm(runner, task, harness, false, options, runs);
+  const withPack = await runArm(runner, task, harness, true, options, runs);
+  return buildHarnessReport({
+    harness,
+    runner: runner.name,
+    runs,
+    without,
+    with: withPack,
+  });
+}
+
+/**
+ * Run the variance probe across every resolved harness and assemble the
+ * top-level {@link VarianceReport}.
+ *
+ * `runnerFor` maps a harness to its runner — the CLI layer supplies a factory
+ * that returns the Bedrock-wired direct-CLI runner for each agent. Injecting it
+ * keeps the probe core free of any process-spawning code.
+ */
+export async function runProbe(
+  task: Task,
+  runnerFor: (harness: Harness) => AgentRunner,
+  options: ProbeOptions,
+): Promise<VarianceReport> {
+  const harnesses = resolveHarnesses(task, options);
+  const reports: HarnessReport[] = [];
+  for (const harness of harnesses) {
+    reports.push(await probeHarness(runnerFor(harness), task, harness, options));
+  }
+  return { schema: 1, taskId: task.id, harnesses: reports };
+}
diff --git a/packages/eval/src/report.test.ts b/packages/eval/src/report.test.ts
new file mode 100644
index 0000000..81c887d
--- /dev/null
+++ b/packages/eval/src/report.test.ts
@@ -0,0 +1,121 @@
+import { strict as assert } from "node:assert";
+import { describe, it } from "node:test";
+import type { ArmDispersion } from "./dispersion.js";
+import {
+  type ArmReport,
+  buildHarnessReport,
+  formatReport,
+  serializeReport,
+  TOKEN_OVERHEAD_FLAG,
+  type VarianceReport,
+} from "./report.js";
+
+const assertionDispersion = (passRate: number, stddev: number): ArmDispersion => ({
+  kind: "assertion",
+  passRate,
+  stddev,
+  runs: 10,
+});
+
+const arm = (stddev: number, input: number, output: number): ArmReport => ({
+  dispersion: assertionDispersion(0.5, stddev),
+  tokens: { inputTokens: input, outputTokens: output, costUsd: null },
+});
+
+describe("buildHarnessReport", () => {
+  it("computes the without − with dispersion delta (positive = pack helped)", () => {
+    const report = buildHarnessReport({
+      harness: "claude",
+      runner: "cli:claude",
+      runs: 10,
+      without: arm(0.5, 1000, 200),
+      with: arm(0.2, 1100, 220),
+    });
+    assert.ok(Math.abs(report.dispersionDelta - 0.3) < 1e-9, "0.5 − 0.2 = 0.3");
+  });
+
+  it("computes token overhead as with/without totals", () => {
+    const report = buildHarnessReport({
+      harness: "codex",
+      runner: "cli:codex",
+      runs: 10,
+      without: arm(0.5, 1000, 0),
+      with: arm(0.2, 1100, 0),
+    });
+    assert.ok(Math.abs(report.tokenOverhead - 1.1) < 1e-9);
+    assert.equal(report.tokenOverheadFlagged, false, "1.1× is under the 1.3× flag");
+  });
+
+  it("flags when token overhead exceeds the guardrail", () => {
+    const report = buildHarnessReport({
+      harness: "claude",
+      runner: "cli:claude",
+      runs: 10,
+      without: arm(0.5, 1000, 0),
+      with: arm(0.2, 1400, 0),
+    });
+    assert.ok(report.tokenOverhead > TOKEN_OVERHEAD_FLAG);
+    assert.equal(report.tokenOverheadFlagged, true);
+  });
+
+  it("reports overhead 0 (never Infinity/NaN) when the baseline arm spent no tokens", () => {
+    const report = buildHarnessReport({
+      harness: "claude",
+      runner: "cli:claude",
+      runs: 1,
+      without: arm(0, 0, 0),
+      with: arm(0, 500, 100),
+    });
+    assert.equal(report.tokenOverhead, 0);
+    assert.equal(report.tokenOverheadFlagged, false);
+  });
+});
+
+describe("serializeReport (determinism, R6)", () => {
+  const report: VarianceReport = {
+    schema: 1,
+    taskId: "demo-task",
+    harnesses: [
+      buildHarnessReport({
+        harness: "claude",
+        runner: "cli:claude",
+        runs: 10,
+        without: arm(0.5, 1000, 200),
+        with: arm(0.2, 1100, 220),
+      }),
+    ],
+  };
+
+  it("is a pure function of the report (byte-identical across calls)", () => {
+    assert.equal(serializeReport(report), serializeReport(report));
+  });
+
+  it("sorts object keys canonically (no clock/run-id leaks in)", () => {
+    const json = serializeReport(report);
+    // canonicalJson sorts keys: "harnesses" before "schema" before "taskId".
+    assert.ok(json.startsWith('{"harnesses":'));
+    assert.ok(!json.includes("Date"));
+    assert.ok(!json.includes("timestamp"));
+  });
+});
+
+describe("formatReport", () => {
+  it("renders the flag marker when overhead is high", () => {
+    const flagged: VarianceReport = {
+      schema: 1,
+      taskId: "t",
+      harnesses: [
+        buildHarnessReport({
+          harness: "claude",
+          runner: "cli:claude",
+          runs: 10,
+          without: arm(0.5, 1000, 0),
+          with: arm(0.2, 1500, 0),
+        }),
+      ],
+    };
+    const text = formatReport(flagged);
+    assert.ok(text.includes("FLAG"));
+    assert.ok(text.includes("token overhead"));
+  });
+});
diff --git a/packages/eval/src/report.ts b/packages/eval/src/report.ts
new file mode 100644
index 0000000..ec31209
--- /dev/null
+++ b/packages/eval/src/report.ts
@@ -0,0 +1,145 @@
+/**
+ * The variance-probe report (spec 010 §5, R6).
+ *
+ * The report is a **pure function of the captured run outcomes** — no
+ * wall-clock, no run-id, no absolute paths. Two probe runs over the same
+ * captured outcomes serialize byte-identically (the context-bom discipline).
+ * Serialization goes through `core-types`' `canonicalJson` (sorted keys), so
+ * the emitted `--json` is reproducible.
+ *
+ * Token overhead is a first-class output, not a footnote: the paper's claim is
+ * "halves variance at ~10% more tokens", so a probe that halves variance at
+ * 3× tokens is a worse story. The report flags (never fails) when overhead
+ * exceeds {@link TOKEN_OVERHEAD_FLAG} — "you bought stability expensively."
+ */
+
+import { canonicalJson } from "@opencodehub/core-types";
+import type { ArmDispersion } from "./dispersion.js";
+import { dispersionScalar } from "./dispersion.js";
+
+/**
+ * Token-overhead guardrail (spec 010 §7.4). Above this ratio the report flags
+ * that stability was bought expensively. A reported constant, never a gate.
+ */
+export const TOKEN_OVERHEAD_FLAG = 1.3;
+
+/** Aggregate token totals for one arm. */
+export interface ArmTokens {
+  readonly inputTokens: number;
+  readonly outputTokens: number;
+  /** Sum of per-run cost when every run reported it; `null` otherwise. */
+  readonly costUsd: number | null;
+}
+
+/** One arm's measured result (with-pack or without-pack). */
+export interface ArmReport {
+  readonly dispersion: ArmDispersion;
+  readonly tokens: ArmTokens;
+}
+
+/** The full per-harness probe result. */
+export interface HarnessReport {
+  /** Which agent produced this result (e.g. "claude", "codex"). */
+  readonly harness: string;
+  /** Runner name (e.g. "cli:claude"). */
+  readonly runner: string;
+  readonly runs: number;
+  readonly without: ArmReport;
+  readonly with: ArmReport;
+  /**
+   * `without − with` of the dispersion scalar. Positive = the pack reduced
+   * variance (the Move-2 claim). The headline number.
+   */
+  readonly dispersionDelta: number;
+  /** with-pack tokens / without-pack tokens (total in+out). */
+  readonly tokenOverhead: number;
+  /** True when `tokenOverhead` exceeds {@link TOKEN_OVERHEAD_FLAG}. */
+  readonly tokenOverheadFlagged: boolean;
+}
+
+/** The top-level report the probe emits. */
+export interface VarianceReport {
+  /** Report schema version, so consumers can branch on shape changes. */
+  readonly schema: 1;
+  /** The task id this report measures. */
+  readonly taskId: string;
+  /** One entry per harness the probe ran. */
+  readonly harnesses: readonly HarnessReport[];
+}
+
+/** Sum input+output tokens for an arm. */
+function totalTokens(t: ArmTokens): number {
+  return t.inputTokens + t.outputTokens;
+}
+
+/**
+ * Assemble a {@link HarnessReport} from two scored arms + their token totals.
+ * Pure: identical inputs → identical output.
+ */
+export function buildHarnessReport(input: {
+  readonly harness: string;
+  readonly runner: string;
+  readonly runs: number;
+  readonly without: ArmReport;
+  readonly with: ArmReport;
+}): HarnessReport {
+  const dispersionDelta =
+    dispersionScalar(input.without.dispersion) - dispersionScalar(input.with.dispersion);
+  const withoutTotal = totalTokens(input.without.tokens);
+  const withTotal = totalTokens(input.with.tokens);
+  // Overhead is undefined when the baseline arm spent no tokens; report 0 in
+  // that degenerate case rather than Infinity/NaN, and never flag it.
+  const tokenOverhead = withoutTotal === 0 ? 0 : withTotal / withoutTotal;
+  return {
+    harness: input.harness,
+    runner: input.runner,
+    runs: input.runs,
+    without: input.without,
+    with: input.with,
+    dispersionDelta,
+    tokenOverhead,
+    tokenOverheadFlagged: tokenOverhead > TOKEN_OVERHEAD_FLAG,
+  };
+}
+
+/**
+ * Canonical JSON for the report (R6). Sorted keys, no clock/run-id — byte-stable
+ * across processes given the same captured outcomes.
+ */
+export function serializeReport(report: VarianceReport): string {
+  return canonicalJson(report);
+}
+
+/**
+ * Render a short human-readable summary of a report. Kept separate from the
+ * machine JSON so the CLI can print one or the other.
+ */
+export function formatReport(report: VarianceReport): string {
+  const lines: string[] = [];
+  lines.push(`Variance probe — task: ${report.taskId}`);
+  for (const h of report.harnesses) {
+    lines.push("");
+    lines.push(`  ${h.harness} (${h.runner}, N=${h.runs})`);
+    lines.push(`    without-pack dispersion: ${fmtDispersion(h.without.dispersion)}`);
+    lines.push(`    with-pack dispersion:    ${fmtDispersion(h.with.dispersion)}`);
+    lines.push(`    delta (without − with):  ${h.dispersionDelta.toFixed(4)}`);
+    lines.push(
+      `    token overhead:          ${h.tokenOverhead.toFixed(3)}×` +
+        (h.tokenOverheadFlagged
+          ? `  [FLAG: > ${TOKEN_OVERHEAD_FLAG}× — stability bought expensively]`
+          : ""),
+    );
+  }
+  return lines.join("\n");
+}
+
+function fmtDispersion(d: ArmDispersion): string {
+  switch (d.kind) {
+    case "output_hash":
+      return `distinct-output ratio ${d.distinctRatio.toFixed(4)}`;
+    case "assertion":
+      return `pass-rate ${d.passRate.toFixed(4)} (stddev ${d.stddev.toFixed(4)})`;
+    case "judge":
+      return `mean-score ${d.meanScore.toFixed(4)} (stddev ${d.stddev.toFixed(4)})`;
+  }
+}
diff --git a/packages/eval/src/runner.ts b/packages/eval/src/runner.ts
new file mode 100644
index 0000000..d826284
--- /dev/null
+++ b/packages/eval/src/runner.ts
@@ -0,0 +1,86 @@
+/**
+ * `AgentRunner` — the harness-agnostic seam the probe drives (spec 010 §4).
+ *
+ * The probe core knows nothing about *how* an agent runs; it only asks a
+ * runner to execute a task in one arm and hand back the captured outcome. v1
+ * ships a direct-CLI runner (`cli-runner.ts`) that shells out to `claude -p` /
+ * `codex exec` with Bedrock wired (§4a). A future omnigent-backed runner (v2)
+ * implements the same interface and drops in without touching the probe.
+ *
+ * Determinism note: a runner is free to be nondeterministic *within* an arm
+ * (that's the variance being measured) but must hold every controlled input —
+ * commit, instruction, agent, model — identical *between* arms. The only
+ * manipulated variable is `withPack`.
+ */
+
+import type { Task } from "./task.js";
+
+/** Which coding agent a runner drives. */
+export type Harness = "claude" | "codex";
+
+/** Inputs handed to a runner for a single agent invocation. */
+export interface RunRequest {
+  /** The task being run (repo/commit/instruction). */
+  readonly task: Task;
+  /** Which agent to drive. */
+  readonly harness: Harness;
+  /**
+   * When true, the OCH code-pack for `repo@commit` is injected into the
+   * agent's context (the with-pack arm); when false, only the bare
+   * instruction is given (the without-pack arm).
+   */
+  readonly withPack: boolean;
+  /**
+   * The pack context to inject when `withPack` is true. The CLI layer
+   * generates the pack once per task and threads it in here, so the probe
+   * core (and `@opencodehub/eval`) never depends on `@opencodehub/pack` —
+   * keeping the package boundary acyclic. Absent on the without-pack arm.
+   */
+  readonly packContext?: string;
+}
+
+/** Token accounting captured from one agent run. */
+export interface RunTokens {
+  readonly inputTokens: number;
+  readonly outputTokens: number;
+  /** Total cost in USD when the harness reports it; `null` when unavailable. */
+  readonly costUsd: number | null;
+}
+
+/** What a runner captures from a single agent invocation. */
+export interface RunOutcome {
+  /** The agent's final answer text. */
+  readonly finalText: string;
+  /**
+   * The unified diff the agent produced, when the harness exposes one.
+   * Empty string when the run produced no patch or the harness doesn't
+   * surface diffs.
+   */
+  readonly diff: string;
+  /** Token + cost accounting for the run. */
+  readonly tokens: RunTokens;
+  /**
+   * Path to the (possibly mutated) repo checkout this run produced, so an
+   * `assertion` oracle can execute its check against the run's result. Absent
+   * when the runner does not materialize a per-run checkout.
+   */
+  readonly checkoutPath?: string;
+  /**
+   * True when the agent invocation itself failed (non-zero exit, crash,
+   * timeout) as opposed to completing with a (possibly wrong) answer. A failed
+   * run still counts toward the arm — a pack that makes the agent crash less is
+   * lower variance — but the oracle treats it as the worst outcome.
+   */
+  readonly errored: boolean;
+}
+
+/**
+ * Runs an agent for a single task invocation. Implementations: the direct-CLI
+ * runner (v1) and, later, an omnigent-backed runner (v2).
+ */
+export interface AgentRunner {
+  /** Stable name for the report (e.g. "cli:claude", "cli:codex"). */
+  readonly name: string;
+  /** Execute one invocation and capture its outcome. */
+  run(request: RunRequest): Promise<RunOutcome>;
+}
diff --git a/packages/eval/src/task.test.ts b/packages/eval/src/task.test.ts
new file mode 100644
index 0000000..eeff721
--- /dev/null
+++ b/packages/eval/src/task.test.ts
@@ -0,0 +1,116 @@
+import { strict as assert } from "node:assert";
+import { mkdtemp, rm, writeFile } from "node:fs/promises";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { after, before, describe, it } from "node:test";
+import { loadTask, TaskValidationError } from "./task.js";
+
+describe("loadTask", () => {
+  let dir: string;
+  before(async () => {
+    dir = await mkdtemp(join(tmpdir(), "och-eval-task-"));
+  });
+  after(async () => {
+    await rm(dir, { recursive: true, force: true });
+  });
+
+  async function write(name: string, body: string): Promise<string> {
+    const p = join(dir, name);
+    await write_(p, body);
+    return p;
+  }
+  async function write_(p: string, body: string): Promise<void> {
+    await writeFile(p, body, "utf8");
+  }
+
+  it("loads a valid YAML task with an assertion oracle and applies defaults", async () => {
+    const p = await write(
+      "ok.yaml",
+      [
+        "id: add-json-flag",
+        "repo: /tmp/some-repo",
+        "commit: abc123",
+        "instruction: Add a --json flag to status.",
+        "oracle:",
+        "  type: assertion",
+        "  command: npm test",
+        "",
+      ].join("\n"),
+    );
+    const task = await loadTask(p);
+    assert.equal(task.id, "add-json-flag");
+    assert.equal(task.oracle.type, "assertion");
+    if (task.oracle.type === "assertion") {
+      assert.equal(task.oracle.command, "npm test");
+      assert.equal(task.oracle.timeoutMs, 120_000, "default timeout applied");
+    }
+    assert.equal(task.harness, undefined, "harness optional");
+  });
+
+  it("loads a valid JSON task (yaml parser is a JSON superset)", async () => {
+    const p = await write(
+      "ok.json",
+      JSON.stringify({
+        id: "t",
+        repo: "/r",
+        commit: "c",
+        instruction: "do it",
+        oracle: { type: "output_hash" },
+        harness: "claude",
+      }),
+    );
+    const task = await loadTask(p);
+    assert.equal(task.harness, "claude");
+    assert.equal(task.oracle.type, "output_hash");
+    if (task.oracle.type === "output_hash") {
+      assert.equal(task.oracle.field, "final_text", "default field applied");
+    }
+  });
+
+  it("throws TaskValidationError on a missing file", async () => {
+    await assert.rejects(() => loadTask(join(dir, "nope.yaml")), TaskValidationError);
+  });
+
+  it("throws TaskValidationError on an empty file", async () => {
+    const p = await write("empty.yaml", "");
+    await assert.rejects(() => loadTask(p), TaskValidationError);
+  });
+
+  it("throws TaskValidationError on a schema violation (missing instruction)", async () => {
+    const p = await write(
+      "bad.yaml",
+      ["id: t", "repo: /r", "commit: c", "oracle:", "  type: output_hash", ""].join("\n"),
+    );
+    await assert.rejects(
+      () => loadTask(p),
+      (err: unknown) => err instanceof TaskValidationError && /instruction/.test(err.message),
+    );
+  });
+
+  it("rejects an unknown oracle type", async () => {
+    const p = await write(
+      "badoracle.yaml",
+      ["id: t", "repo: /r", "commit: c", "instruction: x", "oracle:", "  type: telepathy", ""].join(
+        "\n",
+      ),
+    );
+    await assert.rejects(() => loadTask(p), TaskValidationError);
+  });
+
+  it("rejects unknown top-level keys (strict schema)", async () => {
+    const p = await write(
+      "extra.yaml",
+      [
+        "id: t",
+        "repo: /r",
+        "commit: c",
+        "instruction: x",
+        "oracle:",
+        "  type: output_hash",
+        "surprise: true",
+        "",
+      ].join("\n"),
+    );
+    await assert.rejects(() => loadTask(p), TaskValidationError);
+  });
+});
diff --git a/packages/eval/src/task.ts b/packages/eval/src/task.ts
new file mode 100644
index 0000000..dda6288
--- /dev/null
+++ b/packages/eval/src/task.ts
@@ -0,0 +1,160 @@
+/**
+ * Task definition + loader for the variance probe.
+ *
+ * A **task** is the fixed unit the probe runs repeatedly (spec 010 §2):
+ *
+ *   > a triple `(repo @ commit, instruction, success_oracle)` run by a coding
+ *   > agent, where the agent's only variable input across the experiment is
+ *   > whether the OCH pack is in its context.
+ *
+ * The task file is a small YAML or JSON document. We validate it with Zod so a
+ * malformed task surfaces a precise error rather than a cryptic runtime failure
+ * mid-experiment (the experiment costs real agent minutes — fail fast at load).
+ *
+ * Three oracle shapes, in increasing cost (§2):
+ *   - `output_hash` — no scoring agent; dispersion = distinct-output ratio.
+ *   - `assertion`   — a deterministic shell check; dispersion = pass-rate stddev.
+ *   - `judge`       — an LLM-panel rubric; dispersion = stddev of scores.
+ */
+
+import { readFile } from "node:fs/promises";
+import { parse as parseYaml, YAMLParseError } from "yaml";
+import { z } from "zod";
+
+/** Oracle that scores each run by whether its output text differs across N. */
+export const OutputHashOracleSchema = z
+  .object({
+    type: z.literal("output_hash"),
+    /**
+     * Which captured field to hash for the distinct-output ratio. `final_text`
+     * (the agent's answer) is the default; `diff` hashes the produced patch.
+     */
+    field: z.enum(["final_text", "diff"]).default("final_text"),
+  })
+  .strict();
+
+/** Oracle that scores each run pass/fail via a deterministic shell command. */
+export const AssertionOracleSchema = z
+  .object({
+    type: z.literal("assertion"),
+    /**
+     * Shell command run in the (post-run) repo checkout. Exit code 0 = pass,
+     * non-zero = fail. This is the most defensible "variance" — it's objective.
+     */
+    command: z.string().min(1),
+    /**
+     * Optional working directory for the command, relative to the run's repo
+     * checkout. Defaults to the checkout root.
+     */
+    cwd: z.string().optional(),
+    /** Per-command timeout in milliseconds. Defaults to 120_000. */
+    timeoutMs: z.number().int().positive().default(120_000),
+  })
+  .strict();
+
+/** Oracle that scores each run 0..1 via an LLM-judge panel rubric. */
+export const JudgeOracleSchema = z
+  .object({
+    type: z.literal("judge"),
+    /** The rubric handed to the judge panel, verbatim. */
+    rubric: z.string().min(1),
+    /** Panel size — how many independent judge runs to average per outcome. */
+    panel: z.number().int().positive().default(3),
+  })
+  .strict();
+
+export const OracleSchema = z.discriminatedUnion("type", [
+  OutputHashOracleSchema,
+  AssertionOracleSchema,
+  JudgeOracleSchema,
+]);
+
+export const TaskSchema = z
+  .object({
+    /** Human-facing task id (used to label the report). */
+    id: z.string().min(1),
+    /** Repo location — a local path or a clonable git URL. */
+    repo: z.string().min(1),
+    /** Pinned commit SHA. Frozen so the pack is the only variable. */
+    commit: z.string().min(1),
+    /** The natural-language ask, given verbatim to the agent every run. */
+    instruction: z.string().min(1),
+    /** How a run is scored. */
+    oracle: OracleSchema,
+    /**
+     * Optional harness selector. Omitted → the probe runs the configured
+     * default set (Claude Code + Codex). A value pins one agent.
+     */
+    harness: z.enum(["claude", "codex"]).optional(),
+  })
+  .strict();
+
+export type OutputHashOracle = z.infer<typeof OutputHashOracleSchema>;
+export type AssertionOracle = z.infer<typeof AssertionOracleSchema>;
+export type JudgeOracle = z.infer<typeof JudgeOracleSchema>;
+export type Oracle = z.infer<typeof OracleSchema>;
+export type Task = z.infer<typeof TaskSchema>;
+
+export class TaskValidationError extends Error {
+  override readonly name = "TaskValidationError";
+}
+
+interface NodeFsError {
+  readonly code?: string;
+}
+
+function isEnoent(err: unknown): boolean {
+  if (typeof err !== "object" || err === null) return false;
+  return (err as NodeFsError).code === "ENOENT";
+}
+
+/**
+ * Read + parse + validate a task file. YAML and JSON are both accepted — the
+ * `yaml` parser is a JSON superset, so one code path handles both. A missing
+ * file, malformed document, or schema violation throws {@link TaskValidationError}
+ * with a precise message, so the probe never starts an expensive experiment on
+ * a bad task.
+ */
+export async function loadTask(filePath: string): Promise<Task> {
+  let raw: string;
+  try {
+    raw = await readFile(filePath, "utf8");
+  } catch (err) {
+    if (isEnoent(err)) {
+      throw new TaskValidationError(`task file not found: ${filePath}`);
+    }
+    throw err;
+  }
+
+  let parsed: unknown;
+  try {
+    parsed = parseYaml(raw);
+  } catch (err) {
+    const message =
+      err instanceof YAMLParseError
+        ? err.message
+        : err instanceof Error
+          ? err.message
+          : String(err);
+    throw new TaskValidationError(`failed to parse ${filePath}: ${message}`);
+  }
+
+  if (parsed === null || parsed === undefined) {
+    throw new TaskValidationError(`task file is empty: ${filePath}`);
+  }
+
+  const result = TaskSchema.safeParse(parsed);
+  if (!result.success) {
+    throw new TaskValidationError(`invalid task in ${filePath}: ${formatZodError(result.error)}`);
+  }
+  return result.data;
+}
+
+function formatZodError(error: z.ZodError): string {
+  const parts: string[] = [];
+  for (const issue of error.issues) {
+    const path = issue.path.length > 0 ? issue.path.map((seg) => String(seg)).join(".") : "<root>";
+    parts.push(`${path}: ${issue.message}`);
+  }
+  return parts.join("; ");
+}
diff --git a/packages/eval/tsconfig.json b/packages/eval/tsconfig.json
new file mode 100644
index 0000000..60268a8
--- /dev/null
+++ b/packages/eval/tsconfig.json
@@ -0,0 +1,10 @@
+{
+  "extends": "../../tsconfig.base.json",
+  "compilerOptions": {
+    "rootDir": "src",
+    "outDir": "dist",
+    "composite": true
+  },
+  "include": ["src/**/*"],
+  "references": [{ "path": "../core-types" }]
+}
diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
index 4af5725..afefa57 100644
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -173,6 +173,9 @@ importers:
       '@opencodehub/embedder':
         specifier: workspace:*
         version: link:../embedder
+      '@opencodehub/eval':
+        specifier: workspace:*
+        version: link:../eval
       '@opencodehub/ingestion':
         specifier: workspace:*
         version: link:../ingestion
@@ -296,6 +299,25 @@ importers:
         specifier: 1.27.0
         version: 1.27.0
 
+  packages/eval:
+    dependencies:
+      '@opencodehub/core-types':
+        specifier: workspace:*
+        version: link:../core-types
+      yaml:
+        specifier: 2.9.0
+        version: 2.9.0
+      zod:
+        specifier: 4.4.3
+        version: 4.4.3
+    devDependencies:
+      '@types/node':
+        specifier: 26.0.1
+        version: 26.0.1
+      typescript:
+        specifier: 6.0.3
+        version: 6.0.3
+
   packages/frameworks:
     dependencies:
       '@iarna/toml':
@@ -7850,7 +7872,7 @@ snapshots:
 
   '@types/sax@1.2.7':
     dependencies:
-      '@types/node': 24.13.2
+      '@types/node': 26.0.1
 
   '@types/spdx-correct@3.1.3': {}
 
@@ -11926,3 +11948,9 @@ snapshots:
   zod@4.4.3: {}
 
   zwitch@2.0.4: {}
+
+time:
+  '@types/node@26.0.1': '2026-06-24T20:33:01.352Z'
+  typescript@6.0.3: '2026-04-16T23:38:27.905Z'
+  yaml@2.9.0: '2026-05-11T10:16:24.045Z'
+  zod@4.4.3: '2026-05-04T07:06:40.819Z'
diff --git a/tsconfig.json b/tsconfig.json
index da0bfd8..9340841 100644
--- a/tsconfig.json
+++ b/tsconfig.json
@@ -11,6 +11,7 @@
     { "path": "./packages/analysis" },
     { "path": "./packages/pack" },
     { "path": "./packages/policy" },
+    { "path": "./packages/eval" },
     { "path": "./packages/mcp" },
     { "path": "./packages/cli" },
     { "path": "./packages/summarizer" },