From cd61e5e02fb91431b9941b26f835ba833fb7006a Mon Sep 17 00:00:00 2001 From: Laith Al-Saadoon Date: Tue, 30 Jun 2026 03:49:39 +0000 Subject: [PATCH 1/2] =?UTF-8?q?docs:=20spec=20010=20=E2=80=94=20pack=20--v?= =?UTF-8?q?ariance-probe=20(draft=20for=20review)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Move 2 spec, built on the Move 6 ruling (contract pivots byte-identity to decision-equivalence). Defines "task" as a fixed (repo@commit, instruction, success_oracle) triple; three oracle types (output-hash / assertion / judge) with a precise per-arm dispersion metric each; the with/without experimental design with token-overhead as a first-class output; and an AgentRunner interface with an omnigent-backed default (grounded: alpha, 5.5k stars, Apache-2.0, drives Claude Code + Codex from one harness) plus a dependency-light direct-CLI fallback. Lands in the packages/eval stub. 7 EARS reqs + 5 open questions for review. NO implementation — review gate first. --- .erpaval/specs/010-variance-probe/spec.md | 114 ++++++++++++++++++++++ 1 file changed, 114 insertions(+) create mode 100644 .erpaval/specs/010-variance-probe/spec.md diff --git a/.erpaval/specs/010-variance-probe/spec.md b/.erpaval/specs/010-variance-probe/spec.md new file mode 100644 index 0000000..68c03c0 --- /dev/null +++ b/.erpaval/specs/010-variance-probe/spec.md @@ -0,0 +1,114 @@ +# Spec 010 — `pack --variance-probe`: measure the variance an OCH pack removes + +**Status:** Draft for review (NO code yet — Laith wants to understand deeply first) +**Author:** Bonk + Laith · **Date:** 2026-06-30 +**Branch:** `spec/010-variance-probe` (off `main` @ `6b4d122`) +**Roadmap origin:** M-W-F run 2026-06-29, Move 2 (HIGH). Depends on the Move 6 ruling (below). +**Grounding validator:** arXiv:2606.26979 — deterministic anchoring "roughly halves run-to-run variance at ~10% more tokens." + +--- + +## 0. The Move 6 ruling this spec is built on (decision-equivalence) + +Laith ruled (2026-06-30): **pivot the contract from byte-identity to decision-equivalence.** This reframes what `--variance-probe` measures, so it's stated up front. + +- **Old contract (struck):** "same `(commit, tokenizer, budget, pins)` ⇒ byte-identical pack." Brittle — the #252 embedder swap (F2LLM-v2-80M, 320-dim) breaks byte-identity while the retrieval decision is unchanged. An auditor doesn't care about bytes; they care whether the agent saw the right thing. +- **New contract:** "same inputs ⇒ provably the **same retrieval decision set** (same files + byte ranges selected under the same budget)." Byte-identity becomes one cheap *witness* of decision-equivalence, not the contract itself. +- **Why this matters for Move 2:** the variance probe is *how you measure the new contract holding*. If the pack genuinely pins the agent's decision, run-to-run answer variance drops. The probe turns "decision-equivalence" from an assertion into a number. So Move 6 (the contract) and Move 2 (the measurement) are one story: **Move 2 is the instrument that proves Move 6's claim.** +- Move 6 also implies a sibling `codehub replay` that asserts decision-equivalence structurally (same selection, not same bytes) — specced separately (011); 010 is the empirical/behavioral half. + +--- + +## 1. Diagnosis — why this move, why now + +OCH's pitch has leaned on the adjective "deterministic." That word is now contested vocabulary (LeanCTX "token receipt", Rel(AI)Build, the receipt-tool cluster all claim it). A **measured variance delta** is a number rivals can't claim by relabeling. arXiv:2606.26979 (Jun 2026) gives the citable result — deterministic anchoring halves agent run-to-run variance — and `--variance-probe` is how OCH demonstrates *its own* pack does this on a real repo, not just cites a paper. + +The headline this earns: **"an OCH pack halves how much a coding agent's answer wanders run-to-run, at ~10% token cost"** — backed by a reproducible measurement the user can run. + +## 2. The hard part: what is a "task"? (Laith's first question) + +A "task" must be precise enough to run repeatedly and score. Definition: + +> A **task** is a fixed triple `(repo @ commit, instruction, success_oracle)` run by a coding agent, where the agent's only variable input across the experiment is **whether the OCH pack is in its context**. + +- **`repo @ commit`** — a pinned checkout. Frozen so the *only* variable is the pack. +- **`instruction`** — a natural-language ask given verbatim to the agent every run (e.g. "Add a `--json` flag to `codehub status` that prints the staleness record"). Stored as a string in the task file; never paraphrased between runs. +- **`success_oracle`** — how a run is scored. Three oracle types, in increasing cost: + 1. **`output_hash`** (cheapest, no scoring agent) — variance = how often the agent's *final answer text / produced diff* differs across N runs. This measures raw output dispersion. Good for "did the agent converge". + 2. **`assertion`** — a deterministic check the run either passes or fails (a test command exit code, a grep for a required symbol, a file-exists check). Variance = pass-rate dispersion (e.g. 6/10 with pack vs 3/10 without). This is the most defensible "variance" because it's objective. + 3. **`judge`** (most expensive) — an LLM-judge panel scores each run's answer 0–1 on a rubric; variance = stddev of the scores. Reserved for tasks with no mechanical oracle. + +**What "variance" means precisely:** for a task run N times in each arm (with-pack / without-pack), variance is a per-arm dispersion statistic over the N outcomes: +- `output_hash` oracle → **distinct-output ratio** = `(# distinct outputs) / N` (1.0 = every run different, 1/N = perfectly stable). +- `assertion` oracle → **failure-rate stddev** across N, or simply pass-rate (a stabler pass-rate IS lower behavioral variance). +- `judge` oracle → **stddev of rubric scores** across N. + +The probe reports each arm's dispersion and the **delta** (without − with). The Move-2 claim holds when the with-pack arm's dispersion is materially lower. + +## 3. The experiment design (with / without) + +``` +for arm in [without_pack, with_pack]: + for i in 1..N: # N = --runs, default 10 + fresh agent session (no carryover) + context = instruction + + (arm == with_pack ? the OCH code-pack for repo@commit : nothing) + run agent → capture (final_text, diff, oracle_result, tokens) + dispersion[arm] = oracleDispersion(results[arm]) +report { without: dispersion[without_pack], with: dispersion[with_pack], + delta, tokenOverhead: tokens[with]/tokens[without] } +``` + +Controls that make the number honest: +- **Same instruction, same commit, same agent, same model** across both arms — the pack is the only manipulated variable. +- **Fresh session per run** — no conversational carryover contaminating variance. +- **Token overhead reported alongside** — the paper's claim is "halves variance at ~10% more tokens"; a probe that halves variance at 3× tokens is a different (worse) story, so cost is a first-class output, not a footnote. +- **Determinism of the probe itself** — temperature/seed are pinned where the harness allows; where it doesn't (most agents are nondeterministic by design), that nondeterminism IS the variance being measured, so it's left free *within* an arm but identical *between* arms. + +## 4. Harness: how do we actually run the agent? (Laith's omnigent suggestion) + +Grounded against the real repo (`omnigent-ai/omnigent`, 5,494★, Apache-2.0, Python 3.12, pushed 2026-06-30, **status: alpha**): + +Omnigent is a **meta-harness** — one orchestration layer over Claude Code, Codex, Cursor, OpenCode, Kimi, and custom agents. An "agent" is a YAML (`executor.harness` + `prompt`); you run it with `omnigent run --harness `, and a `sdks/python-client` drives it programmatically. **This is exactly the shape the probe needs**: one task definition, swap the harness flag, run headless, capture output. + +**Why omnigent is the right call for this probe:** +- The variance story is far stronger if it holds **across agents** (Claude Code AND Codex), not just one — "the pack halves variance regardless of which agent reads it" is the defensible, agent-neutral claim. Omnigent is the only thing that drives both from one interface without per-agent glue. +- It already solves the headless-run + output-capture + sandbox problem the probe would otherwise hand-roll. + +**The tradeoffs I'm not hiding:** +- **Alpha.** Pinning a specific commit/release is mandatory; its API may move. The probe must tolerate that (thin adapter, see §5). +- **Heavy dependency.** It pulls a server + sandbox providers. The probe should treat omnigent as an **optional, pluggable runner behind an interface**, NOT a hard dependency of `@opencodehub/eval`. Default to it; allow a simpler direct-CLI runner. +- **Credentials live in each harness's own login** (`claude` / `codex` CLI auth), not OCH config — fine for a local probe, a documented prerequisite. + +**Recommendation:** define a small `AgentRunner` interface (`run(task, withPack) → {finalText, diff, tokens}`); ship an **omnigent-backed runner** as the default multi-agent implementation and a **direct-CLI runner** (shell out to `claude -p` / `codex exec`) as the dependency-light fallback. The probe logic is harness-agnostic; the runner is swappable. This keeps an alpha dep from being load-bearing while getting the cross-agent story omnigent uniquely enables. + +## 5. Where it lives + shape + +- **`packages/eval`** — currently a stub (README only). This is its first real content: `@opencodehub/eval` gains the variance-probe core (task loading, the experiment loop, dispersion stats, the `AgentRunner` interface + the two runner impls). +- **CLI surface:** `codehub pack --variance-probe [--runs N] [--harness ] [--runner omnigent|cli]` → prints the per-arm dispersion + delta + token overhead; `--json` emits the full record. +- **Task file:** a small YAML/JSON: `{ repo, commit, instruction, oracle: {type, ...}, harness? }`. +- **Determinism of the probe's own output:** the *report* is a pure function of the captured run outcomes; no clock/run-id in the emitted record (same discipline as the context-bom), so two probe runs over the same captured outcomes serialize identically. (The agent runs themselves are nondeterministic by nature — that's the point.) + +## 6. EARS requirements (draft — for review, not yet final) + +- **R1** WHEN given a task file `(repo, commit, instruction, oracle)`, the probe SHALL run the agent N times (default 10) in each of the with-pack and without-pack arms, holding commit/instruction/agent/model fixed. +- **R2** The probe SHALL compute a per-arm dispersion statistic appropriate to the oracle type (distinct-output ratio / pass-rate stddev / judge-score stddev) and report the without−with delta. +- **R3** The probe SHALL report token overhead (with-pack tokens / without-pack tokens) alongside the variance delta — variance reduction is only meaningful against its cost. +- **R4** The agent runner SHALL be an interface with at least two implementations: an omnigent-backed multi-agent runner (default) and a direct-CLI runner (no omnigent dependency). +- **R5** WHERE omnigent (alpha) is unavailable or its API has drifted, the probe SHALL fall back to / be usable via the direct-CLI runner without code change to the probe core. +- **R6** The emitted `--json` report SHALL be a pure function of the captured run outcomes (no wall-clock/run-id), so the report serialization is reproducible given the same outcomes. +- **R7** The probe SHALL pin the omnigent version it was validated against and surface a clear error if the installed version mismatches. + +## 7. Open questions for Laith (review these — don't want to build past them) + +1. **Default oracle.** I lean `assertion` (objective, defensible) as the documented default, with `output_hash` as the zero-config quick look and `judge` opt-in. Agree, or do you want `judge` front-and-center for the marketing number? +2. **N (runs per arm).** 10 is the smallest N that gives a believable dispersion; 20+ is more defensible but doubles agent cost (real $ on Claude/Codex). Default 10, configurable? +3. **Which agents for the headline.** Claude Code + Codex both (the agent-neutral claim) — or is one enough for v1 and the second a follow-up? +4. **Token-overhead guardrail.** Should the probe *flag* (not fail) when variance drops but token overhead exceeds, say, 1.3× — i.e. "you bought stability expensively"? I think yes; it keeps the claim honest. +5. **Omnigent now, or CLI-runner first?** Given omnigent is alpha, a defensible v1 is: ship the probe core + the direct-CLI runner first (fast, dependency-light, proves the method), add the omnigent runner second (unlocks the cross-agent story). Or do you want omnigent in v1 because the multi-agent claim IS the point? + +## 8. What this is NOT (scope guard) + +- Not a SWE-bench-style correctness benchmark — it measures *dispersion*, not capability. (SWE-Explore / SWE-bench remain the publish-against targets, separate.) +- Not the `replay`/decision-equivalence structural check (spec 011, the other half of Move 6). +- Not a CI gate — it's an on-demand measurement (agent runs cost real money + minutes). From 38820812eb4060f9f0aa14bf5ed4baaaa76e68cf Mon Sep 17 00:00:00 2001 From: Laith Al-Saadoon Date: Tue, 30 Jun 2026 04:26:48 +0000 Subject: [PATCH 2/2] =?UTF-8?q?feat(eval):=20pack=20--variance-probe=20?= =?UTF-8?q?=E2=80=94=20measure=20the=20variance=20an=20OCH=20pack=20remove?= =?UTF-8?q?s?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implements spec 010 (Move 2): the empirical instrument behind the decision-equivalence contract (Move 6). Given a task triple (repo @ commit, instruction, success_oracle), the probe runs a coding agent N times (default 10) per arm — with vs. without the OCH pack in context — and reports the run-to-run dispersion delta plus token overhead. @opencodehub/eval (new, private, dep-light — pure JS + node:child_process, honoring the package's "ships free of test-time deps" intent; force-bundled into the CLI tarball via tsup noExternal): - task loader (YAML/JSON + Zod, strict, fail-fast) - dispersion stats (distinct-output ratio / Bernoulli pass-rate stddev / judge-score stddev) — pure, exhaustively unit-covered - oracle scoring (output_hash | assertion | judge) - AgentRunner interface + the v1 direct-CLI runner - deterministic report (canonicalJson, no clock/run-id — R6) Direct-CLI runner routes BOTH agents' inference through Amazon Bedrock (spec 010 §4a, grounded against current docs, not recalled): - Claude Code: CLAUDE_CODE_USE_BEDROCK=1 + us.-prefixed ANTHROPIC_MODEL inference profile; claude -p ... --output-format json - Codex: codex exec --json -c model_provider=amazon-bedrock -m ... CLI: codehub code-pack --variance-probe [--runs N] [--harness claude|codex] [--aws-region R] [--model ID] [--json]. The command generates the pack once, assembles it into packContext, and runs the with/without experiment. On-demand only — never a CI gate (§8). omnigent-backed multi-agent runner deferred to v2 behind the same interface (CLI-first, per the approved spec). Adds `eval` to the commitlint scope-enum (new workspace package). --- .erpaval/specs/010-variance-probe/spec.md | 42 ++- commitlint.config.mjs | 1 + packages/cli/package.json | 3 +- .../cli/src/commands/variance-probe.test.ts | 130 ++++++++ packages/cli/src/commands/variance-probe.ts | 134 ++++++++ packages/cli/src/index.ts | 56 +++- packages/cli/tsconfig.json | 1 + packages/eval/README.md | 54 +++- packages/eval/examples/variance-task.yaml | 34 ++ packages/eval/package.json | 67 ++++ packages/eval/src/cli-runner.test.ts | 231 ++++++++++++++ packages/eval/src/cli-runner.ts | 292 ++++++++++++++++++ packages/eval/src/dispersion.test.ts | 120 +++++++ packages/eval/src/dispersion.ts | 99 ++++++ packages/eval/src/index.ts | 66 ++++ packages/eval/src/oracle.test.ts | 150 +++++++++ packages/eval/src/oracle.ts | 139 +++++++++ packages/eval/src/probe.test.ts | 127 ++++++++ packages/eval/src/probe.ts | 153 +++++++++ packages/eval/src/report.test.ts | 121 ++++++++ packages/eval/src/report.ts | 145 +++++++++ packages/eval/src/runner.ts | 86 ++++++ packages/eval/src/task.test.ts | 116 +++++++ packages/eval/src/task.ts | 160 ++++++++++ packages/eval/tsconfig.json | 10 + pnpm-lock.yaml | 30 +- tsconfig.json | 1 + 27 files changed, 2551 insertions(+), 17 deletions(-) create mode 100644 packages/cli/src/commands/variance-probe.test.ts create mode 100644 packages/cli/src/commands/variance-probe.ts create mode 100644 packages/eval/examples/variance-task.yaml create mode 100644 packages/eval/package.json create mode 100644 packages/eval/src/cli-runner.test.ts create mode 100644 packages/eval/src/cli-runner.ts create mode 100644 packages/eval/src/dispersion.test.ts create mode 100644 packages/eval/src/dispersion.ts create mode 100644 packages/eval/src/index.ts create mode 100644 packages/eval/src/oracle.test.ts create mode 100644 packages/eval/src/oracle.ts create mode 100644 packages/eval/src/probe.test.ts create mode 100644 packages/eval/src/probe.ts create mode 100644 packages/eval/src/report.test.ts create mode 100644 packages/eval/src/report.ts create mode 100644 packages/eval/src/runner.ts create mode 100644 packages/eval/src/task.test.ts create mode 100644 packages/eval/src/task.ts create mode 100644 packages/eval/tsconfig.json diff --git a/.erpaval/specs/010-variance-probe/spec.md b/.erpaval/specs/010-variance-probe/spec.md index 68c03c0..8bed539 100644 --- a/.erpaval/specs/010-variance-probe/spec.md +++ b/.erpaval/specs/010-variance-probe/spec.md @@ -1,11 +1,15 @@ # Spec 010 — `pack --variance-probe`: measure the variance an OCH pack removes -**Status:** Draft for review (NO code yet — Laith wants to understand deeply first) +**Status:** Approved — building (CLI-runner-first; see §7 locked answers). **Author:** Bonk + Laith · **Date:** 2026-06-30 **Branch:** `spec/010-variance-probe` (off `main` @ `6b4d122`) **Roadmap origin:** M-W-F run 2026-06-29, Move 2 (HIGH). Depends on the Move 6 ruling (below). **Grounding validator:** arXiv:2606.26979 — deterministic anchoring "roughly halves run-to-run variance at ~10% more tokens." +**Decision log (2026-06-30, Laith):** "all looks good. agree on 5. cli first. just remember for claude code and codex to use bedrock for inference. go forth." → §7 answers locked; v1 ships the direct-CLI runner; omnigent runner deferred to v2; **Claude Code + Codex inference MUST route through Amazon Bedrock** (new §4a hard constraint). + +**CLI-surface note:** the OCH code-pack command in this repo is `codehub code-pack` (the bare `codehub pack` is the legacy repomix snapshot). The probe attaches as `codehub code-pack --variance-probe `. The spec text says `pack --variance-probe` as shorthand; the implementation wires onto `code-pack`. + --- ## 0. The Move 6 ruling this spec is built on (decision-equivalence) @@ -82,6 +86,27 @@ Omnigent is a **meta-harness** — one orchestration layer over Claude Code, Cod **Recommendation:** define a small `AgentRunner` interface (`run(task, withPack) → {finalText, diff, tokens}`); ship an **omnigent-backed runner** as the default multi-agent implementation and a **direct-CLI runner** (shell out to `claude -p` / `codex exec`) as the dependency-light fallback. The probe logic is harness-agnostic; the runner is swappable. This keeps an alpha dep from being load-bearing while getting the cross-agent story omnigent uniquely enables. +**v1 decision (Laith, 2026-06-30):** ship the **direct-CLI runner first**. omnigent (alpha) is deferred to v2. The `AgentRunner` interface is defined now so the omnigent runner drops in later without touching the probe core. + +## 4a. Bedrock inference constraint (hard requirement, Laith 2026-06-30) + +Both agents the probe drives **must run inference on Amazon Bedrock**, not the Anthropic/OpenAI first-party APIs. This is grounded against current docs (`code.claude.com/docs`, `developers.openai.com/codex`), not recalled. + +**Claude Code → Bedrock** — the runner sets these in the child process env before `claude -p`: +- `CLAUDE_CODE_USE_BEDROCK=1` — route Claude Code through Bedrock. +- `AWS_REGION` — e.g. `us-east-1`; credentials resolve via the default AWS SDK chain (env keys / `AWS_PROFILE` / SSO / `AWS_BEARER_TOKEN_BEDROCK`). The probe inherits the operator's AWS env; it does not manage creds. +- `ANTHROPIC_MODEL` — a **cross-region inference-profile ID** (the `us.` prefix is required for on-demand throughput), e.g. `us.anthropic.claude-sonnet-4-6`. Configurable per task/run; defaults to a sonnet inference profile. +- Headless invocation: `claude -p "" --output-format json --model `. +- Token usage read from the single JSON result object: `.usage.input_tokens`, `.usage.output_tokens`, `.total_cost_usd`; final answer at `.result`. +- **No temperature/seed flag exists** in the Claude Code CLI (verified). This is *consistent with* the probe design (§3): within-arm sampling nondeterminism is the variance being measured, held free within an arm and identical between arms. + +**Codex → Bedrock** — Codex ships a first-party `amazon-bedrock` provider (OpenAI-compatible "Bedrock Mantle" surface, AWS-native SigV4 / bearer-token auth). The runner invokes: +- `codex exec --json -c model_provider=amazon-bedrock -m --skip-git-repo-check ""` (model e.g. `openai.gpt-5.5`). +- Auth: `AWS_BEARER_TOKEN_BEDROCK` + `AWS_REGION`, or the AWS SDK credential chain (`AWS_PROFILE` / env keys). Commercial AWS regions only (GovCloud unsupported). +- Output is JSONL: final answer = last `item.completed` of `type:"agent_message"` (its `.text`); token usage = last `turn.completed.usage` = `{input_tokens, cached_input_tokens, output_tokens, reasoning_output_tokens}` (no `total` field — the runner sums them). + +The runner centralizes this Bedrock env/flag wiring per agent so the probe core stays inference-backend-agnostic; a future omnigent runner sets the same env on whichever harness it spawns. + ## 5. Where it lives + shape - **`packages/eval`** — currently a stub (README only). This is its first real content: `@opencodehub/eval` gains the variance-probe core (task loading, the experiment loop, dispersion stats, the `AgentRunner` interface + the two runner impls). @@ -97,15 +122,16 @@ Omnigent is a **meta-harness** — one orchestration layer over Claude Code, Cod - **R4** The agent runner SHALL be an interface with at least two implementations: an omnigent-backed multi-agent runner (default) and a direct-CLI runner (no omnigent dependency). - **R5** WHERE omnigent (alpha) is unavailable or its API has drifted, the probe SHALL fall back to / be usable via the direct-CLI runner without code change to the probe core. - **R6** The emitted `--json` report SHALL be a pure function of the captured run outcomes (no wall-clock/run-id), so the report serialization is reproducible given the same outcomes. -- **R7** The probe SHALL pin the omnigent version it was validated against and surface a clear error if the installed version mismatches. +- **R7** The probe SHALL pin the omnigent version it was validated against and surface a clear error if the installed version mismatches. *(Deferred with the omnigent runner to v2; the version-pin guard ships when that runner does.)* +- **R8** The direct-CLI runner SHALL route both agents' inference through Amazon Bedrock (§4a): for Claude Code it SHALL set `CLAUDE_CODE_USE_BEDROCK=1`, `AWS_REGION`, and a `us.`-prefixed inference-profile `ANTHROPIC_MODEL`; for Codex it SHALL pass `-c model_provider=amazon-bedrock`. The runner SHALL surface a clear, actionable error when the agent binary is absent or Bedrock auth/region is unconfigured, rather than silently falling back to a first-party API. -## 7. Open questions for Laith (review these — don't want to build past them) +## 7. Decisions (locked 2026-06-30 — answers to the prior open questions) -1. **Default oracle.** I lean `assertion` (objective, defensible) as the documented default, with `output_hash` as the zero-config quick look and `judge` opt-in. Agree, or do you want `judge` front-and-center for the marketing number? -2. **N (runs per arm).** 10 is the smallest N that gives a believable dispersion; 20+ is more defensible but doubles agent cost (real $ on Claude/Codex). Default 10, configurable? -3. **Which agents for the headline.** Claude Code + Codex both (the agent-neutral claim) — or is one enough for v1 and the second a follow-up? -4. **Token-overhead guardrail.** Should the probe *flag* (not fail) when variance drops but token overhead exceeds, say, 1.3× — i.e. "you bought stability expensively"? I think yes; it keeps the claim honest. -5. **Omnigent now, or CLI-runner first?** Given omnigent is alpha, a defensible v1 is: ship the probe core + the direct-CLI runner first (fast, dependency-light, proves the method), add the omnigent runner second (unlocks the cross-agent story). Or do you want omnigent in v1 because the multi-agent claim IS the point? +1. **Default oracle → `assertion`.** Objective and defensible. `output_hash` is the zero-config quick look; `judge` is opt-in for tasks with no mechanical oracle. The headline number is the assertion pass-rate dispersion delta. +2. **N (runs per arm) → 10, configurable** via `--runs`. Smallest N that gives a believable dispersion; raise for a more defensible publish at linear agent-cost. +3. **Agents → Claude Code AND Codex.** The agent-neutral claim ("the pack halves variance regardless of which agent reads it") is the point. v1 ships both via the direct-CLI runner; `--harness` selects one or the default runs the configured set. +4. **Token-overhead guardrail → yes, flag (never fail).** When variance drops but token overhead exceeds ~1.3×, the report flags "stability bought expensively." Keeps the claim honest; the threshold is a reported constant, not a gate. +5. **Omnigent vs CLI-first → CLI-runner first.** v1 = probe core + direct-CLI runner (fast, dependency-light, proves the method, Bedrock-wired per §4a). omnigent runner is v2, dropping in behind the same `AgentRunner` interface. ## 8. What this is NOT (scope guard) diff --git a/commitlint.config.mjs b/commitlint.config.mjs index e43175c..29dbda1 100644 --- a/commitlint.config.mjs +++ b/commitlint.config.mjs @@ -42,6 +42,7 @@ export default { "cobol-proleap", "core-types", "embedder", + "eval", "frameworks", "ingestion", "mcp", diff --git a/packages/cli/package.json b/packages/cli/package.json index a2fdaed..c08c503 100644 --- a/packages/cli/package.json +++ b/packages/cli/package.json @@ -39,7 +39,7 @@ "test": "pnpm run build:test && node --test \"./dist-test/**/*.test.js\"", "clean": "rm -rf dist dist-test *.tsbuildinfo" }, - "//deps": "The 14 @opencodehub/* workspace libs are INLINED into the bundle at build time (tsup noExternal) — they are devDependencies, not runtime deps. `dependencies` below is exactly the third-party set the bundle imports at runtime (kept `external`), plus the two @sourcegraph/scip-* indexers the parse pipeline spawns as subprocesses. onnxruntime-web (prebuilt WASM, no native binding) is optional (lazy-loaded only when embeddings are enabled).", + "//deps": "The 15 @opencodehub/* workspace libs are INLINED into the bundle at build time (tsup noExternal) — they are devDependencies, not runtime deps. `dependencies` below is exactly the third-party set the bundle imports at runtime (kept `external`), plus the two @sourcegraph/scip-* indexers the parse pipeline spawns as subprocesses. onnxruntime-web (prebuilt WASM, no native binding) is optional (lazy-loaded only when embeddings are enabled).", "dependencies": { "@apidevtools/swagger-parser": "12.1.0", "@aws-sdk/client-bedrock-runtime": "3.1075.0", @@ -70,6 +70,7 @@ "@opencodehub/analysis": "workspace:*", "@opencodehub/core-types": "workspace:*", "@opencodehub/embedder": "workspace:*", + "@opencodehub/eval": "workspace:*", "@opencodehub/ingestion": "workspace:*", "@opencodehub/mcp": "workspace:*", "@opencodehub/pack": "workspace:*", diff --git a/packages/cli/src/commands/variance-probe.test.ts b/packages/cli/src/commands/variance-probe.test.ts new file mode 100644 index 0000000..2cbf9ed --- /dev/null +++ b/packages/cli/src/commands/variance-probe.test.ts @@ -0,0 +1,130 @@ +/** + * Tests for `runVarianceProbe` (the `codehub code-pack --variance-probe` + * handler) and `assemblePackContext`. + * + * Strategy: inject the `_assemblePackContext` + `_runnerFor` test seams so the + * probe runs against a fake agent and a stub pack context — no real analyzed + * repo, no `claude`/`codex` spawn, no Bedrock. `assemblePackContext` is tested + * against a real on-disk pack directory. + */ + +import { strict as assert } from "node:assert"; +import { mkdtemp, rm, writeFile } from "node:fs/promises"; +import { tmpdir } from "node:os"; +import { join } from "node:path"; +import { after, before, describe, it } from "node:test"; +import type { AgentRunner, Harness, RunOutcome, RunRequest } from "@opencodehub/eval"; +import { assemblePackContext, runVarianceProbe } from "./variance-probe.js"; + +/** A fake runner: stable answer with-pack, distinct answer per run without. */ +class FakeRunner implements AgentRunner { + readonly name: string; + private i = 0; + constructor(harness: Harness) { + this.name = `fake:${harness}`; + } + run(request: RunRequest): Promise { + this.i += 1; + return Promise.resolve({ + finalText: request.withPack ? "stable" : `wander-${this.i}`, + diff: "", + tokens: { inputTokens: request.withPack ? 110 : 100, outputTokens: 10, costUsd: null }, + errored: false, + }); + } +} + +describe("runVarianceProbe (seamed)", () => { + let dir: string; + let taskFile: string; + before(async () => { + dir = await mkdtemp(join(tmpdir(), "och-vp-cmd-")); + taskFile = join(dir, "task.yaml"); + await writeFile( + taskFile, + [ + "id: cmd-task", + `repo: ${dir}`, + "commit: abc", + "instruction: Add a flag.", + "oracle:", + " type: output_hash", + "", + ].join("\n"), + "utf8", + ); + }); + after(async () => { + await rm(dir, { recursive: true, force: true }); + }); + + it("loads the task, injects the stub pack, and reports a positive delta", async () => { + const report = await runVarianceProbe({ + taskFile, + runs: 3, + harness: "claude", + _assemblePackContext: async () => "STUB PACK CONTEXT", + _runnerFor: (h) => new FakeRunner(h), + }); + assert.equal(report.taskId, "cmd-task"); + assert.equal(report.harnesses.length, 1); + const h = report.harnesses[0]; + assert.ok(h !== undefined); + assert.equal(h.harness, "claude"); + assert.ok(h.dispersionDelta > 0, "the stabilizing pack drives a positive delta"); + }); + + it("passes the assembled pack context into the with-pack arm", async () => { + let withPackPrompts = 0; + const capturing: AgentRunner = { + name: "capture", + run(req: RunRequest): Promise { + if (req.withPack) { + assert.equal(req.packContext, "ASSEMBLED-PACK"); + withPackPrompts += 1; + } else { + assert.equal(req.packContext, undefined, "no pack in the without arm"); + } + return Promise.resolve({ + finalText: req.withPack ? "s" : `${Math.random()}`, + diff: "", + tokens: { inputTokens: 1, outputTokens: 1, costUsd: null }, + errored: false, + }); + }, + }; + await runVarianceProbe({ + taskFile, + runs: 2, + harness: "codex", + _assemblePackContext: async () => "ASSEMBLED-PACK", + _runnerFor: () => capturing, + }); + assert.equal(withPackPrompts, 2, "both with-pack runs saw the assembled context"); + }); +}); + +describe("assemblePackContext", () => { + let packDir: string; + before(async () => { + packDir = await mkdtemp(join(tmpdir(), "och-vp-pack-")); + await writeFile(join(packDir, "readme.md"), "# Pack readme\nhello", "utf8"); + await writeFile(join(packDir, "skeleton.jsonl"), '{"sym":"foo"}', "utf8"); + await writeFile(join(packDir, "manifest.json"), '{"packHash":"x"}', "utf8"); + await writeFile(join(packDir, "context-bom.json"), '{"components":[]}', "utf8"); + }); + after(async () => { + await rm(packDir, { recursive: true, force: true }); + }); + + it("includes the body files in sorted order and excludes manifest + context-bom", async () => { + const ctx = await assemblePackContext(packDir); + assert.ok(ctx.includes("### readme.md")); + assert.ok(ctx.includes("### skeleton.jsonl")); + assert.ok(ctx.includes("# Pack readme")); + assert.ok(!ctx.includes("manifest.json"), "manifest excluded (provenance, not content)"); + assert.ok(!ctx.includes("context-bom.json"), "context-bom excluded"); + // sorted: readme.md before skeleton.jsonl + assert.ok(ctx.indexOf("### readme.md") < ctx.indexOf("### skeleton.jsonl")); + }); +}); diff --git a/packages/cli/src/commands/variance-probe.ts b/packages/cli/src/commands/variance-probe.ts new file mode 100644 index 0000000..e439e78 --- /dev/null +++ b/packages/cli/src/commands/variance-probe.ts @@ -0,0 +1,134 @@ +/** + * `codehub code-pack --variance-probe ` — measure the run-to-run + * answer variance an OCH pack removes from a coding agent (spec 010 / Move 2). + * + * Flow: + * 1. Load + validate the task file (`@opencodehub/eval`'s `loadTask`). + * 2. Generate the OCH code-pack for the task's repo (reuses `runCodePack`), + * then assemble its on-disk artifacts into a single `packContext` string — + * what the with-pack arm injects into the agent's context. + * 3. Run the with/without experiment via the Bedrock-wired direct-CLI runner + * (`@opencodehub/eval`'s `CliAgentRunner`), N times per arm per harness. + * 4. Emit the report — human summary to stderr, `--json` to stdout. + * + * `console.log` to stdout is sanctioned in command modules (biome override); + * the JSON report goes to stdout so it pipes cleanly, the human summary to + * stderr so it never pollutes a piped stdout (the context-bom discipline). + * + * The probe is on-demand and costs real agent minutes + Bedrock spend — it is + * never a CI gate (spec 010 §8). + */ + +import { readdir, readFile } from "node:fs/promises"; +import { join } from "node:path"; +import { + type AgentRunner, + CliAgentRunner, + formatReport, + type Harness, + loadTask, + type ProbeOptions, + runProbe, + serializeReport, + type VarianceReport, +} from "@opencodehub/eval"; +import { runCodePack } from "./code-pack.js"; + +export interface VarianceProbeArgs { + /** Path to the task file (YAML or JSON). */ + readonly taskFile: string; + /** Runs per arm. Defaults to the probe's DEFAULT_RUNS (10). */ + readonly runs?: number; + /** Restrict to one harness; omitted runs the task's set (default both). */ + readonly harness?: Harness; + /** AWS region for Bedrock inference; falls back to the inherited env. */ + readonly awsRegion?: string; + /** Bedrock model / inference profile override (applies to every harness). */ + readonly model?: string; + /** + * Test seam — inject a fake pack-context assembler so unit tests don't need a + * real analyzed repo + pack on disk. + */ + readonly _assemblePackContext?: (repo: string) => Promise; + /** + * Test seam — inject a runner factory so unit tests drive a fake agent + * instead of spawning the real `claude` / `codex` CLIs. + */ + readonly _runnerFor?: (harness: Harness) => AgentRunner; +} + +/** + * Assemble the on-disk pack directory into a single context string. Reads the + * consumer-facing `readme.md` plus the BOM body files (`*.jsonl`, `*.md`) in + * sorted order, so the injected context is deterministic. The manifest + + * context-bom.json are provenance records, not agent-facing content, so they + * are skipped. + */ +export async function assemblePackContext(packOutDir: string): Promise { + const entries = await readdir(packOutDir, { withFileTypes: true }); + const files = entries + .filter((e) => e.isFile()) + .map((e) => e.name) + .filter((n) => n !== "manifest.json" && n !== "context-bom.json") + .sort(); + + const parts: string[] = []; + for (const name of files) { + const body = await readFile(join(packOutDir, name), "utf8"); + parts.push(`### ${name}\n\n${body}`); + } + return parts.join("\n\n"); +} + +/** + * Run the variance probe. Returns the report so callers (and tests) can assert + * on it; the CLI action prints it. + */ +export async function runVarianceProbe(args: VarianceProbeArgs): Promise { + const task = await loadTask(args.taskFile); + + // 1. Generate the OCH pack for the task's repo (requires it to be analyzed). + // Then assemble its artifacts into the context the with-pack arm injects. + const assemble = args._assemblePackContext ?? defaultAssemble; + const packContext = await assemble(task.repo); + + // 2. The runner factory: a Bedrock-wired direct-CLI runner per harness + // (spec 010 §4a). Tests inject a fake via `_runnerFor`. + const runnerFor = + args._runnerFor ?? + ((harness: Harness) => + new CliAgentRunner({ + harness, + ...(args.model !== undefined ? { model: args.model } : {}), + ...(args.awsRegion !== undefined ? { awsRegion: args.awsRegion } : {}), + })); + + const options: ProbeOptions = { + packContext, + ...(args.runs !== undefined ? { runs: args.runs } : {}), + ...(args.harness !== undefined ? { harnesses: [args.harness] } : {}), + }; + + return runProbe(task, runnerFor, options); +} + +/** + * Production pack-context assembler: generate the pack for `repo` via the same + * `runCodePack` the bare `code-pack` command uses, then read its artifacts. + */ +async function defaultAssemble(repo: string): Promise { + const result = await runCodePack({ repo, engine: "pack" }); + return assemblePackContext(result.outDir); +} + +/** + * Print a {@link VarianceReport}. JSON → stdout (machine consumers / `--json`); + * the human summary → stderr so it never pollutes a piped stdout. + */ +export function printVarianceReport(report: VarianceReport, asJson: boolean): void { + if (asJson) { + console.log(serializeReport(report)); + } else { + console.warn(formatReport(report)); + } +} diff --git a/packages/cli/src/index.ts b/packages/cli/src/index.ts index b762ea7..3b4d6f1 100644 --- a/packages/cli/src/index.ts +++ b/packages/cli/src/index.ts @@ -375,8 +375,49 @@ program "After packing, print a summary of the context read-receipt (files indexed, lines, " + "hash coverage, per-language breakdown) read from the pack's context-bom.json", ) - .option("--json", "With --explain-context, emit the receipt summary as JSON on stdout") + .option("--json", "With --explain-context or --variance-probe, emit the result as JSON on stdout") + .option( + "--variance-probe ", + "Measure the run-to-run answer variance an OCH pack removes from a coding agent " + + "(spec 010 / Move 2). Loads the task file, generates the pack, runs the agent N times " + + "with vs. without the pack, and reports the dispersion delta + token overhead. Agents run " + + "on Amazon Bedrock. On-demand only — costs real agent minutes + Bedrock spend, never a CI gate.", + ) + .option("--runs ", "With --variance-probe: runs per arm (default 10)", (v) => + Number.parseInt(v, 10), + ) + .option( + "--harness ", + "With --variance-probe: restrict to one agent — claude or codex (default: both)", + ) + .option( + "--aws-region ", + "With --variance-probe: AWS region for Bedrock inference (default: inherited AWS_REGION)", + ) + .option( + "--model ", + "With --variance-probe: Bedrock model / inference-profile id (default per harness)", + ) .action(async (path: string | undefined, opts: Record) => { + // --variance-probe short-circuits the normal pack path: it loads a task, + // generates the pack itself, and runs the with/without experiment. + if (typeof opts["varianceProbe"] === "string") { + const probeMod = await import("./commands/variance-probe.js"); + const harness = parseHarness(opts["harness"]); + const runs = + typeof opts["runs"] === "number" && Number.isFinite(opts["runs"]) + ? opts["runs"] + : undefined; + const report = await probeMod.runVarianceProbe({ + taskFile: opts["varianceProbe"], + ...(runs !== undefined ? { runs } : {}), + ...(harness !== undefined ? { harness } : {}), + ...(typeof opts["awsRegion"] === "string" ? { awsRegion: opts["awsRegion"] } : {}), + ...(typeof opts["model"] === "string" ? { model: opts["model"] } : {}), + }); + probeMod.printVarianceReport(report, opts["json"] === true); + return; + } const mod = await import("./commands/code-pack.js"); const rawEngine = typeof opts["engine"] === "string" ? opts["engine"] : "pack"; const engine: "pack" | "repomix" = @@ -1032,6 +1073,19 @@ function parseQueryGranularity(raw: unknown): "symbol" | "file" | "community" | ); } +/** + * Parse a `--harness` value into the narrow agent set the variance probe + * accepts. Returns `undefined` when the flag was not supplied (the probe then + * runs the task's configured set — both agents by default). An unknown token + * throws so the user sees the typo rather than a silent fallback. + */ +function parseHarness(raw: unknown): "claude" | "codex" | undefined { + if (typeof raw !== "string" || raw.trim() === "") return undefined; + const trimmed = raw.trim(); + if (trimmed === "claude" || trimmed === "codex") return trimmed; + throw new Error(`Unknown --harness value: "${trimmed}". Expected one of: claude, codex`); +} + /** * Parse a comma-separated `--granularity` value into the narrow set of * hierarchical embedding tiers the ingestion phase accepts. Returns diff --git a/packages/cli/tsconfig.json b/packages/cli/tsconfig.json index 44e105a..fb355d3 100644 --- a/packages/cli/tsconfig.json +++ b/packages/cli/tsconfig.json @@ -10,6 +10,7 @@ { "path": "../analysis" }, { "path": "../core-types" }, { "path": "../embedder" }, + { "path": "../eval" }, { "path": "../storage" }, { "path": "../search" }, { "path": "../ingestion" }, diff --git a/packages/eval/README.md b/packages/eval/README.md index f31fa03..4c23e33 100644 --- a/packages/eval/README.md +++ b/packages/eval/README.md @@ -1,7 +1,49 @@ -# @opencodehub/eval — extracted +# @opencodehub/eval — variance probe -The Python retrieval / graph-quality evaluation harness that used to live -here was extracted to the sibling `opencodehub-testbed` repository so the -production package set ships free of test-time dependencies. Any local -`.venv/`, `.pytest_cache/`, `.ruff_cache/`, or `src/` left in this folder -is untracked and gitignored — see `opencodehub-testbed` for the harness. +`@opencodehub/eval` measures the **run-to-run answer variance** a coding agent +shows with vs. without an OpenCodeHub code-pack in its context. It is the +empirical instrument behind the decision-equivalence contract (Move 6): if the +pack genuinely pins the agent's retrieval decision, the agent's answer wanders +less across repeated runs. The probe turns that claim into a number. + +> The Python retrieval / graph-quality harness that *used* to live here was +> extracted to the sibling `opencodehub-testbed` repo so the published package +> set ships free of test-time dependencies. This TypeScript probe honors that +> intent: it is pure JS + `node:child_process` (no heavy runtime deps, no +> Python), and the package is `private`. It is force-bundled into the +> `@opencodehub/cli` tarball (tsup `noExternal`), so it never adds an +> independently published runtime package. + +## What it does + +Given a **task** — a fixed triple `(repo @ commit, instruction, success_oracle)` — +the probe runs a coding agent N times in each of two arms (with-pack / +without-pack), holding commit, instruction, agent, and model fixed. The only +manipulated variable is whether the OCH pack is in context. It then computes a +per-arm dispersion statistic appropriate to the oracle and reports the +`without − with` delta alongside the token overhead. + +| Oracle | Dispersion statistic | Use | +|---|---|---| +| `output_hash` | distinct-output ratio `(# distinct outputs)/N` | zero-config quick look | +| `assertion` (default) | pass-rate + failure-rate stddev across N | objective, defensible headline | +| `judge` | stddev of LLM-panel rubric scores | tasks with no mechanical oracle | + +## Inference backend + +The direct-CLI runner drives `claude -p` (Claude Code) and `codex exec` +(Codex), and **both route inference through Amazon Bedrock** (spec 010 §4a): +Claude Code via `CLAUDE_CODE_USE_BEDROCK=1` + a `us.`-prefixed +`ANTHROPIC_MODEL` inference profile; Codex via its first-party +`amazon-bedrock` provider. AWS credentials and region are inherited from the +operator's environment. + +## Determinism of the probe's own output + +The agent runs are nondeterministic by nature — that nondeterminism is exactly +what's being measured. The probe's *report*, by contrast, is a pure function of +the captured run outcomes: no wall-clock, no run-id. Two probe runs over the +same captured outcomes serialize byte-identically (same discipline as the +context-bom). + +See `.erpaval/specs/010-variance-probe/spec.md` for the full design. diff --git a/packages/eval/examples/variance-task.yaml b/packages/eval/examples/variance-task.yaml new file mode 100644 index 0000000..20b26d4 --- /dev/null +++ b/packages/eval/examples/variance-task.yaml @@ -0,0 +1,34 @@ +# Example variance-probe task (spec 010 / Move 2). +# +# Run with: +# codehub code-pack --variance-probe packages/eval/examples/variance-task.yaml +# +# The probe generates the OCH pack for `repo`, then runs the agent N times +# (default 10) per arm — with vs. without the pack in context — and reports the +# dispersion delta + token overhead. Agents run on Amazon Bedrock; the probe +# inherits your AWS region + credentials from the environment. + +# Human-facing id, used to label the report. +id: add-json-flag-to-status + +# A pinned checkout. Frozen so the pack is the only variable across runs. +repo: /abs/path/to/your/repo +commit: 0000000000000000000000000000000000000000 + +# The natural-language ask, given verbatim to the agent every run. +instruction: > + Add a --json flag to `codehub status` that prints the staleness record as + JSON on stdout. Keep the existing human-readable output as the default. + +# How a run is scored. The default oracle is `assertion` — objective and +# defensible. Exit code 0 = pass; the probe reports pass-rate dispersion across +# the N runs in each arm. +oracle: + type: assertion + command: pnpm --filter @opencodehub/cli build:test && node --test "./packages/cli/dist-test/commands/status.test.js" + timeoutMs: 300000 + +# Optional: pin one agent. Omit to run the default set (claude + codex), which +# gives the agent-neutral claim ("the pack halves variance regardless of which +# agent reads it"). +# harness: claude diff --git a/packages/eval/package.json b/packages/eval/package.json new file mode 100644 index 0000000..0ab2989 --- /dev/null +++ b/packages/eval/package.json @@ -0,0 +1,67 @@ +{ + "name": "@opencodehub/eval", + "version": "0.1.0", + "private": true, + "description": "OpenCodeHub — variance probe: measure the run-to-run answer variance an OCH pack removes from a coding agent", + "license": "Apache-2.0", + "repository": { + "type": "git", + "url": "git+https://github.com/theagenticguy/opencodehub.git", + "directory": "packages/eval" + }, + "homepage": "https://github.com/theagenticguy/opencodehub#readme", + "bugs": { + "url": "https://github.com/theagenticguy/opencodehub/issues" + }, + "type": "module", + "main": "./dist/index.js", + "types": "./dist/index.d.ts", + "exports": { + ".": { + "types": "./dist/index.d.ts", + "import": "./dist/index.js" + } + }, + "files": [ + "dist/**/*.js", + "!dist/**/*.test.js", + "dist/**/*.d.ts", + "!dist/**/*.test.d.ts", + "dist/**/*.js.map", + "!dist/**/*.test.js.map", + "dist/**/*.d.ts.map", + "!dist/**/*.test.d.ts.map" + ], + "scripts": { + "build": "tsc -b", + "test": "node --test \"./dist/**/*.test.js\"", + "clean": "rm -rf dist *.tsbuildinfo" + }, + "dependencies": { + "@opencodehub/core-types": "workspace:*", + "yaml": "2.9.0", + "zod": "4.4.3" + }, + "devDependencies": { + "@types/node": "26.0.1", + "typescript": "6.0.3" + }, + "publishConfig": { + "access": "public" + }, + "keywords": [ + "opencodehub", + "code-intelligence", + "mcp", + "model-context-protocol", + "ai", + "code-graph", + "variance", + "determinism", + "eval", + "agent-eval" + ], + "engines": { + "node": ">=24.15.0" + } +} diff --git a/packages/eval/src/cli-runner.test.ts b/packages/eval/src/cli-runner.test.ts new file mode 100644 index 0000000..08e744e --- /dev/null +++ b/packages/eval/src/cli-runner.test.ts @@ -0,0 +1,231 @@ +import { strict as assert } from "node:assert"; +import { describe, it } from "node:test"; +import { + buildAgentEnv, + buildArgv, + CliAgentRunner, + composePrompt, + DEFAULT_CLAUDE_MODEL, + DEFAULT_CODEX_MODEL, + parseClaudeOutput, + parseCodexOutput, + type SpawnFn, +} from "./cli-runner.js"; +import type { RunRequest } from "./runner.js"; +import type { Task } from "./task.js"; + +const TASK: Task = { + id: "t", + repo: "/tmp/repo", + commit: "deadbeef", + instruction: "Add a --json flag.", + oracle: { type: "output_hash", field: "final_text" }, +}; + +describe("buildAgentEnv — Bedrock wiring (§4a)", () => { + it("sets CLAUDE_CODE_USE_BEDROCK and a us.-profile model for claude", () => { + const env = buildAgentEnv({ harness: "claude", awsRegion: "us-east-1", _baseEnv: {} }); + assert.equal(env["CLAUDE_CODE_USE_BEDROCK"], "1"); + assert.equal(env["ANTHROPIC_MODEL"], DEFAULT_CLAUDE_MODEL); + assert.ok(env["ANTHROPIC_MODEL"]?.startsWith("us."), "inference profile is us.-prefixed"); + assert.equal(env["AWS_REGION"], "us-east-1"); + }); + + it("honors an explicit model override for claude", () => { + const env = buildAgentEnv({ + harness: "claude", + model: "us.anthropic.claude-opus-4-8", + _baseEnv: {}, + }); + assert.equal(env["ANTHROPIC_MODEL"], "us.anthropic.claude-opus-4-8"); + }); + + it("does NOT set Claude Bedrock vars for the codex harness (codex uses argv -c)", () => { + const env = buildAgentEnv({ harness: "codex", awsRegion: "us-west-2", _baseEnv: {} }); + assert.equal(env["CLAUDE_CODE_USE_BEDROCK"], undefined); + assert.equal(env["ANTHROPIC_MODEL"], undefined); + assert.equal(env["AWS_REGION"], "us-west-2"); + }); + + it("falls back to the inherited AWS_REGION / AWS_DEFAULT_REGION", () => { + const env = buildAgentEnv({ harness: "claude", _baseEnv: { AWS_DEFAULT_REGION: "eu-west-1" } }); + assert.equal(env["AWS_REGION"], "eu-west-1"); + }); + + it("preserves inherited credentials (does not strip AWS_PROFILE / keys)", () => { + const env = buildAgentEnv({ + harness: "claude", + _baseEnv: { AWS_PROFILE: "bedrock-dev", AWS_REGION: "us-east-1" }, + }); + assert.equal(env["AWS_PROFILE"], "bedrock-dev"); + }); +}); + +describe("buildArgv", () => { + it("builds the headless claude command with JSON output + model", () => { + const { command, argv } = buildArgv({ harness: "claude" }, "PROMPT"); + assert.equal(command, "claude"); + assert.deepEqual(argv, [ + "-p", + "PROMPT", + "--output-format", + "json", + "--model", + DEFAULT_CLAUDE_MODEL, + ]); + }); + + it("builds the codex exec command routed to the amazon-bedrock provider", () => { + const { command, argv } = buildArgv({ harness: "codex" }, "PROMPT"); + assert.equal(command, "codex"); + assert.deepEqual(argv, [ + "exec", + "--json", + "-c", + "model_provider=amazon-bedrock", + "-m", + DEFAULT_CODEX_MODEL, + "--skip-git-repo-check", + "PROMPT", + ]); + }); +}); + +describe("composePrompt", () => { + it("returns the bare instruction in the without-pack arm", () => { + const req: RunRequest = { task: TASK, harness: "claude", withPack: false }; + assert.equal(composePrompt(req), TASK.instruction); + }); + it("prepends the pack context in the with-pack arm", () => { + const req: RunRequest = { + task: TASK, + harness: "claude", + withPack: true, + packContext: "PACK BODY", + }; + const prompt = composePrompt(req); + assert.ok(prompt.startsWith("PACK BODY")); + assert.ok(prompt.endsWith(TASK.instruction)); + }); + it("ignores an empty pack context (falls back to bare instruction)", () => { + const req: RunRequest = { task: TASK, harness: "claude", withPack: true, packContext: "" }; + assert.equal(composePrompt(req), TASK.instruction); + }); +}); + +describe("parseClaudeOutput", () => { + it("extracts result text + usage + cost from the JSON result object", () => { + const stdout = JSON.stringify({ + type: "result", + subtype: "success", + result: "Done — added the flag.", + usage: { input_tokens: 1234, output_tokens: 56 }, + total_cost_usd: 0.0123, + }); + const { finalText, tokens } = parseClaudeOutput(stdout); + assert.equal(finalText, "Done — added the flag."); + assert.equal(tokens.inputTokens, 1234); + assert.equal(tokens.outputTokens, 56); + assert.equal(tokens.costUsd, 0.0123); + }); + it("tolerates a missing usage block (zeros, null cost)", () => { + const { tokens, finalText } = parseClaudeOutput(JSON.stringify({ result: "x" })); + assert.equal(finalText, "x"); + assert.equal(tokens.inputTokens, 0); + assert.equal(tokens.costUsd, null); + }); + it("throws on unparseable stdout", () => { + assert.throws(() => parseClaudeOutput("not json")); + }); +}); + +describe("parseCodexOutput", () => { + const jsonl = [ + JSON.stringify({ type: "thread.started", thread_id: "x" }), + JSON.stringify({ type: "turn.started" }), + "some progress noise that is not json", + JSON.stringify({ + type: "item.completed", + item: { id: "i1", type: "command_execution", status: "completed" }, + }), + JSON.stringify({ + type: "item.completed", + item: { id: "i2", type: "agent_message", text: "Repo summary here." }, + }), + JSON.stringify({ + type: "turn.completed", + usage: { + input_tokens: 24763, + cached_input_tokens: 24448, + output_tokens: 122, + reasoning_output_tokens: 0, + }, + }), + ].join("\n"); + + it("extracts the final agent_message text and the last turn.completed usage", () => { + const { finalText, tokens } = parseCodexOutput(jsonl); + assert.equal(finalText, "Repo summary here."); + assert.equal(tokens.inputTokens, 24763); + assert.equal(tokens.outputTokens, 122); + assert.equal(tokens.costUsd, null, "codex exposes no per-invocation USD cost"); + }); + + it("returns empty/zeros when no agent_message is present", () => { + const { finalText, tokens } = parseCodexOutput(JSON.stringify({ type: "turn.started" })); + assert.equal(finalText, ""); + assert.equal(tokens.inputTokens, 0); + }); +}); + +describe("CliAgentRunner.run (stubbed spawn)", () => { + it("returns a successful outcome from a claude run", async () => { + const spawnFn: SpawnFn = async ({ command, argv }) => { + assert.equal(command, "claude"); + assert.ok(argv.includes("--output-format")); + return { + stdout: JSON.stringify({ result: "ok", usage: { input_tokens: 10, output_tokens: 2 } }), + stderr: "", + code: 0, + }; + }; + const runner = new CliAgentRunner({ harness: "claude", _spawn: spawnFn, _baseEnv: {} }); + const outcome = await runner.run({ task: TASK, harness: "claude", withPack: false }); + assert.equal(outcome.errored, false); + assert.equal(outcome.finalText, "ok"); + assert.equal(outcome.tokens.inputTokens, 10); + assert.equal(outcome.checkoutPath, TASK.repo); + assert.equal(runner.name, "cli:claude"); + }); + + it("marks the outcome errored on a non-zero exit (no throw)", async () => { + const spawnFn: SpawnFn = async () => ({ stdout: "", stderr: "boom", code: 1 }); + const runner = new CliAgentRunner({ harness: "claude", _spawn: spawnFn, _baseEnv: {} }); + const outcome = await runner.run({ task: TASK, harness: "claude", withPack: false }); + assert.equal(outcome.errored, true); + assert.equal(outcome.tokens.inputTokens, 0); + }); + + it("marks the outcome errored when stdout is unparseable", async () => { + const spawnFn: SpawnFn = async () => ({ stdout: "garbage", stderr: "", code: 0 }); + const runner = new CliAgentRunner({ harness: "claude", _spawn: spawnFn, _baseEnv: {} }); + const outcome = await runner.run({ task: TASK, harness: "claude", withPack: false }); + assert.equal(outcome.errored, true); + }); + + it("injects the pack context into the prompt on the with-pack arm", async () => { + let seenPrompt = ""; + const spawnFn: SpawnFn = async ({ argv }) => { + // claude argv: ["-p", PROMPT, ...] + seenPrompt = argv[1] ?? ""; + return { + stdout: JSON.stringify({ result: "ok", usage: { input_tokens: 1, output_tokens: 1 } }), + stderr: "", + code: 0, + }; + }; + const runner = new CliAgentRunner({ harness: "claude", _spawn: spawnFn, _baseEnv: {} }); + await runner.run({ task: TASK, harness: "claude", withPack: true, packContext: "THE PACK" }); + assert.ok(seenPrompt.startsWith("THE PACK")); + }); +}); diff --git a/packages/eval/src/cli-runner.ts b/packages/eval/src/cli-runner.ts new file mode 100644 index 0000000..fbbf0cf --- /dev/null +++ b/packages/eval/src/cli-runner.ts @@ -0,0 +1,292 @@ +/** + * Direct-CLI runner (spec 010 §4, §4a) — v1's {@link AgentRunner}. + * + * Shells out to `claude -p` (Claude Code) and `codex exec` (Codex) in headless + * mode, with **inference routed through Amazon Bedrock** for both. The Bedrock + * env/flag wiring is grounded against current docs (code.claude.com, + * developers.openai.com/codex), not recalled: + * + * Claude Code → Bedrock: + * env CLAUDE_CODE_USE_BEDROCK=1, AWS_REGION, ANTHROPIC_MODEL= + * cmd claude -p "" --output-format json --model + * out one JSON object: .result, .usage.{input,output}_tokens, .total_cost_usd + * + * Codex → Bedrock (first-party `amazon-bedrock` provider): + * env AWS_REGION (+ AWS_BEARER_TOKEN_BEDROCK or AWS SDK creds) + * cmd codex exec --json -c model_provider=amazon-bedrock -m + * --skip-git-repo-check "" + * out JSONL; final = last item.completed type "agent_message"; tokens = + * last turn.completed.usage (no total — sum input+output) + * + * The pure pieces (env construction, prompt composition, output parsing) are + * exported and unit-tested; the spawn itself is integration-only (needs the + * CLIs + AWS creds, which CI sandboxes lack) and is exercised via the + * `_spawn` seam in tests. + */ + +import { spawn } from "node:child_process"; +import type { AgentRunner, Harness, RunOutcome, RunRequest, RunTokens } from "./runner.js"; + +/** Default Claude Code Bedrock inference profile (us.-prefixed, §4a). */ +export const DEFAULT_CLAUDE_MODEL = "us.anthropic.claude-sonnet-4-6"; +/** Default Codex Bedrock model id (§4a). */ +export const DEFAULT_CODEX_MODEL = "openai.gpt-5.5"; + +export interface CliRunnerConfig { + readonly harness: Harness; + /** + * Bedrock model id / inference profile. Defaults per harness + * ({@link DEFAULT_CLAUDE_MODEL} / {@link DEFAULT_CODEX_MODEL}). + */ + readonly model?: string; + /** + * AWS region for Bedrock. Falls back to the inherited `AWS_REGION` / + * `AWS_DEFAULT_REGION`; an explicit value here overrides it. + */ + readonly awsRegion?: string; + /** + * Test seam — inject a spawn function so unit tests don't shell out to the + * real CLIs. Production leaves this unset. + */ + readonly _spawn?: SpawnFn; + /** + * Test seam — inject the base environment (defaults to `process.env`) so + * tests can assert the Bedrock vars are layered on without depending on the + * host env. + */ + readonly _baseEnv?: NodeJS.ProcessEnv; +} + +/** Minimal spawn result the runner consumes. */ +export interface SpawnResult { + readonly stdout: string; + readonly stderr: string; + /** Process exit code; null on signal termination. */ + readonly code: number | null; +} + +/** Spawn function shape (the `_spawn` seam). */ +export type SpawnFn = (args: { + readonly command: string; + readonly argv: readonly string[]; + readonly env: NodeJS.ProcessEnv; + readonly cwd: string; + readonly stdin: string; +}) => Promise; + +/** + * Build the child-process environment for a harness, layering the Bedrock + * wiring (§4a) over the inherited base env. Pure + exported for tests. + */ +export function buildAgentEnv(config: CliRunnerConfig): NodeJS.ProcessEnv { + const base = config._baseEnv ?? process.env; + const region = config.awsRegion ?? base["AWS_REGION"] ?? base["AWS_DEFAULT_REGION"]; + const env: NodeJS.ProcessEnv = { ...base }; + if (region !== undefined) env["AWS_REGION"] = region; + + if (config.harness === "claude") { + // Claude Code → Bedrock. Credentials resolve via the default AWS SDK chain + // already present in `base` (env keys / AWS_PROFILE / SSO / bearer token); + // we only assert the toggle + model. + env["CLAUDE_CODE_USE_BEDROCK"] = "1"; + env["ANTHROPIC_MODEL"] = config.model ?? DEFAULT_CLAUDE_MODEL; + } + // Codex selects Bedrock via `-c model_provider=amazon-bedrock` on the + // argv (see buildArgv), not via env; its AWS auth is inherited from base. + return env; +} + +/** + * Compose the prompt handed to the agent. The with-pack arm prepends the OCH + * pack context; the without-pack arm is the bare instruction. Pure + exported. + */ +export function composePrompt(request: RunRequest): string { + if (request.withPack && request.packContext !== undefined && request.packContext.length > 0) { + return `${request.packContext}\n\n---\n\n${request.task.instruction}`; + } + return request.task.instruction; +} + +/** Build the argv for a harness invocation. Pure + exported for tests. */ +export function buildArgv( + config: CliRunnerConfig, + prompt: string, +): { command: string; argv: string[] } { + if (config.harness === "claude") { + return { + command: "claude", + argv: [ + "-p", + prompt, + "--output-format", + "json", + "--model", + config.model ?? DEFAULT_CLAUDE_MODEL, + ], + }; + } + // Codex: route to the first-party Bedrock provider via -c, pin the model, + // emit JSONL with --json, and skip the git-repo guard so the probe can run + // the agent against an arbitrary checkout. + return { + command: "codex", + argv: [ + "exec", + "--json", + "-c", + "model_provider=amazon-bedrock", + "-m", + config.model ?? DEFAULT_CODEX_MODEL, + "--skip-git-repo-check", + prompt, + ], + }; +} + +/** + * Parse Claude Code's `--output-format json` single result object into a + * {@link RunOutcome} (minus checkout/errored, filled by the runner). Pure + + * exported. Throws on unparseable output so a malformed run is a hard error, + * not a silent zero-token success. + */ +export function parseClaudeOutput(stdout: string): { finalText: string; tokens: RunTokens } { + const doc = JSON.parse(stdout) as { + result?: unknown; + usage?: { input_tokens?: unknown; output_tokens?: unknown }; + total_cost_usd?: unknown; + }; + const finalText = typeof doc.result === "string" ? doc.result : ""; + const inputTokens = num(doc.usage?.input_tokens); + const outputTokens = num(doc.usage?.output_tokens); + const costUsd = typeof doc.total_cost_usd === "number" ? doc.total_cost_usd : null; + return { finalText, tokens: { inputTokens, outputTokens, costUsd } }; +} + +/** + * Parse Codex's `--json` JSONL stream. Final answer = the last + * `item.completed` whose item type is `agent_message`; tokens = the last + * `turn.completed.usage` (Codex reports no total — we sum input+output). Pure + + * exported. Tolerates interleaved non-JSON lines (progress noise) by skipping + * them. + */ +export function parseCodexOutput(stdout: string): { finalText: string; tokens: RunTokens } { + let finalText = ""; + let inputTokens = 0; + let outputTokens = 0; + for (const line of stdout.split("\n")) { + const trimmed = line.trim(); + if (trimmed.length === 0) continue; + let evt: unknown; + try { + evt = JSON.parse(trimmed); + } catch { + continue; // progress / non-JSON line — skip + } + if (typeof evt !== "object" || evt === null) continue; + const e = evt as { + type?: unknown; + item?: { type?: unknown; text?: unknown }; + usage?: { input_tokens?: unknown; output_tokens?: unknown }; + }; + if (e.type === "item.completed" && e.item?.type === "agent_message") { + if (typeof e.item.text === "string") finalText = e.item.text; + } else if (e.type === "turn.completed" && e.usage !== undefined) { + inputTokens = num(e.usage.input_tokens); + outputTokens = num(e.usage.output_tokens); + } + } + // Codex does not surface a per-invocation USD cost on the public event + // schema, so cost is null (the report tolerates a null-cost arm). + return { finalText, tokens: { inputTokens, outputTokens, costUsd: null } }; +} + +function num(v: unknown): number { + return typeof v === "number" && Number.isFinite(v) ? v : 0; +} + +/** Default production spawn — real `node:child_process.spawn`, buffered. */ +const defaultSpawn: SpawnFn = ({ command, argv, env, cwd, stdin }) => + new Promise((resolve) => { + const child = spawn(command, [...argv], { env, cwd, stdio: ["pipe", "pipe", "pipe"] }); + let stdout = ""; + let stderr = ""; + child.stdout.on("data", (d: Buffer) => { + stdout += d.toString("utf8"); + }); + child.stderr.on("data", (d: Buffer) => { + stderr += d.toString("utf8"); + }); + child.on("error", (err) => { + // Binary missing / not executable — surface as a non-zero result rather + // than rejecting, so the runner can wrap it in an actionable error. + resolve({ stdout, stderr: `${stderr}${String(err)}`, code: 127 }); + }); + child.on("close", (code) => resolve({ stdout, stderr, code })); + if (stdin.length > 0) child.stdin.write(stdin); + child.stdin.end(); + }); + +/** + * The direct-CLI runner. Each `run` spawns a fresh agent process (a fresh + * session, per §3), composes the arm-appropriate prompt, parses the harness's + * structured output, and returns a {@link RunOutcome}. A spawn/exit failure is + * captured as an `errored` outcome rather than throwing, so one crashed run + * doesn't abort the experiment — a pack that reduces crashes is lower variance. + */ +export class CliAgentRunner implements AgentRunner { + readonly name: string; + private readonly config: CliRunnerConfig; + private readonly spawnFn: SpawnFn; + + constructor(config: CliRunnerConfig) { + this.config = config; + this.name = `cli:${config.harness}`; + this.spawnFn = config._spawn ?? defaultSpawn; + } + + async run(request: RunRequest): Promise { + const prompt = composePrompt(request); + const { command, argv } = buildArgv(this.config, prompt); + const env = buildAgentEnv(this.config); + // The checkout the agent works in. v1 runs the agent against the task's + // repo path directly; a future iteration can clone per-run for isolation. + const cwd = request.task.repo; + + let result: SpawnResult; + try { + result = await this.spawnFn({ command, argv, env, cwd, stdin: "" }); + } catch (err) { + return erroredOutcome(cwd, `spawn failed: ${String(err)}`); + } + + if (result.code !== 0) { + return erroredOutcome(cwd, result.stderr || `${command} exited ${String(result.code)}`); + } + + try { + const parsed = + this.config.harness === "claude" + ? parseClaudeOutput(result.stdout) + : parseCodexOutput(result.stdout); + return { + finalText: parsed.finalText, + diff: "", // neither CLI surfaces a structured diff in headless JSON; left empty + tokens: parsed.tokens, + checkoutPath: cwd, + errored: false, + }; + } catch (err) { + return erroredOutcome(cwd, `failed to parse ${command} output: ${String(err)}`); + } + } +} + +function erroredOutcome(checkoutPath: string, finalText: string): RunOutcome { + return { + finalText, + diff: "", + tokens: { inputTokens: 0, outputTokens: 0, costUsd: null }, + checkoutPath, + errored: true, + }; +} diff --git a/packages/eval/src/dispersion.test.ts b/packages/eval/src/dispersion.test.ts new file mode 100644 index 0000000..09557a4 --- /dev/null +++ b/packages/eval/src/dispersion.test.ts @@ -0,0 +1,120 @@ +import { strict as assert } from "node:assert"; +import { describe, it } from "node:test"; +import { + type ArmDispersion, + bernoulliDispersion, + dispersionScalar, + distinctOutputRatio, + mean, + populationStddev, +} from "./dispersion.js"; + +describe("populationStddev", () => { + it("returns 0 for fewer than two values", () => { + assert.equal(populationStddev([]), 0); + assert.equal(populationStddev([5]), 0); + }); + + it("is zero for a constant sample", () => { + assert.equal(populationStddev([3, 3, 3, 3]), 0); + }); + + it("computes the population (not sample) stddev", () => { + // values [0,1] → mean 0.5, variance ((.25)+(.25))/2 = .25, stddev .5 + assert.equal(populationStddev([0, 1]), 0.5); + // [2,4,4,4,5,5,7,9] is the classic example: population stddev = 2 + assert.equal(populationStddev([2, 4, 4, 4, 5, 5, 7, 9]), 2); + }); +}); + +describe("mean", () => { + it("returns 0 for empty input", () => { + assert.equal(mean([]), 0); + }); + it("averages", () => { + assert.equal(mean([1, 2, 3, 4]), 2.5); + }); +}); + +describe("distinctOutputRatio", () => { + it("returns 0 for empty input", () => { + assert.equal(distinctOutputRatio([]), 0); + }); + it("is 1/N when every output is identical (perfectly stable)", () => { + assert.equal(distinctOutputRatio(["a", "a", "a", "a"]), 0.25); + }); + it("is 1.0 when every output differs (maximally unstable)", () => { + assert.equal(distinctOutputRatio(["a", "b", "c"]), 1); + }); + it("counts distinct values", () => { + assert.equal(distinctOutputRatio(["a", "a", "b", "b"]), 0.5); + }); +}); + +describe("bernoulliDispersion", () => { + it("returns zeros for empty input", () => { + assert.deepEqual(bernoulliDispersion([]), { passRate: 0, stddev: 0 }); + }); + it("is zero-dispersion when the agent is perfectly consistent (all pass)", () => { + const d = bernoulliDispersion([true, true, true, true]); + assert.equal(d.passRate, 1); + assert.equal(d.stddev, 0); + }); + it("is zero-dispersion when perfectly consistent (all fail)", () => { + const d = bernoulliDispersion([false, false, false]); + assert.equal(d.passRate, 0); + assert.equal(d.stddev, 0); + }); + it("is maximal at a 50/50 coin-flip agent", () => { + const d = bernoulliDispersion([true, false, true, false]); + assert.equal(d.passRate, 0.5); + assert.equal(d.stddev, 0.5); // sqrt(0.5*0.5) + }); + it("captures the headline example: 6/10 with vs 3/10 without", () => { + const withPack = bernoulliDispersion([ + true, + true, + true, + true, + true, + true, + false, + false, + false, + false, + ]); + const withoutPack = bernoulliDispersion([ + true, + true, + true, + false, + false, + false, + false, + false, + false, + false, + ]); + assert.equal(withPack.passRate, 0.6); + assert.equal(withoutPack.passRate, 0.3); + // without-pack stddev (p=0.3) is higher than the toward-extreme... actually + // 0.6 is closer to 0.5 than 0.3, so with-pack stddev is HIGHER here — the + // pass-rate is the headline; stddev measures distance from a decided agent. + assert.ok(withPack.stddev > 0 && withoutPack.stddev > 0); + }); +}); + +describe("dispersionScalar", () => { + it("uses the distinct ratio for output_hash", () => { + const d: ArmDispersion = { kind: "output_hash", distinctRatio: 0.7, runs: 10 }; + assert.equal(dispersionScalar(d), 0.7); + }); + it("uses the stddev for assertion", () => { + const d: ArmDispersion = { kind: "assertion", passRate: 0.6, stddev: 0.49, runs: 10 }; + assert.equal(dispersionScalar(d), 0.49); + }); + it("uses the stddev for judge", () => { + const d: ArmDispersion = { kind: "judge", meanScore: 0.8, stddev: 0.12, runs: 10 }; + assert.equal(dispersionScalar(d), 0.12); + }); +}); diff --git a/packages/eval/src/dispersion.ts b/packages/eval/src/dispersion.ts new file mode 100644 index 0000000..689b778 --- /dev/null +++ b/packages/eval/src/dispersion.ts @@ -0,0 +1,99 @@ +/** + * Dispersion statistics — the per-arm "how much did the answer wander" numbers + * (spec 010 §2). Pure functions over an arm's N outcomes; no I/O, no clock. + * These are the most safety-critical part of the probe, so they're isolated + * here and unit-covered exhaustively. + * + * Each oracle type maps to one dispersion statistic: + * - `output_hash` → distinct-output ratio = (# distinct outputs) / N + * - `assertion` → pass-rate + failure-rate stddev (Bernoulli) across N + * - `judge` → stddev of rubric scores across N + * + * Lower dispersion = the agent's behavior is more stable run-to-run. The + * Move-2 claim holds when the with-pack arm's dispersion is materially below + * the without-pack arm's. + */ + +/** Population standard deviation of a sample. Returns 0 for <2 values. */ +export function populationStddev(values: readonly number[]): number { + const n = values.length; + if (n < 2) return 0; + let sum = 0; + for (const v of values) sum += v; + const mean = sum / n; + let acc = 0; + for (const v of values) { + const d = v - mean; + acc += d * d; + } + return Math.sqrt(acc / n); +} + +/** Arithmetic mean. Returns 0 for an empty input. */ +export function mean(values: readonly number[]): number { + const n = values.length; + if (n === 0) return 0; + let sum = 0; + for (const v of values) sum += v; + return sum / n; +} + +/** + * Distinct-output ratio for the `output_hash` oracle. `1.0` = every run + * produced a different output (maximally unstable); `1/N` = every run produced + * the same output (perfectly stable). Returns 0 for an empty input. + */ +export function distinctOutputRatio(outputs: readonly string[]): number { + const n = outputs.length; + if (n === 0) return 0; + return new Set(outputs).size / n; +} + +/** + * Bernoulli (pass/fail) dispersion for the `assertion` oracle. `passes[i]` is + * true when run i passed. Returns the pass rate and the population stddev of + * the pass indicator (treating pass=1, fail=0). For a Bernoulli sample the + * stddev is `sqrt(p*(1-p))`, maximized at p=0.5 (a coin-flip agent) and zero + * when the agent is perfectly consistent (all-pass or all-fail). + */ +export function bernoulliDispersion(passes: readonly boolean[]): { + readonly passRate: number; + readonly stddev: number; +} { + const n = passes.length; + if (n === 0) return { passRate: 0, stddev: 0 }; + const indicators = passes.map((p) => (p ? 1 : 0)); + return { passRate: mean(indicators), stddev: populationStddev(indicators) }; +} + +/** Discriminated dispersion result, tagged by the oracle that produced it. */ +export type ArmDispersion = + | { readonly kind: "output_hash"; readonly distinctRatio: number; readonly runs: number } + | { + readonly kind: "assertion"; + readonly passRate: number; + readonly stddev: number; + readonly runs: number; + } + | { + readonly kind: "judge"; + readonly meanScore: number; + readonly stddev: number; + readonly runs: number; + }; + +/** + * The single scalar each `ArmDispersion` reduces to for the with/without delta. + * Lower = more stable. For `output_hash` it's the distinct ratio; for + * `assertion` and `judge` it's the stddev (pass-rate / score stability). + */ +export function dispersionScalar(d: ArmDispersion): number { + switch (d.kind) { + case "output_hash": + return d.distinctRatio; + case "assertion": + return d.stddev; + case "judge": + return d.stddev; + } +} diff --git a/packages/eval/src/index.ts b/packages/eval/src/index.ts new file mode 100644 index 0000000..65efabd --- /dev/null +++ b/packages/eval/src/index.ts @@ -0,0 +1,66 @@ +/** + * `@opencodehub/eval` — the variance probe (spec 010 / Move 2). + * + * Public surface: load a task, run the with/without experiment via a runner, + * score it with the task's oracle, and emit a deterministic report. v1 ships + * the Bedrock-wired direct-CLI runner; the probe core is runner-agnostic. + */ + +export { + buildAgentEnv, + buildArgv, + CliAgentRunner, + type CliRunnerConfig, + composePrompt, + DEFAULT_CLAUDE_MODEL, + DEFAULT_CODEX_MODEL, + parseClaudeOutput, + parseCodexOutput, + type SpawnFn, + type SpawnResult, +} from "./cli-runner.js"; +export { + type ArmDispersion, + bernoulliDispersion, + dispersionScalar, + distinctOutputRatio, + mean, + populationStddev, +} from "./dispersion.js"; +export { type JudgeScorer, type ScoreOptions, scoreArm } from "./oracle.js"; +export { + DEFAULT_RUNS, + type ProbeOptions, + type ProbeRunEvent, + probeHarness, + resolveHarnesses, + runProbe, +} from "./probe.js"; +export { + type ArmReport, + type ArmTokens, + buildHarnessReport, + formatReport, + type HarnessReport, + serializeReport, + TOKEN_OVERHEAD_FLAG, + type VarianceReport, +} from "./report.js"; +export type { + AgentRunner, + Harness, + RunOutcome, + RunRequest, + RunTokens, +} from "./runner.js"; +export { + type AssertionOracle, + type JudgeOracle, + loadTask, + type Oracle, + OracleSchema, + type OutputHashOracle, + type Task, + TaskSchema, + TaskValidationError, +} from "./task.js"; diff --git a/packages/eval/src/oracle.test.ts b/packages/eval/src/oracle.test.ts new file mode 100644 index 0000000..6565e4c --- /dev/null +++ b/packages/eval/src/oracle.test.ts @@ -0,0 +1,150 @@ +import { strict as assert } from "node:assert"; +import { mkdtemp, rm } from "node:fs/promises"; +import { tmpdir } from "node:os"; +import { join } from "node:path"; +import { after, before, describe, it } from "node:test"; +import { scoreArm } from "./oracle.js"; +import type { RunOutcome } from "./runner.js"; +import type { AssertionOracle, JudgeOracle, OutputHashOracle } from "./task.js"; + +const outcome = (over: Partial): RunOutcome => ({ + finalText: "", + diff: "", + tokens: { inputTokens: 0, outputTokens: 0, costUsd: null }, + errored: false, + ...over, +}); + +describe("scoreArm — output_hash", () => { + const oracle: OutputHashOracle = { type: "output_hash", field: "final_text" }; + + it("is 1/N (stable) when every final_text is identical", async () => { + const outcomes = [ + outcome({ finalText: "same" }), + outcome({ finalText: "same" }), + outcome({ finalText: "same" }), + ]; + const d = await scoreArm(oracle, outcomes); + assert.equal(d.kind, "output_hash"); + if (d.kind === "output_hash") assert.ok(Math.abs(d.distinctRatio - 1 / 3) < 1e-9); + }); + + it("counts an errored run as its own distinct outcome", async () => { + const outcomes = [ + outcome({ finalText: "x" }), + outcome({ finalText: "x" }), + outcome({ finalText: "x", errored: true }), + ]; + const d = await scoreArm(oracle, outcomes); + if (d.kind === "output_hash") assert.ok(Math.abs(d.distinctRatio - 2 / 3) < 1e-9); + }); + + it("hashes the diff field when configured", async () => { + const diffOracle: OutputHashOracle = { type: "output_hash", field: "diff" }; + const outcomes = [ + outcome({ finalText: "differ", diff: "same-diff" }), + outcome({ finalText: "wildly-different", diff: "same-diff" }), + ]; + const d = await scoreArm(diffOracle, outcomes); + if (d.kind === "output_hash") assert.equal(d.distinctRatio, 0.5, "1 distinct diff over 2 runs"); + }); +}); + +describe("scoreArm — assertion", () => { + let dir: string; + before(async () => { + dir = await mkdtemp(join(tmpdir(), "och-eval-oracle-")); + }); + after(async () => { + await rm(dir, { recursive: true, force: true }); + }); + + it("runs the shell command in each checkout: exit 0 = pass", async () => { + const oracle: AssertionOracle = { type: "assertion", command: "true", timeoutMs: 30_000 }; + const outcomes = [ + outcome({ checkoutPath: dir }), + outcome({ checkoutPath: dir }), + outcome({ checkoutPath: dir }), + ]; + const d = await scoreArm(oracle, outcomes); + assert.equal(d.kind, "assertion"); + if (d.kind === "assertion") { + assert.equal(d.passRate, 1); + assert.equal(d.stddev, 0); + } + }); + + it("scores exit-non-zero as a fail", async () => { + const oracle: AssertionOracle = { type: "assertion", command: "false", timeoutMs: 30_000 }; + const d = await scoreArm(oracle, [outcome({ checkoutPath: dir })]); + if (d.kind === "assertion") assert.equal(d.passRate, 0); + }); + + it("scores an errored run (no checkout) as a fail without spawning", async () => { + const oracle: AssertionOracle = { type: "assertion", command: "true", timeoutMs: 30_000 }; + const d = await scoreArm(oracle, [outcome({ errored: true })]); + if (d.kind === "assertion") assert.equal(d.passRate, 0); + }); + + it("produces mixed pass-rate dispersion", async () => { + // command passes only when the marker file exists; we toggle via cwd trick: + // here just alternate true/false commands to simulate a flaky agent. + const passOracle: AssertionOracle = { type: "assertion", command: "true", timeoutMs: 30_000 }; + const failOracle: AssertionOracle = { type: "assertion", command: "false", timeoutMs: 30_000 }; + const pass = await scoreArm(passOracle, [ + outcome({ checkoutPath: dir }), + outcome({ checkoutPath: dir }), + ]); + const fail = await scoreArm(failOracle, [ + outcome({ checkoutPath: dir }), + outcome({ checkoutPath: dir }), + ]); + if (pass.kind === "assertion" && fail.kind === "assertion") { + assert.equal(pass.passRate, 1); + assert.equal(fail.passRate, 0); + } + }); +}); + +describe("scoreArm — judge", () => { + const oracle: JudgeOracle = { type: "judge", rubric: "score correctness", panel: 2 }; + + it("throws a clear error when no JudgeScorer is supplied", async () => { + await assert.rejects( + () => scoreArm(oracle, [outcome({ finalText: "x" })]), + /requires a JudgeScorer/, + ); + }); + + it("averages the panel and computes score stddev", async () => { + // Deterministic judge: score = length-based fraction so two outcomes differ. + const judge = async (o: RunOutcome): Promise => (o.finalText === "good" ? 0.9 : 0.1); + const d = await scoreArm( + oracle, + [outcome({ finalText: "good" }), outcome({ finalText: "bad" })], + { judge }, + ); + assert.equal(d.kind, "judge"); + if (d.kind === "judge") { + assert.ok(Math.abs(d.meanScore - 0.5) < 1e-9); + assert.ok(d.stddev > 0); + } + }); + + it("scores an errored run as 0 without calling the judge", async () => { + let calls = 0; + const judge = async (): Promise => { + calls += 1; + return 1; + }; + const d = await scoreArm(oracle, [outcome({ errored: true })], { judge }); + if (d.kind === "judge") assert.equal(d.meanScore, 0); + assert.equal(calls, 0, "judge not called for an errored run"); + }); + + it("clamps out-of-range judge scores into [0,1]", async () => { + const judge = async (): Promise => 5; // misbehaving judge + const d = await scoreArm(oracle, [outcome({ finalText: "x" })], { judge }); + if (d.kind === "judge") assert.equal(d.meanScore, 1); + }); +}); diff --git a/packages/eval/src/oracle.ts b/packages/eval/src/oracle.ts new file mode 100644 index 0000000..ced4b77 --- /dev/null +++ b/packages/eval/src/oracle.ts @@ -0,0 +1,139 @@ +/** + * Oracle scoring — reduce an arm's N {@link RunOutcome}s to an + * {@link ArmDispersion} (spec 010 §2). + * + * - `output_hash` — pure: hash the configured field, count distinct values. + * - `assertion` — run a deterministic shell check against each run's + * checkout; pass/fail → Bernoulli dispersion. A run that errored, or whose + * command exceeds its timeout, scores as a fail (the worst outcome). + * - `judge` — LLM-panel rubric scoring. The panel call is injected as a + * dependency so the probe core stays free of any model client; v1 ships the + * interface and a guard that fails fast if no judge function is supplied. + * + * The `assertion` path is the only one that touches the filesystem / spawns a + * process; it is kept here (not in the pure dispersion module) so + * `dispersion.ts` remains a pure, exhaustively-testable leaf. + */ + +import { spawn } from "node:child_process"; +import { sha256Hex } from "@opencodehub/core-types"; +import { + type ArmDispersion, + bernoulliDispersion, + distinctOutputRatio, + populationStddev, +} from "./dispersion.js"; +import type { RunOutcome } from "./runner.js"; +import type { AssertionOracle, JudgeOracle, Oracle, OutputHashOracle } from "./task.js"; + +/** A judge-panel scorer: maps one run's outcome to a 0..1 rubric score. */ +export type JudgeScorer = (outcome: RunOutcome, rubric: string) => Promise; + +export interface ScoreOptions { + /** Required only when the task's oracle is `judge`. */ + readonly judge?: JudgeScorer; +} + +/** Score an arm's outcomes into a dispersion, dispatching on the oracle type. */ +export async function scoreArm( + oracle: Oracle, + outcomes: readonly RunOutcome[], + options: ScoreOptions = {}, +): Promise { + switch (oracle.type) { + case "output_hash": + return scoreOutputHash(oracle, outcomes); + case "assertion": + return scoreAssertion(oracle, outcomes); + case "judge": + return scoreJudge(oracle, outcomes, options.judge); + } +} + +function scoreOutputHash(oracle: OutputHashOracle, outcomes: readonly RunOutcome[]): ArmDispersion { + const hashes = outcomes.map((o) => { + const text = oracle.field === "diff" ? o.diff : o.finalText; + // An errored run hashes to a sentinel so a crash counts as its own distinct + // outcome rather than colliding with an empty-answer success. + return o.errored ? `__errored__:${sha256Hex(text)}` : sha256Hex(text); + }); + return { kind: "output_hash", distinctRatio: distinctOutputRatio(hashes), runs: outcomes.length }; +} + +async function scoreAssertion( + oracle: AssertionOracle, + outcomes: readonly RunOutcome[], +): Promise { + const passes: boolean[] = []; + for (const outcome of outcomes) { + if (outcome.errored || outcome.checkoutPath === undefined) { + // No checkout to assert against, or the agent crashed → worst outcome. + passes.push(false); + continue; + } + passes.push(await runAssertionCommand(oracle, outcome.checkoutPath)); + } + const { passRate, stddev } = bernoulliDispersion(passes); + return { kind: "assertion", passRate, stddev, runs: outcomes.length }; +} + +async function scoreJudge( + oracle: JudgeOracle, + outcomes: readonly RunOutcome[], + judge: JudgeScorer | undefined, +): Promise { + if (judge === undefined) { + throw new Error( + "eval: the `judge` oracle requires a JudgeScorer to be supplied to scoreArm; " + + "none was provided. Pass `options.judge` or use the `assertion`/`output_hash` oracle.", + ); + } + const scores: number[] = []; + for (const outcome of outcomes) { + if (outcome.errored) { + scores.push(0); // a crash is the worst rubric score + continue; + } + // Average the panel: call the judge `panel` times and mean the results, so + // judge-side noise doesn't inflate the agent-side variance we're measuring. + const panelScores: number[] = []; + for (let i = 0; i < oracle.panel; i += 1) { + panelScores.push(clamp01(await judge(outcome, oracle.rubric))); + } + scores.push(panelScores.reduce((a, b) => a + b, 0) / panelScores.length); + } + const meanScore = scores.length === 0 ? 0 : scores.reduce((a, b) => a + b, 0) / scores.length; + return { kind: "judge", meanScore, stddev: populationStddev(scores), runs: outcomes.length }; +} + +function clamp01(n: number): number { + if (!Number.isFinite(n)) return 0; + if (n < 0) return 0; + if (n > 1) return 1; + return n; +} + +/** + * Run the assertion command in the run's checkout. Resolves `true` on exit + * code 0, `false` otherwise (including spawn failure and timeout). Never + * rejects — a broken check is a failed assertion, not a probe crash. + */ +function runAssertionCommand(oracle: AssertionOracle, checkoutPath: string): Promise { + return new Promise((resolve) => { + const cwd = oracle.cwd !== undefined ? `${checkoutPath}/${oracle.cwd}` : checkoutPath; + const child = spawn(oracle.command, { + cwd, + shell: true, + stdio: "ignore", + timeout: oracle.timeoutMs, + }); + let settled = false; + const settle = (passed: boolean): void => { + if (settled) return; + settled = true; + resolve(passed); + }; + child.on("error", () => settle(false)); + child.on("close", (code) => settle(code === 0)); + }); +} diff --git a/packages/eval/src/probe.test.ts b/packages/eval/src/probe.test.ts new file mode 100644 index 0000000..13c73d0 --- /dev/null +++ b/packages/eval/src/probe.test.ts @@ -0,0 +1,127 @@ +import { strict as assert } from "node:assert"; +import { describe, it } from "node:test"; +import { DEFAULT_RUNS, resolveHarnesses, runProbe } from "./probe.js"; +import { serializeReport } from "./report.js"; +import type { AgentRunner, Harness, RunOutcome, RunRequest } from "./runner.js"; +import type { Task } from "./task.js"; + +const TASK: Task = { + id: "demo", + repo: "/tmp/repo", + commit: "c0ffee", + instruction: "Add a --json flag.", + oracle: { type: "output_hash", field: "final_text" }, +}; + +/** + * A deterministic fake runner: in the with-pack arm it returns a constant + * answer (perfectly stable); in the without-pack arm it returns a per-run + * distinct answer (maximally unstable). This is the canonical Move-2 shape — + * the pack should drive the dispersion delta strongly positive. + */ +class FakeRunner implements AgentRunner { + readonly name: string; + private i = 0; + constructor(public readonly harness: Harness) { + this.name = `fake:${harness}`; + } + run(request: RunRequest): Promise { + this.i += 1; + const finalText = request.withPack ? "stable answer" : `wandering answer ${this.i}`; + return Promise.resolve({ + finalText, + diff: "", + tokens: { + inputTokens: request.withPack ? 1100 : 1000, + outputTokens: 100, + costUsd: null, + }, + errored: false, + }); + } +} + +describe("resolveHarnesses", () => { + it("defaults to both agents when the task pins none", () => { + assert.deepEqual(resolveHarnesses(TASK, { packContext: "" }), ["claude", "codex"]); + }); + it("uses the task's pinned harness", () => { + assert.deepEqual(resolveHarnesses({ ...TASK, harness: "codex" }, { packContext: "" }), [ + "codex", + ]); + }); + it("honors an explicit options.harnesses override", () => { + assert.deepEqual(resolveHarnesses(TASK, { packContext: "", harnesses: ["claude"] }), [ + "claude", + ]); + }); +}); + +describe("runProbe (end-to-end with a fake runner)", () => { + it("measures a strong positive dispersion delta when the pack stabilizes answers", async () => { + const report = await runProbe(TASK, (h) => new FakeRunner(h), { + runs: 4, + packContext: "PACK BODY", + harnesses: ["claude"], + }); + assert.equal(report.schema, 1); + assert.equal(report.taskId, "demo"); + assert.equal(report.harnesses.length, 1); + const h = report.harnesses[0]; + assert.ok(h !== undefined); + // without-pack: 4 distinct answers → distinctRatio 1.0 + // with-pack: 1 distinct answer → distinctRatio 0.25 + if (h.without.dispersion.kind === "output_hash") { + assert.equal(h.without.dispersion.distinctRatio, 1); + } + if (h.with.dispersion.kind === "output_hash") { + assert.equal(h.with.dispersion.distinctRatio, 0.25); + } + assert.ok(Math.abs(h.dispersionDelta - 0.75) < 1e-9, "delta = 1.0 − 0.25"); + // token overhead: with=1200/run, without=1100/run → 1.0909... + assert.ok(h.tokenOverhead > 1 && h.tokenOverhead < 1.3); + assert.equal(h.tokenOverheadFlagged, false); + }); + + it("runs both harnesses by default", async () => { + const report = await runProbe(TASK, (h) => new FakeRunner(h), { + runs: 2, + packContext: "PACK", + }); + assert.deepEqual( + report.harnesses.map((h) => h.harness), + ["claude", "codex"], + ); + }); + + it("emits a byte-identical report across two identical probe runs (R6)", async () => { + const opts = { runs: 3, packContext: "PACK", harnesses: ["codex"] as Harness[] }; + const a = await runProbe(TASK, (h) => new FakeRunner(h), opts); + const b = await runProbe(TASK, (h) => new FakeRunner(h), opts); + assert.equal(serializeReport(a), serializeReport(b)); + }); + + it("defaults to DEFAULT_RUNS when runs is omitted", async () => { + const report = await runProbe(TASK, (h) => new FakeRunner(h), { + packContext: "PACK", + harnesses: ["claude"], + }); + assert.equal(report.harnesses[0]?.runs, DEFAULT_RUNS); + }); + + it("invokes the per-run progress callback for both arms", async () => { + const events: string[] = []; + await runProbe(TASK, (h) => new FakeRunner(h), { + runs: 2, + packContext: "PACK", + harnesses: ["claude"], + onRun: (e) => events.push(`${e.harness}:${e.arm}:${e.index}/${e.runs}`), + }); + assert.deepEqual(events, [ + "claude:without:1/2", + "claude:without:2/2", + "claude:with:1/2", + "claude:with:2/2", + ]); + }); +}); diff --git a/packages/eval/src/probe.ts b/packages/eval/src/probe.ts new file mode 100644 index 0000000..4432638 --- /dev/null +++ b/packages/eval/src/probe.ts @@ -0,0 +1,153 @@ +/** + * The variance-probe experiment loop (spec 010 §3). + * + * For each harness, for each arm (without-pack, with-pack), run the agent N + * times via the injected {@link AgentRunner}, capturing each outcome. Score the + * arm's outcomes with the task's oracle, sum tokens, and assemble a + * {@link HarnessReport}. The loop is harness- and inference-backend-agnostic: + * everything agent-specific lives behind the runner; everything model-specific + * (Bedrock wiring) lives in the runner impl. + * + * Controls that keep the number honest (§3): same instruction / commit / agent + * / model across both arms — the runner enforces this by taking identical + * inputs and flipping only `withPack`. Fresh session per run is the runner's + * responsibility (the CLI runner spawns a new process each call). + */ + +import { type ScoreOptions, scoreArm } from "./oracle.js"; +import { + type ArmReport, + buildHarnessReport, + type HarnessReport, + type VarianceReport, +} from "./report.js"; +import type { AgentRunner, Harness, RunOutcome } from "./runner.js"; +import type { Task } from "./task.js"; + +/** Default runs per arm (spec 010 §7.2). */ +export const DEFAULT_RUNS = 10; + +export interface ProbeOptions { + /** Runs per arm. Defaults to {@link DEFAULT_RUNS}. */ + readonly runs?: number; + /** + * Which harnesses to run. Defaults to the task's `harness` (one) or the + * full set ["claude", "codex"] when the task pins none (§7.3). + */ + readonly harnesses?: readonly Harness[]; + /** + * The OCH pack context to inject in the with-pack arm. The CLI generates + * this once per task and passes it in, so the probe never imports + * `@opencodehub/pack` (keeps the package graph acyclic). + */ + readonly packContext: string; + /** Required only when the task's oracle is `judge`. */ + readonly score?: ScoreOptions; + /** + * Per-run progress callback. Pure-side-channel — never affects the report. + */ + readonly onRun?: (event: ProbeRunEvent) => void; +} + +export interface ProbeRunEvent { + readonly harness: Harness; + readonly arm: "without" | "with"; + /** 1-based run index within the arm. */ + readonly index: number; + readonly runs: number; +} + +/** Resolve which harnesses to run from the task + options. */ +export function resolveHarnesses(task: Task, options: ProbeOptions): readonly Harness[] { + if (options.harnesses !== undefined && options.harnesses.length > 0) { + return options.harnesses; + } + if (task.harness !== undefined) return [task.harness]; + return ["claude", "codex"]; +} + +/** Sum an arm's per-run token accounting into {@link ArmReport} totals. */ +function sumTokens(outcomes: readonly RunOutcome[]): ArmReport["tokens"] { + let inputTokens = 0; + let outputTokens = 0; + let costUsd = 0; + let everyRunHadCost = true; + for (const o of outcomes) { + inputTokens += o.tokens.inputTokens; + outputTokens += o.tokens.outputTokens; + if (o.tokens.costUsd === null) everyRunHadCost = false; + else costUsd += o.tokens.costUsd; + } + return { inputTokens, outputTokens, costUsd: everyRunHadCost ? costUsd : null }; +} + +/** Run one arm: N invocations of the agent, then score + total tokens. */ +async function runArm( + runner: AgentRunner, + task: Task, + harness: Harness, + withPack: boolean, + options: ProbeOptions, + runs: number, +): Promise { + const packContext = options.packContext; + const outcomes: RunOutcome[] = []; + for (let i = 0; i < runs; i += 1) { + options.onRun?.({ harness, arm: withPack ? "with" : "without", index: i + 1, runs }); + const outcome = await runner.run({ + task, + harness, + withPack, + ...(withPack ? { packContext } : {}), + }); + outcomes.push(outcome); + } + const dispersion = await scoreArm(task.oracle, outcomes, options.score ?? {}); + return { dispersion, tokens: sumTokens(outcomes) }; +} + +/** + * Run the full probe for one harness: both arms + report assembly. + */ +export async function probeHarness( + runner: AgentRunner, + task: Task, + harness: Harness, + options: ProbeOptions, +): Promise { + const runs = options.runs ?? DEFAULT_RUNS; + // without-pack first, then with-pack — order is immaterial to the report + // (it's a pure function of the captured outcomes), but running the cheaper + // baseline arm first surfaces a misconfigured runner before the with-pack + // arm spends pack-inflated tokens. + const without = await runArm(runner, task, harness, false, options, runs); + const withPack = await runArm(runner, task, harness, true, options, runs); + return buildHarnessReport({ + harness, + runner: runner.name, + runs, + without, + with: withPack, + }); +} + +/** + * Run the variance probe across every resolved harness and assemble the + * top-level {@link VarianceReport}. + * + * `runnerFor` maps a harness to its runner — the CLI layer supplies a factory + * that returns the Bedrock-wired direct-CLI runner for each agent. Injecting it + * keeps the probe core free of any process-spawning code. + */ +export async function runProbe( + task: Task, + runnerFor: (harness: Harness) => AgentRunner, + options: ProbeOptions, +): Promise { + const harnesses = resolveHarnesses(task, options); + const reports: HarnessReport[] = []; + for (const harness of harnesses) { + reports.push(await probeHarness(runnerFor(harness), task, harness, options)); + } + return { schema: 1, taskId: task.id, harnesses: reports }; +} diff --git a/packages/eval/src/report.test.ts b/packages/eval/src/report.test.ts new file mode 100644 index 0000000..81c887d --- /dev/null +++ b/packages/eval/src/report.test.ts @@ -0,0 +1,121 @@ +import { strict as assert } from "node:assert"; +import { describe, it } from "node:test"; +import type { ArmDispersion } from "./dispersion.js"; +import { + type ArmReport, + buildHarnessReport, + formatReport, + serializeReport, + TOKEN_OVERHEAD_FLAG, + type VarianceReport, +} from "./report.js"; + +const assertionDispersion = (passRate: number, stddev: number): ArmDispersion => ({ + kind: "assertion", + passRate, + stddev, + runs: 10, +}); + +const arm = (stddev: number, input: number, output: number): ArmReport => ({ + dispersion: assertionDispersion(0.5, stddev), + tokens: { inputTokens: input, outputTokens: output, costUsd: null }, +}); + +describe("buildHarnessReport", () => { + it("computes the without − with dispersion delta (positive = pack helped)", () => { + const report = buildHarnessReport({ + harness: "claude", + runner: "cli:claude", + runs: 10, + without: arm(0.5, 1000, 200), + with: arm(0.2, 1100, 220), + }); + assert.ok(Math.abs(report.dispersionDelta - 0.3) < 1e-9, "0.5 − 0.2 = 0.3"); + }); + + it("computes token overhead as with/without totals", () => { + const report = buildHarnessReport({ + harness: "codex", + runner: "cli:codex", + runs: 10, + without: arm(0.5, 1000, 0), + with: arm(0.2, 1100, 0), + }); + assert.ok(Math.abs(report.tokenOverhead - 1.1) < 1e-9); + assert.equal(report.tokenOverheadFlagged, false, "1.1× is under the 1.3× flag"); + }); + + it("flags when token overhead exceeds the guardrail", () => { + const report = buildHarnessReport({ + harness: "claude", + runner: "cli:claude", + runs: 10, + without: arm(0.5, 1000, 0), + with: arm(0.2, 1400, 0), + }); + assert.ok(report.tokenOverhead > TOKEN_OVERHEAD_FLAG); + assert.equal(report.tokenOverheadFlagged, true); + }); + + it("reports overhead 0 (never Infinity/NaN) when the baseline arm spent no tokens", () => { + const report = buildHarnessReport({ + harness: "claude", + runner: "cli:claude", + runs: 1, + without: arm(0, 0, 0), + with: arm(0, 500, 100), + }); + assert.equal(report.tokenOverhead, 0); + assert.equal(report.tokenOverheadFlagged, false); + }); +}); + +describe("serializeReport (determinism, R6)", () => { + const report: VarianceReport = { + schema: 1, + taskId: "demo-task", + harnesses: [ + buildHarnessReport({ + harness: "claude", + runner: "cli:claude", + runs: 10, + without: arm(0.5, 1000, 200), + with: arm(0.2, 1100, 220), + }), + ], + }; + + it("is a pure function of the report (byte-identical across calls)", () => { + assert.equal(serializeReport(report), serializeReport(report)); + }); + + it("sorts object keys canonically (no clock/run-id leaks in)", () => { + const json = serializeReport(report); + // canonicalJson sorts keys: "harnesses" before "schema" before "taskId". + assert.ok(json.startsWith('{"harnesses":')); + assert.ok(!json.includes("Date")); + assert.ok(!json.includes("timestamp")); + }); +}); + +describe("formatReport", () => { + it("renders the flag marker when overhead is high", () => { + const flagged: VarianceReport = { + schema: 1, + taskId: "t", + harnesses: [ + buildHarnessReport({ + harness: "claude", + runner: "cli:claude", + runs: 10, + without: arm(0.5, 1000, 0), + with: arm(0.2, 1500, 0), + }), + ], + }; + const text = formatReport(flagged); + assert.ok(text.includes("FLAG")); + assert.ok(text.includes("token overhead")); + }); +}); diff --git a/packages/eval/src/report.ts b/packages/eval/src/report.ts new file mode 100644 index 0000000..ec31209 --- /dev/null +++ b/packages/eval/src/report.ts @@ -0,0 +1,145 @@ +/** + * The variance-probe report (spec 010 §5, R6). + * + * The report is a **pure function of the captured run outcomes** — no + * wall-clock, no run-id, no absolute paths. Two probe runs over the same + * captured outcomes serialize byte-identically (the context-bom discipline). + * Serialization goes through `core-types`' `canonicalJson` (sorted keys), so + * the emitted `--json` is reproducible. + * + * Token overhead is a first-class output, not a footnote: the paper's claim is + * "halves variance at ~10% more tokens", so a probe that halves variance at + * 3× tokens is a worse story. The report flags (never fails) when overhead + * exceeds {@link TOKEN_OVERHEAD_FLAG} — "you bought stability expensively." + */ + +import { canonicalJson } from "@opencodehub/core-types"; +import type { ArmDispersion } from "./dispersion.js"; +import { dispersionScalar } from "./dispersion.js"; + +/** + * Token-overhead guardrail (spec 010 §7.4). Above this ratio the report flags + * that stability was bought expensively. A reported constant, never a gate. + */ +export const TOKEN_OVERHEAD_FLAG = 1.3; + +/** Aggregate token totals for one arm. */ +export interface ArmTokens { + readonly inputTokens: number; + readonly outputTokens: number; + /** Sum of per-run cost when every run reported it; `null` otherwise. */ + readonly costUsd: number | null; +} + +/** One arm's measured result (with-pack or without-pack). */ +export interface ArmReport { + readonly dispersion: ArmDispersion; + readonly tokens: ArmTokens; +} + +/** The full per-harness probe result. */ +export interface HarnessReport { + /** Which agent produced this result (e.g. "claude", "codex"). */ + readonly harness: string; + /** Runner name (e.g. "cli:claude"). */ + readonly runner: string; + readonly runs: number; + readonly without: ArmReport; + readonly with: ArmReport; + /** + * `without − with` of the dispersion scalar. Positive = the pack reduced + * variance (the Move-2 claim). The headline number. + */ + readonly dispersionDelta: number; + /** with-pack tokens / without-pack tokens (total in+out). */ + readonly tokenOverhead: number; + /** True when `tokenOverhead` exceeds {@link TOKEN_OVERHEAD_FLAG}. */ + readonly tokenOverheadFlagged: boolean; +} + +/** The top-level report the probe emits. */ +export interface VarianceReport { + /** Report schema version, so consumers can branch on shape changes. */ + readonly schema: 1; + /** The task id this report measures. */ + readonly taskId: string; + /** One entry per harness the probe ran. */ + readonly harnesses: readonly HarnessReport[]; +} + +/** Sum input+output tokens for an arm. */ +function totalTokens(t: ArmTokens): number { + return t.inputTokens + t.outputTokens; +} + +/** + * Assemble a {@link HarnessReport} from two scored arms + their token totals. + * Pure: identical inputs → identical output. + */ +export function buildHarnessReport(input: { + readonly harness: string; + readonly runner: string; + readonly runs: number; + readonly without: ArmReport; + readonly with: ArmReport; +}): HarnessReport { + const dispersionDelta = + dispersionScalar(input.without.dispersion) - dispersionScalar(input.with.dispersion); + const withoutTotal = totalTokens(input.without.tokens); + const withTotal = totalTokens(input.with.tokens); + // Overhead is undefined when the baseline arm spent no tokens; report 0 in + // that degenerate case rather than Infinity/NaN, and never flag it. + const tokenOverhead = withoutTotal === 0 ? 0 : withTotal / withoutTotal; + return { + harness: input.harness, + runner: input.runner, + runs: input.runs, + without: input.without, + with: input.with, + dispersionDelta, + tokenOverhead, + tokenOverheadFlagged: tokenOverhead > TOKEN_OVERHEAD_FLAG, + }; +} + +/** + * Canonical JSON for the report (R6). Sorted keys, no clock/run-id — byte-stable + * across processes given the same captured outcomes. + */ +export function serializeReport(report: VarianceReport): string { + return canonicalJson(report); +} + +/** + * Render a short human-readable summary of a report. Kept separate from the + * machine JSON so the CLI can print one or the other. + */ +export function formatReport(report: VarianceReport): string { + const lines: string[] = []; + lines.push(`Variance probe — task: ${report.taskId}`); + for (const h of report.harnesses) { + lines.push(""); + lines.push(` ${h.harness} (${h.runner}, N=${h.runs})`); + lines.push(` without-pack dispersion: ${fmtDispersion(h.without.dispersion)}`); + lines.push(` with-pack dispersion: ${fmtDispersion(h.with.dispersion)}`); + lines.push(` delta (without − with): ${h.dispersionDelta.toFixed(4)}`); + lines.push( + ` token overhead: ${h.tokenOverhead.toFixed(3)}×` + + (h.tokenOverheadFlagged + ? ` [FLAG: > ${TOKEN_OVERHEAD_FLAG}× — stability bought expensively]` + : ""), + ); + } + return lines.join("\n"); +} + +function fmtDispersion(d: ArmDispersion): string { + switch (d.kind) { + case "output_hash": + return `distinct-output ratio ${d.distinctRatio.toFixed(4)}`; + case "assertion": + return `pass-rate ${d.passRate.toFixed(4)} (stddev ${d.stddev.toFixed(4)})`; + case "judge": + return `mean-score ${d.meanScore.toFixed(4)} (stddev ${d.stddev.toFixed(4)})`; + } +} diff --git a/packages/eval/src/runner.ts b/packages/eval/src/runner.ts new file mode 100644 index 0000000..d826284 --- /dev/null +++ b/packages/eval/src/runner.ts @@ -0,0 +1,86 @@ +/** + * `AgentRunner` — the harness-agnostic seam the probe drives (spec 010 §4). + * + * The probe core knows nothing about *how* an agent runs; it only asks a + * runner to execute a task in one arm and hand back the captured outcome. v1 + * ships a direct-CLI runner (`cli-runner.ts`) that shells out to `claude -p` / + * `codex exec` with Bedrock wired (§4a). A future omnigent-backed runner (v2) + * implements the same interface and drops in without touching the probe. + * + * Determinism note: a runner is free to be nondeterministic *within* an arm + * (that's the variance being measured) but must hold every controlled input — + * commit, instruction, agent, model — identical *between* arms. The only + * manipulated variable is `withPack`. + */ + +import type { Task } from "./task.js"; + +/** Which coding agent a runner drives. */ +export type Harness = "claude" | "codex"; + +/** Inputs handed to a runner for a single agent invocation. */ +export interface RunRequest { + /** The task being run (repo/commit/instruction). */ + readonly task: Task; + /** Which agent to drive. */ + readonly harness: Harness; + /** + * When true, the OCH code-pack for `repo@commit` is injected into the + * agent's context (the with-pack arm); when false, only the bare + * instruction is given (the without-pack arm). + */ + readonly withPack: boolean; + /** + * The pack context to inject when `withPack` is true. The CLI layer + * generates the pack once per task and threads it in here, so the probe + * core (and `@opencodehub/eval`) never depends on `@opencodehub/pack` — + * keeping the package boundary acyclic. Absent on the without-pack arm. + */ + readonly packContext?: string; +} + +/** Token accounting captured from one agent run. */ +export interface RunTokens { + readonly inputTokens: number; + readonly outputTokens: number; + /** Total cost in USD when the harness reports it; `null` when unavailable. */ + readonly costUsd: number | null; +} + +/** What a runner captures from a single agent invocation. */ +export interface RunOutcome { + /** The agent's final answer text. */ + readonly finalText: string; + /** + * The unified diff the agent produced, when the harness exposes one. + * Empty string when the run produced no patch or the harness doesn't + * surface diffs. + */ + readonly diff: string; + /** Token + cost accounting for the run. */ + readonly tokens: RunTokens; + /** + * Path to the (possibly mutated) repo checkout this run produced, so an + * `assertion` oracle can execute its check against the run's result. Absent + * when the runner does not materialize a per-run checkout. + */ + readonly checkoutPath?: string; + /** + * True when the agent invocation itself failed (non-zero exit, crash, + * timeout) as opposed to completing with a (possibly wrong) answer. A failed + * run still counts toward the arm — a pack that makes the agent crash less is + * lower variance — but the oracle treats it as the worst outcome. + */ + readonly errored: boolean; +} + +/** + * Runs an agent for a single task invocation. Implementations: the direct-CLI + * runner (v1) and, later, an omnigent-backed runner (v2). + */ +export interface AgentRunner { + /** Stable name for the report (e.g. "cli:claude", "cli:codex"). */ + readonly name: string; + /** Execute one invocation and capture its outcome. */ + run(request: RunRequest): Promise; +} diff --git a/packages/eval/src/task.test.ts b/packages/eval/src/task.test.ts new file mode 100644 index 0000000..eeff721 --- /dev/null +++ b/packages/eval/src/task.test.ts @@ -0,0 +1,116 @@ +import { strict as assert } from "node:assert"; +import { mkdtemp, rm, writeFile } from "node:fs/promises"; +import { tmpdir } from "node:os"; +import { join } from "node:path"; +import { after, before, describe, it } from "node:test"; +import { loadTask, TaskValidationError } from "./task.js"; + +describe("loadTask", () => { + let dir: string; + before(async () => { + dir = await mkdtemp(join(tmpdir(), "och-eval-task-")); + }); + after(async () => { + await rm(dir, { recursive: true, force: true }); + }); + + async function write(name: string, body: string): Promise { + const p = join(dir, name); + await write_(p, body); + return p; + } + async function write_(p: string, body: string): Promise { + await writeFile(p, body, "utf8"); + } + + it("loads a valid YAML task with an assertion oracle and applies defaults", async () => { + const p = await write( + "ok.yaml", + [ + "id: add-json-flag", + "repo: /tmp/some-repo", + "commit: abc123", + "instruction: Add a --json flag to status.", + "oracle:", + " type: assertion", + " command: npm test", + "", + ].join("\n"), + ); + const task = await loadTask(p); + assert.equal(task.id, "add-json-flag"); + assert.equal(task.oracle.type, "assertion"); + if (task.oracle.type === "assertion") { + assert.equal(task.oracle.command, "npm test"); + assert.equal(task.oracle.timeoutMs, 120_000, "default timeout applied"); + } + assert.equal(task.harness, undefined, "harness optional"); + }); + + it("loads a valid JSON task (yaml parser is a JSON superset)", async () => { + const p = await write( + "ok.json", + JSON.stringify({ + id: "t", + repo: "/r", + commit: "c", + instruction: "do it", + oracle: { type: "output_hash" }, + harness: "claude", + }), + ); + const task = await loadTask(p); + assert.equal(task.harness, "claude"); + assert.equal(task.oracle.type, "output_hash"); + if (task.oracle.type === "output_hash") { + assert.equal(task.oracle.field, "final_text", "default field applied"); + } + }); + + it("throws TaskValidationError on a missing file", async () => { + await assert.rejects(() => loadTask(join(dir, "nope.yaml")), TaskValidationError); + }); + + it("throws TaskValidationError on an empty file", async () => { + const p = await write("empty.yaml", ""); + await assert.rejects(() => loadTask(p), TaskValidationError); + }); + + it("throws TaskValidationError on a schema violation (missing instruction)", async () => { + const p = await write( + "bad.yaml", + ["id: t", "repo: /r", "commit: c", "oracle:", " type: output_hash", ""].join("\n"), + ); + await assert.rejects( + () => loadTask(p), + (err: unknown) => err instanceof TaskValidationError && /instruction/.test(err.message), + ); + }); + + it("rejects an unknown oracle type", async () => { + const p = await write( + "badoracle.yaml", + ["id: t", "repo: /r", "commit: c", "instruction: x", "oracle:", " type: telepathy", ""].join( + "\n", + ), + ); + await assert.rejects(() => loadTask(p), TaskValidationError); + }); + + it("rejects unknown top-level keys (strict schema)", async () => { + const p = await write( + "extra.yaml", + [ + "id: t", + "repo: /r", + "commit: c", + "instruction: x", + "oracle:", + " type: output_hash", + "surprise: true", + "", + ].join("\n"), + ); + await assert.rejects(() => loadTask(p), TaskValidationError); + }); +}); diff --git a/packages/eval/src/task.ts b/packages/eval/src/task.ts new file mode 100644 index 0000000..dda6288 --- /dev/null +++ b/packages/eval/src/task.ts @@ -0,0 +1,160 @@ +/** + * Task definition + loader for the variance probe. + * + * A **task** is the fixed unit the probe runs repeatedly (spec 010 §2): + * + * > a triple `(repo @ commit, instruction, success_oracle)` run by a coding + * > agent, where the agent's only variable input across the experiment is + * > whether the OCH pack is in its context. + * + * The task file is a small YAML or JSON document. We validate it with Zod so a + * malformed task surfaces a precise error rather than a cryptic runtime failure + * mid-experiment (the experiment costs real agent minutes — fail fast at load). + * + * Three oracle shapes, in increasing cost (§2): + * - `output_hash` — no scoring agent; dispersion = distinct-output ratio. + * - `assertion` — a deterministic shell check; dispersion = pass-rate stddev. + * - `judge` — an LLM-panel rubric; dispersion = stddev of scores. + */ + +import { readFile } from "node:fs/promises"; +import { parse as parseYaml, YAMLParseError } from "yaml"; +import { z } from "zod"; + +/** Oracle that scores each run by whether its output text differs across N. */ +export const OutputHashOracleSchema = z + .object({ + type: z.literal("output_hash"), + /** + * Which captured field to hash for the distinct-output ratio. `final_text` + * (the agent's answer) is the default; `diff` hashes the produced patch. + */ + field: z.enum(["final_text", "diff"]).default("final_text"), + }) + .strict(); + +/** Oracle that scores each run pass/fail via a deterministic shell command. */ +export const AssertionOracleSchema = z + .object({ + type: z.literal("assertion"), + /** + * Shell command run in the (post-run) repo checkout. Exit code 0 = pass, + * non-zero = fail. This is the most defensible "variance" — it's objective. + */ + command: z.string().min(1), + /** + * Optional working directory for the command, relative to the run's repo + * checkout. Defaults to the checkout root. + */ + cwd: z.string().optional(), + /** Per-command timeout in milliseconds. Defaults to 120_000. */ + timeoutMs: z.number().int().positive().default(120_000), + }) + .strict(); + +/** Oracle that scores each run 0..1 via an LLM-judge panel rubric. */ +export const JudgeOracleSchema = z + .object({ + type: z.literal("judge"), + /** The rubric handed to the judge panel, verbatim. */ + rubric: z.string().min(1), + /** Panel size — how many independent judge runs to average per outcome. */ + panel: z.number().int().positive().default(3), + }) + .strict(); + +export const OracleSchema = z.discriminatedUnion("type", [ + OutputHashOracleSchema, + AssertionOracleSchema, + JudgeOracleSchema, +]); + +export const TaskSchema = z + .object({ + /** Human-facing task id (used to label the report). */ + id: z.string().min(1), + /** Repo location — a local path or a clonable git URL. */ + repo: z.string().min(1), + /** Pinned commit SHA. Frozen so the pack is the only variable. */ + commit: z.string().min(1), + /** The natural-language ask, given verbatim to the agent every run. */ + instruction: z.string().min(1), + /** How a run is scored. */ + oracle: OracleSchema, + /** + * Optional harness selector. Omitted → the probe runs the configured + * default set (Claude Code + Codex). A value pins one agent. + */ + harness: z.enum(["claude", "codex"]).optional(), + }) + .strict(); + +export type OutputHashOracle = z.infer; +export type AssertionOracle = z.infer; +export type JudgeOracle = z.infer; +export type Oracle = z.infer; +export type Task = z.infer; + +export class TaskValidationError extends Error { + override readonly name = "TaskValidationError"; +} + +interface NodeFsError { + readonly code?: string; +} + +function isEnoent(err: unknown): boolean { + if (typeof err !== "object" || err === null) return false; + return (err as NodeFsError).code === "ENOENT"; +} + +/** + * Read + parse + validate a task file. YAML and JSON are both accepted — the + * `yaml` parser is a JSON superset, so one code path handles both. A missing + * file, malformed document, or schema violation throws {@link TaskValidationError} + * with a precise message, so the probe never starts an expensive experiment on + * a bad task. + */ +export async function loadTask(filePath: string): Promise { + let raw: string; + try { + raw = await readFile(filePath, "utf8"); + } catch (err) { + if (isEnoent(err)) { + throw new TaskValidationError(`task file not found: ${filePath}`); + } + throw err; + } + + let parsed: unknown; + try { + parsed = parseYaml(raw); + } catch (err) { + const message = + err instanceof YAMLParseError + ? err.message + : err instanceof Error + ? err.message + : String(err); + throw new TaskValidationError(`failed to parse ${filePath}: ${message}`); + } + + if (parsed === null || parsed === undefined) { + throw new TaskValidationError(`task file is empty: ${filePath}`); + } + + const result = TaskSchema.safeParse(parsed); + if (!result.success) { + throw new TaskValidationError(`invalid task in ${filePath}: ${formatZodError(result.error)}`); + } + return result.data; +} + +function formatZodError(error: z.ZodError): string { + const parts: string[] = []; + for (const issue of error.issues) { + const path = issue.path.length > 0 ? issue.path.map((seg) => String(seg)).join(".") : ""; + parts.push(`${path}: ${issue.message}`); + } + return parts.join("; "); +} diff --git a/packages/eval/tsconfig.json b/packages/eval/tsconfig.json new file mode 100644 index 0000000..60268a8 --- /dev/null +++ b/packages/eval/tsconfig.json @@ -0,0 +1,10 @@ +{ + "extends": "../../tsconfig.base.json", + "compilerOptions": { + "rootDir": "src", + "outDir": "dist", + "composite": true + }, + "include": ["src/**/*"], + "references": [{ "path": "../core-types" }] +} diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index 4af5725..afefa57 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -173,6 +173,9 @@ importers: '@opencodehub/embedder': specifier: workspace:* version: link:../embedder + '@opencodehub/eval': + specifier: workspace:* + version: link:../eval '@opencodehub/ingestion': specifier: workspace:* version: link:../ingestion @@ -296,6 +299,25 @@ importers: specifier: 1.27.0 version: 1.27.0 + packages/eval: + dependencies: + '@opencodehub/core-types': + specifier: workspace:* + version: link:../core-types + yaml: + specifier: 2.9.0 + version: 2.9.0 + zod: + specifier: 4.4.3 + version: 4.4.3 + devDependencies: + '@types/node': + specifier: 26.0.1 + version: 26.0.1 + typescript: + specifier: 6.0.3 + version: 6.0.3 + packages/frameworks: dependencies: '@iarna/toml': @@ -7850,7 +7872,7 @@ snapshots: '@types/sax@1.2.7': dependencies: - '@types/node': 24.13.2 + '@types/node': 26.0.1 '@types/spdx-correct@3.1.3': {} @@ -11926,3 +11948,9 @@ snapshots: zod@4.4.3: {} zwitch@2.0.4: {} + +time: + '@types/node@26.0.1': '2026-06-24T20:33:01.352Z' + typescript@6.0.3: '2026-04-16T23:38:27.905Z' + yaml@2.9.0: '2026-05-11T10:16:24.045Z' + zod@4.4.3: '2026-05-04T07:06:40.819Z' diff --git a/tsconfig.json b/tsconfig.json index da0bfd8..9340841 100644 --- a/tsconfig.json +++ b/tsconfig.json @@ -11,6 +11,7 @@ { "path": "./packages/analysis" }, { "path": "./packages/pack" }, { "path": "./packages/policy" }, + { "path": "./packages/eval" }, { "path": "./packages/mcp" }, { "path": "./packages/cli" }, { "path": "./packages/summarizer" },