Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
140 changes: 140 additions & 0 deletions .erpaval/specs/010-variance-probe/spec.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
# Spec 010 — `pack --variance-probe`: measure the variance an OCH pack removes

**Status:** Approved — building (CLI-runner-first; see §7 locked answers).
**Author:** Bonk + Laith · **Date:** 2026-06-30
**Branch:** `spec/010-variance-probe` (off `main` @ `6b4d122`)
**Roadmap origin:** M-W-F run 2026-06-29, Move 2 (HIGH). Depends on the Move 6 ruling (below).
**Grounding validator:** arXiv:2606.26979 — deterministic anchoring "roughly halves run-to-run variance at ~10% more tokens."

**Decision log (2026-06-30, Laith):** "all looks good. agree on 5. cli first. just remember for claude code and codex to use bedrock for inference. go forth." → §7 answers locked; v1 ships the direct-CLI runner; omnigent runner deferred to v2; **Claude Code + Codex inference MUST route through Amazon Bedrock** (new §4a hard constraint).

**CLI-surface note:** the OCH code-pack command in this repo is `codehub code-pack` (the bare `codehub pack` is the legacy repomix snapshot). The probe attaches as `codehub code-pack --variance-probe <task-file>`. The spec text says `pack --variance-probe` as shorthand; the implementation wires onto `code-pack`.

---

## 0. The Move 6 ruling this spec is built on (decision-equivalence)

Laith ruled (2026-06-30): **pivot the contract from byte-identity to decision-equivalence.** This reframes what `--variance-probe` measures, so it's stated up front.

- **Old contract (struck):** "same `(commit, tokenizer, budget, pins)` ⇒ byte-identical pack." Brittle — the #252 embedder swap (F2LLM-v2-80M, 320-dim) breaks byte-identity while the retrieval decision is unchanged. An auditor doesn't care about bytes; they care whether the agent saw the right thing.
- **New contract:** "same inputs ⇒ provably the **same retrieval decision set** (same files + byte ranges selected under the same budget)." Byte-identity becomes one cheap *witness* of decision-equivalence, not the contract itself.
- **Why this matters for Move 2:** the variance probe is *how you measure the new contract holding*. If the pack genuinely pins the agent's decision, run-to-run answer variance drops. The probe turns "decision-equivalence" from an assertion into a number. So Move 6 (the contract) and Move 2 (the measurement) are one story: **Move 2 is the instrument that proves Move 6's claim.**
- Move 6 also implies a sibling `codehub replay` that asserts decision-equivalence structurally (same selection, not same bytes) — specced separately (011); 010 is the empirical/behavioral half.

---

## 1. Diagnosis — why this move, why now

OCH's pitch has leaned on the adjective "deterministic." That word is now contested vocabulary (LeanCTX "token receipt", Rel(AI)Build, the receipt-tool cluster all claim it). A **measured variance delta** is a number rivals can't claim by relabeling. arXiv:2606.26979 (Jun 2026) gives the citable result — deterministic anchoring halves agent run-to-run variance — and `--variance-probe` is how OCH demonstrates *its own* pack does this on a real repo, not just cites a paper.

The headline this earns: **"an OCH pack halves how much a coding agent's answer wanders run-to-run, at ~10% token cost"** — backed by a reproducible measurement the user can run.

## 2. The hard part: what is a "task"? (Laith's first question)

A "task" must be precise enough to run repeatedly and score. Definition:

> A **task** is a fixed triple `(repo @ commit, instruction, success_oracle)` run by a coding agent, where the agent's only variable input across the experiment is **whether the OCH pack is in its context**.

- **`repo @ commit`** — a pinned checkout. Frozen so the *only* variable is the pack.
- **`instruction`** — a natural-language ask given verbatim to the agent every run (e.g. "Add a `--json` flag to `codehub status` that prints the staleness record"). Stored as a string in the task file; never paraphrased between runs.
- **`success_oracle`** — how a run is scored. Three oracle types, in increasing cost:
1. **`output_hash`** (cheapest, no scoring agent) — variance = how often the agent's *final answer text / produced diff* differs across N runs. This measures raw output dispersion. Good for "did the agent converge".
2. **`assertion`** — a deterministic check the run either passes or fails (a test command exit code, a grep for a required symbol, a file-exists check). Variance = pass-rate dispersion (e.g. 6/10 with pack vs 3/10 without). This is the most defensible "variance" because it's objective.
3. **`judge`** (most expensive) — an LLM-judge panel scores each run's answer 0–1 on a rubric; variance = stddev of the scores. Reserved for tasks with no mechanical oracle.

**What "variance" means precisely:** for a task run N times in each arm (with-pack / without-pack), variance is a per-arm dispersion statistic over the N outcomes:
- `output_hash` oracle → **distinct-output ratio** = `(# distinct outputs) / N` (1.0 = every run different, 1/N = perfectly stable).
- `assertion` oracle → **failure-rate stddev** across N, or simply pass-rate (a stabler pass-rate IS lower behavioral variance).
- `judge` oracle → **stddev of rubric scores** across N.

The probe reports each arm's dispersion and the **delta** (without − with). The Move-2 claim holds when the with-pack arm's dispersion is materially lower.

## 3. The experiment design (with / without)

```
for arm in [without_pack, with_pack]:
for i in 1..N: # N = --runs, default 10
fresh agent session (no carryover)
context = instruction
+ (arm == with_pack ? the OCH code-pack for repo@commit : nothing)
run agent → capture (final_text, diff, oracle_result, tokens)
dispersion[arm] = oracleDispersion(results[arm])
report { without: dispersion[without_pack], with: dispersion[with_pack],
delta, tokenOverhead: tokens[with]/tokens[without] }
```

Controls that make the number honest:
- **Same instruction, same commit, same agent, same model** across both arms — the pack is the only manipulated variable.
- **Fresh session per run** — no conversational carryover contaminating variance.
- **Token overhead reported alongside** — the paper's claim is "halves variance at ~10% more tokens"; a probe that halves variance at 3× tokens is a different (worse) story, so cost is a first-class output, not a footnote.
- **Determinism of the probe itself** — temperature/seed are pinned where the harness allows; where it doesn't (most agents are nondeterministic by design), that nondeterminism IS the variance being measured, so it's left free *within* an arm but identical *between* arms.

## 4. Harness: how do we actually run the agent? (Laith's omnigent suggestion)

Grounded against the real repo (`omnigent-ai/omnigent`, 5,494★, Apache-2.0, Python 3.12, pushed 2026-06-30, **status: alpha**):

Omnigent is a **meta-harness** — one orchestration layer over Claude Code, Codex, Cursor, OpenCode, Kimi, and custom agents. An "agent" is a YAML (`executor.harness` + `prompt`); you run it with `omnigent run <agent.yaml> --harness <claude-sdk|codex|cursor|…>`, and a `sdks/python-client` drives it programmatically. **This is exactly the shape the probe needs**: one task definition, swap the harness flag, run headless, capture output.

**Why omnigent is the right call for this probe:**
- The variance story is far stronger if it holds **across agents** (Claude Code AND Codex), not just one — "the pack halves variance regardless of which agent reads it" is the defensible, agent-neutral claim. Omnigent is the only thing that drives both from one interface without per-agent glue.
- It already solves the headless-run + output-capture + sandbox problem the probe would otherwise hand-roll.

**The tradeoffs I'm not hiding:**
- **Alpha.** Pinning a specific commit/release is mandatory; its API may move. The probe must tolerate that (thin adapter, see §5).
- **Heavy dependency.** It pulls a server + sandbox providers. The probe should treat omnigent as an **optional, pluggable runner behind an interface**, NOT a hard dependency of `@opencodehub/eval`. Default to it; allow a simpler direct-CLI runner.
- **Credentials live in each harness's own login** (`claude` / `codex` CLI auth), not OCH config — fine for a local probe, a documented prerequisite.

**Recommendation:** define a small `AgentRunner` interface (`run(task, withPack) → {finalText, diff, tokens}`); ship an **omnigent-backed runner** as the default multi-agent implementation and a **direct-CLI runner** (shell out to `claude -p` / `codex exec`) as the dependency-light fallback. The probe logic is harness-agnostic; the runner is swappable. This keeps an alpha dep from being load-bearing while getting the cross-agent story omnigent uniquely enables.

**v1 decision (Laith, 2026-06-30):** ship the **direct-CLI runner first**. omnigent (alpha) is deferred to v2. The `AgentRunner` interface is defined now so the omnigent runner drops in later without touching the probe core.

## 4a. Bedrock inference constraint (hard requirement, Laith 2026-06-30)

Both agents the probe drives **must run inference on Amazon Bedrock**, not the Anthropic/OpenAI first-party APIs. This is grounded against current docs (`code.claude.com/docs`, `developers.openai.com/codex`), not recalled.

**Claude Code → Bedrock** — the runner sets these in the child process env before `claude -p`:
- `CLAUDE_CODE_USE_BEDROCK=1` — route Claude Code through Bedrock.
- `AWS_REGION` — e.g. `us-east-1`; credentials resolve via the default AWS SDK chain (env keys / `AWS_PROFILE` / SSO / `AWS_BEARER_TOKEN_BEDROCK`). The probe inherits the operator's AWS env; it does not manage creds.
- `ANTHROPIC_MODEL` — a **cross-region inference-profile ID** (the `us.` prefix is required for on-demand throughput), e.g. `us.anthropic.claude-sonnet-4-6`. Configurable per task/run; defaults to a sonnet inference profile.
- Headless invocation: `claude -p "<prompt>" --output-format json --model <inference-profile-id>`.
- Token usage read from the single JSON result object: `.usage.input_tokens`, `.usage.output_tokens`, `.total_cost_usd`; final answer at `.result`.
- **No temperature/seed flag exists** in the Claude Code CLI (verified). This is *consistent with* the probe design (§3): within-arm sampling nondeterminism is the variance being measured, held free within an arm and identical between arms.

**Codex → Bedrock** — Codex ships a first-party `amazon-bedrock` provider (OpenAI-compatible "Bedrock Mantle" surface, AWS-native SigV4 / bearer-token auth). The runner invokes:
- `codex exec --json -c model_provider=amazon-bedrock -m <model> --skip-git-repo-check "<prompt>"` (model e.g. `openai.gpt-5.5`).
- Auth: `AWS_BEARER_TOKEN_BEDROCK` + `AWS_REGION`, or the AWS SDK credential chain (`AWS_PROFILE` / env keys). Commercial AWS regions only (GovCloud unsupported).
- Output is JSONL: final answer = last `item.completed` of `type:"agent_message"` (its `.text`); token usage = last `turn.completed.usage` = `{input_tokens, cached_input_tokens, output_tokens, reasoning_output_tokens}` (no `total` field — the runner sums them).

The runner centralizes this Bedrock env/flag wiring per agent so the probe core stays inference-backend-agnostic; a future omnigent runner sets the same env on whichever harness it spawns.

## 5. Where it lives + shape

- **`packages/eval`** — currently a stub (README only). This is its first real content: `@opencodehub/eval` gains the variance-probe core (task loading, the experiment loop, dispersion stats, the `AgentRunner` interface + the two runner impls).
- **CLI surface:** `codehub pack --variance-probe <task-file> [--runs N] [--harness <h>] [--runner omnigent|cli]` → prints the per-arm dispersion + delta + token overhead; `--json` emits the full record.
- **Task file:** a small YAML/JSON: `{ repo, commit, instruction, oracle: {type, ...}, harness? }`.
- **Determinism of the probe's own output:** the *report* is a pure function of the captured run outcomes; no clock/run-id in the emitted record (same discipline as the context-bom), so two probe runs over the same captured outcomes serialize identically. (The agent runs themselves are nondeterministic by nature — that's the point.)

## 6. EARS requirements (draft — for review, not yet final)

- **R1** WHEN given a task file `(repo, commit, instruction, oracle)`, the probe SHALL run the agent N times (default 10) in each of the with-pack and without-pack arms, holding commit/instruction/agent/model fixed.
- **R2** The probe SHALL compute a per-arm dispersion statistic appropriate to the oracle type (distinct-output ratio / pass-rate stddev / judge-score stddev) and report the without−with delta.
- **R3** The probe SHALL report token overhead (with-pack tokens / without-pack tokens) alongside the variance delta — variance reduction is only meaningful against its cost.
- **R4** The agent runner SHALL be an interface with at least two implementations: an omnigent-backed multi-agent runner (default) and a direct-CLI runner (no omnigent dependency).
- **R5** WHERE omnigent (alpha) is unavailable or its API has drifted, the probe SHALL fall back to / be usable via the direct-CLI runner without code change to the probe core.
- **R6** The emitted `--json` report SHALL be a pure function of the captured run outcomes (no wall-clock/run-id), so the report serialization is reproducible given the same outcomes.
- **R7** The probe SHALL pin the omnigent version it was validated against and surface a clear error if the installed version mismatches. *(Deferred with the omnigent runner to v2; the version-pin guard ships when that runner does.)*
- **R8** The direct-CLI runner SHALL route both agents' inference through Amazon Bedrock (§4a): for Claude Code it SHALL set `CLAUDE_CODE_USE_BEDROCK=1`, `AWS_REGION`, and a `us.`-prefixed inference-profile `ANTHROPIC_MODEL`; for Codex it SHALL pass `-c model_provider=amazon-bedrock`. The runner SHALL surface a clear, actionable error when the agent binary is absent or Bedrock auth/region is unconfigured, rather than silently falling back to a first-party API.

## 7. Decisions (locked 2026-06-30 — answers to the prior open questions)

1. **Default oracle → `assertion`.** Objective and defensible. `output_hash` is the zero-config quick look; `judge` is opt-in for tasks with no mechanical oracle. The headline number is the assertion pass-rate dispersion delta.
2. **N (runs per arm) → 10, configurable** via `--runs`. Smallest N that gives a believable dispersion; raise for a more defensible publish at linear agent-cost.
3. **Agents → Claude Code AND Codex.** The agent-neutral claim ("the pack halves variance regardless of which agent reads it") is the point. v1 ships both via the direct-CLI runner; `--harness` selects one or the default runs the configured set.
4. **Token-overhead guardrail → yes, flag (never fail).** When variance drops but token overhead exceeds ~1.3×, the report flags "stability bought expensively." Keeps the claim honest; the threshold is a reported constant, not a gate.
5. **Omnigent vs CLI-first → CLI-runner first.** v1 = probe core + direct-CLI runner (fast, dependency-light, proves the method, Bedrock-wired per §4a). omnigent runner is v2, dropping in behind the same `AgentRunner` interface.

## 8. What this is NOT (scope guard)

- Not a SWE-bench-style correctness benchmark — it measures *dispersion*, not capability. (SWE-Explore / SWE-bench remain the publish-against targets, separate.)
- Not the `replay`/decision-equivalence structural check (spec 011, the other half of Move 6).
- Not a CI gate — it's an on-demand measurement (agent runs cost real money + minutes).
1 change: 1 addition & 0 deletions commitlint.config.mjs
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ export default {
"cobol-proleap",
"core-types",
"embedder",
"eval",
"frameworks",
"ingestion",
"mcp",
Expand Down
3 changes: 2 additions & 1 deletion packages/cli/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@
"test": "pnpm run build:test && node --test \"./dist-test/**/*.test.js\"",
"clean": "rm -rf dist dist-test *.tsbuildinfo"
},
"//deps": "The 14 @opencodehub/* workspace libs are INLINED into the bundle at build time (tsup noExternal) — they are devDependencies, not runtime deps. `dependencies` below is exactly the third-party set the bundle imports at runtime (kept `external`), plus the two @sourcegraph/scip-* indexers the parse pipeline spawns as subprocesses. onnxruntime-web (prebuilt WASM, no native binding) is optional (lazy-loaded only when embeddings are enabled).",
"//deps": "The 15 @opencodehub/* workspace libs are INLINED into the bundle at build time (tsup noExternal) — they are devDependencies, not runtime deps. `dependencies` below is exactly the third-party set the bundle imports at runtime (kept `external`), plus the two @sourcegraph/scip-* indexers the parse pipeline spawns as subprocesses. onnxruntime-web (prebuilt WASM, no native binding) is optional (lazy-loaded only when embeddings are enabled).",
"dependencies": {
"@apidevtools/swagger-parser": "12.1.0",
"@aws-sdk/client-bedrock-runtime": "3.1075.0",
Expand Down Expand Up @@ -70,6 +70,7 @@
"@opencodehub/analysis": "workspace:*",
"@opencodehub/core-types": "workspace:*",
"@opencodehub/embedder": "workspace:*",
"@opencodehub/eval": "workspace:*",
"@opencodehub/ingestion": "workspace:*",
"@opencodehub/mcp": "workspace:*",
"@opencodehub/pack": "workspace:*",
Expand Down
Loading
Loading