From 8ac63358940d66d997526ad678a22cfa4f4b3fb8 Mon Sep 17 00:00:00 2001
From: Laith Al-Saadoon
Date: Tue, 30 Jun 2026 17:31:23 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20finding=200001=20=E2=80=94=20OCH=20pack?=
 =?UTF-8?q?=20cuts=20agent=20token=20usage=202=E2=80=934=C3=97=20(live=20M?=
 =?UTF-8?q?ove=202=20data)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

First live measurement from the Move 2 variance probe on Bedrock. Two
tasks × N=5 × Claude Sonnet 4.5 against an isolated @opencodehub/policy
snapshot: the pack cut total token usage 2.18×–4.08× and cost 1.9×–3.3×,
driven almost entirely by cache tokens (exploration the agent skipped
because the pack handed it the structure). Output tokens barely moved,
so the saving is exploration-avoided, not shorter answers.

Reframes the headline from variance to token efficiency: output_hash
dispersion came back null because it compares answer TEXT and a frontier
model rephrases prose every run (saturated) — the variance question
waits on the judge oracle. Token efficiency is a directly-measured
resource number with no saturation, and it replicated across both task
regimes.

Scoped honestly as preliminary (N=5, 1 repo, 1 agent, 2 tasks — a
signal, not a benchmark) with a reproduce recipe and next steps.
---
 docs/findings/0001-pack-token-efficiency.md | 100 ++++++++++++++++++++
 1 file changed, 100 insertions(+)
 create mode 100644 docs/findings/0001-pack-token-efficiency.md

diff --git a/docs/findings/0001-pack-token-efficiency.md b/docs/findings/0001-pack-token-efficiency.md
new file mode 100644
index 0000000..75cd137
--- /dev/null
+++ b/docs/findings/0001-pack-token-efficiency.md
@@ -0,0 +1,100 @@
+# Finding 0001 — An OCH pack cuts a coding agent's token usage 2–4× on real tasks
+
+- Status: **Preliminary** — first live measurement, 2026-06-30.
+- Author: Bonk + Laith.
+- Instrument: `codehub code-pack --variance-probe` (Move 2, spec 010), the
+  direct-CLI runner on Amazon Bedrock.
+- Scope guard: 2 tasks × 5 runs/arm × 1 agent (Claude Code, Sonnet 4.5) × 1
+  repo. This is a signal, not a benchmark. See "What this is NOT" below.
+
+## The headline
+
+Giving a coding agent an OpenCodeHub pack (the symbol skeleton + file-tree +
+deps + xrefs map) instead of letting it explore the repo cut its **total token
+usage 2.18×–4.08×** and its **dollar cost 1.9×–3.3×** across two tasks — while
+producing the same quality of answer. The agent stopped re-reading files and
+running tools to reconstruct structure it was handed up front.
+
+This is the opposite direction from the variance-anchoring literature
+(arXiv:2606.26979, "deterministic anchoring halves run-to-run variance at ~10%
+*more* tokens"). OCH's pack does not *add* context on top of exploration — it
+*replaces* the exploration. On a structure-discovery task that makes it
+cheaper, not dearer.
+
+## The measurement
+
+Two tasks against an isolated snapshot of `@opencodehub/policy` (4 source files,
+indexed to 106 graph nodes / 181 edges; pack `9fe66179`). Each task ran the
+agent 5 times with the pack in context and 5 times without, holding
+commit / instruction / agent / model fixed. Token totals include the cached
+system prompt Claude Code injects per call (`cache_creation` + `cache_read`),
+which dominates the count and was the subject of the bug fix in PR #271.
+
+| Task | Arm | Total tokens | of which cache | Cost (5 runs) |
+|---|---|---:|---:|---:|
+| **A.** open-ended: "name the exact files + symbols to edit to add a `max_file_count` rule type" | without pack | 658,318 | 653,571 | $0.6412 |
+| | with pack | 161,285 | 157,349 | $0.1965 |
+| | **delta** | **4.08× fewer** | | **3.26× cheaper** |
+| **B.** enumeration: "list every exported function and type" | without pack | 623,098 | 617,267 | $0.6969 |
+| | with pack | 286,379 | 283,049 | $0.3644 |
+| | **delta** | **2.18× fewer** | | **1.91× cheaper** |
+
+The reduction is almost entirely **cache tokens** — the tokens the agent spends
+reading files and running tools to reconstruct the codebase's shape. Without
+the pack the agent burned 617K–654K such tokens per arm; with the pack, handed
+the structure directly, it spent 157K–283K and stopped hunting. Output tokens
+(the answer itself) barely moved (3.3K–5.6K), confirming the saving comes from
+*exploration avoided*, not *shorter answers*.
+
+## Why "variance" was the wrong headline
+
+Move 2 was specced to measure run-to-run answer *variance* (does the pack make
+the agent's answer wander less?). On these tasks the `output_hash` dispersion
+metric came back **null** (delta 0, and −0.2 on Task B — noise at N=5). The
+reason is mechanical, not a pack failure: `output_hash` compares answer *text*,
+and a frontier model rephrases a free-text answer slightly every run, so every
+answer hashes as distinct regardless of context. Measuring *decision*
+convergence on prose needs the `judge` oracle (semantic-equivalence scoring),
+which the CLI does not yet wire — tracked as the next gap.
+
+Token efficiency, by contrast, is a directly measured resource number with no
+such saturation problem — and it replicated cleanly across both task regimes.
+It is the more defensible claim.
+
+## What this is NOT
+
+- **Not a benchmark.** N=5, one small repo, one agent, one model. The 2–4×
+  range is a real signal on these tasks, not a published figure. A defensible
+  number needs more tasks, more repos, larger N, and the second agent (Codex).
+- **Not a variance result.** The variance question is still open pending the
+  judge oracle (see above).
+- **Not a correctness claim.** The probe did not score answer correctness here
+  (the `output_hash` oracle only checks textual identity). The token saving is
+  real; "same quality" is an eyeball judgment on the answers, not a graded one.
+
+## Reproduce
+
+```
+# 1. Analyze a target repo so a pack can be generated.
+codehub analyze /path/to/repo --no-scan
+
+# 2. Write a task file (see packages/eval/examples/variance-task.yaml).
+# 3. Run the probe (Claude on Bedrock, instance-role creds):
+CLAUDE_CODE_USE_BEDROCK=1 AWS_REGION=us-east-1 \
+  codehub code-pack --variance-probe task.yaml \
+    --runs 5 --harness claude \
+    --model-claude us.anthropic.claude-sonnet-4-5-20250929-v1:0 --json
+```
+
+The emitted JSON reports per-arm `tokens` (`inputTokens` + `outputTokens` +
+`cacheTokens`) and `tokenOverhead` (with/without total); a value below 1.0
+means the pack reduced tokens.
+
+## Next
+
+1. Wire a `JudgeScorer` into `runVarianceProbe` so the `judge` oracle works
+   end-to-end — unblocks the variance measurement on open-ended tasks.
+2. Scale the token measurement: more tasks (build/fix/explain regimes), a
+   second repo, the Codex arm, larger N — turn the 2–4× signal into a figure.
+3. Revisit `DEFAULT_CLAUDE_MODEL` (`us.anthropic.claude-sonnet-4-6`): not
+   confirmed available in the test account; sonnet-4-5 was used.