
Benchmark: blast radius (Track A scream + Track B agent comparison) #143

@jonathanpopham

Tracking the blast radius benchmark that will back the next deep-dive article.

Status (2026-04-23)

Track A — scream test (graph vs reality)

Done. Standalone Python harness that mutates a function, runs Django's test suite, and compares what actually breaks against `/v1/analysis/impact` predictions.

  • Scope: `django/contrib/auth/` — 18 measured targets
  • Macro F1: 56.9% (precision 44.2%, recall 94.9%)
  • Micro F1: 59.9% (precision 44.2%, recall 93.0%) — TP 159, FP 201, FN 12 (arithmetic check after this list)
  • Strongest performers: `hashers.make_password` / `hashers.get_hasher` — 100% recall, 85% precision, 92% F1
  • Known gap: 12 FN from `login_required` / `user_passes_test` decorators — parser appears to miss test methods decorated at call time
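Quick sanity check that the micro numbers follow from the raw counts:

```python
# Micro precision/recall/F1 from the reported confusion counts.
tp, fp, fn = 159, 201, 12

precision = tp / (tp + fp)                            # 159/360 ≈ 0.442
recall = tp / (tp + fn)                               # 159/171 ≈ 0.930
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.599

print(f"precision={precision:.1%} recall={recall:.1%} f1={f1:.1%}")
# precision=44.2% recall=93.0% f1=59.9%
```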

Artifacts in `benchmark/scream/`:

  • `scream_test.py` — AST-based mutation loop (rough shape sketched below)
  • `api_predict.py` — batched call to `/v1/analysis/impact`
  • `compare.py` — precision/recall/F1 (scoped + test-case-only filtering)
  • `summarize.py` — metrics → markdown
  • `results/metrics_fair.json` + `results/README.md`
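For the eventual article, the harness loop is simple. A minimal sketch of its shape (illustrative only — helper names, the failure parser, and the runner invocation are assumptions, not the real `scream_test.py`):

```python
import ast
import subprocess

def break_function(source: str, func_name: str) -> str:
    """Replace the body of `func_name` with a raise, so every genuine
    caller 'screams' when the test suite runs."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            node.body = [ast.parse("raise RuntimeError('mutated')").body[0]]
    return ast.unparse(tree)

def observed_blast_radius(django_root: str) -> set[str]:
    """Run the scoped suite and collect failing test ids."""
    proc = subprocess.run(
        ["python", "tests/runtests.py", "auth_tests", "--parallel", "1"],
        cwd=django_root, capture_output=True, text=True,
    )
    return parse_failures(proc.stdout + proc.stderr)  # hypothetical parser

# Per target: mutate -> run suite -> record reality -> restore the file.
# compare.py then scores reality against the /v1/analysis/impact output.
```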

Track B — agent comparison (blast radius context → refactor outcome)

Pilot complete; scope problem identified.

Setup: two Docker containers (`bench-br-naked`, `bench-br-supermodel`) built on Django 5.0.6 + Claude Code + Opus 4.7 (pinned via `--model claude-opus-4-7`). Task: make `authenticate(request=None, **credentials)` require `request` while keeping the tests green.
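For context, the signature under test is real Django (`django/contrib/auth/__init__.py`); the before/after below is just the shape of the change, with the migration strategy left to the agent:

```python
# Before (Django 5.0.6) — request is optional:
def authenticate(request=None, **credentials):
    ...

# After the refactor — request is required, so every call site doing
# authenticate(username=..., password=...) must be found and updated:
def authenticate(request, **credentials):
    ...
```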

Pilot run #1 (test scope = `auth_tests` only):

| condition | verdict | turns | tool calls | cost | duration | files |
|---|---|---|---|---|---|---|
| naked | PASS | 58 | 41 | $1.56 | 205s | 7 |
| supermodel | PASS | 59 | 43 | $1.77 | 203s | 7 |

Effectively tied. Opus 4.7 is capable enough to grep-and-iterate within a single subsystem, so the `BLAST_RADIUS.md` context added cost (bigger input) without changing the outcome.

Pilot run #2 in progress (test scope expanded to 13 Django subsystems, 3,293 tests — auth_tests, admin_views, admin_changelist, admin_utils, admin_inlines, sessions_tests, middleware_exceptions, generic_views, forms_tests, contenttypes_tests, handlers, view_tests, test_client_regress). Hypothesis: the naked agent will now miss callers outside `auth/`.
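Assuming the containers drive Django's stock runner, the expanded scope boils down to something like this (sketch; the real orchestration lives in `run-br.sh` / `entrypoint.br.sh`):

```python
import subprocess

SUBSYSTEMS = [
    "auth_tests", "admin_views", "admin_changelist", "admin_utils",
    "admin_inlines", "sessions_tests", "middleware_exceptions",
    "generic_views", "forms_tests", "contenttypes_tests",
    "handlers", "view_tests", "test_client_regress",
]

# --parallel 1 sidesteps the macOS pickle crash noted under known issues.
subprocess.run(
    ["python", "tests/runtests.py", *SUBSYSTEMS, "--parallel", "1"],
    check=True,
)
```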

Artifacts in `benchmark/`:

  • `Dockerfile.br-naked`, `Dockerfile.br-supermodel`
  • `entrypoint.br.sh`
  • `CLAUDE.br-naked.md`, `CLAUDE.br-supermodel.md`
  • `br_task.md` — refactor spec
  • `BLAST_RADIUS.md` — pre-computed from Track A's `predicted.json`
  • `run-br.sh` — orchestrator
  • `compare-br.sh` — results table
  • `results/br/`

Known issues to resolve

  • Expanded-scope pilot may still tie. If so, pivot to:
    • Measure cross-subsystem risk acknowledgement in agent summary (qualitative)
    • Pick a task with polymorphic / getattr-based callers (harder for grep)
    • Pin a weaker model (Sonnet 4.6) to widen the gap
  • `BLAST_RADIUS.md` contains obvious noise (migration files listed as 'callers'). Tighten filtering in the rendering script (sketch after this list).
  • Track A recall gaps on decorator targets — parser investigation.
  • Full Django test suite hits `TypeError: cannot pickle 'traceback' object` under `--parallel 4` on macOS. Workaround: `--parallel 1` and scoped subsystems. Worth filing upstream.
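For the `BLAST_RADIUS.md` noise, the rendering-script fix is probably a one-liner (sketch; the caller-record shape from `predicted.json` is an assumption):

```python
def filter_callers(callers: list[dict]) -> list[dict]:
    """Drop migration modules: they show up as 'callers' in the impact
    output but are noise when planning a refactor."""
    return [c for c in callers if "/migrations/" not in c.get("file", "")]
```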

Article plan

Deep-dive: "Your agent thinks it's editing 3 files. It's editing 145." — receipt-worthy "things that will break" number, Jenga/ripple analogy, push `/v1/analysis/impact` + `supermodel blast-radius`.

Standalone pullout (4th-grade brainrot): "Before you move one block, check who's standing on it."
