
Benchmark: blast radius (Track A scream + Track B agent comparison) #143

@jonathanpopham

Tracking the blast radius benchmark that will back the next deep-dive article.

Status (2026-04-23)

Track A — scream test (graph vs reality)

Done. Standalone Python harness that mutates a function, runs Django's test suite, and compares what actually breaks against `/v1/analysis/impact` predictions.

  • Scope: `django/contrib/auth/` — 18 measured targets
  • Macro F1: 56.9% (precision 44.2%, recall 94.9%)
  • Micro F1: 59.9% (precision 44.2%, recall 93.0%) — TP 159, FP 201, FN 12 (arithmetic check after this list)
  • Strongest performers: `hashers.make_password` / `hashers.get_hasher` — 100% recall, 85% precision, 92% F1
  • Known gap: 12 FN from `login_required` / `user_passes_test` decorators — parser appears to miss test methods decorated at call time
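Quick sanity check that the micro numbers follow from the raw counts:

```python
# Micro precision/recall/F1 from the reported confusion counts.
tp, fp, fn = 159, 201, 12

precision = tp / (tp + fp)                            # 159/360 ≈ 0.442
recall = tp / (tp + fn)                               # 159/171 ≈ 0.930
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.599

print(f"precision={precision:.1%} recall={recall:.1%} f1={f1:.1%}")
# precision=44.2% recall=93.0% f1=59.9%
```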

Artifacts in `benchmark/scream/`:

  • `scream_test.py` — AST-based mutation loop (rough shape sketched below)
  • `api_predict.py` — batched call to `/v1/analysis/impact`
  • `compare.py` — precision/recall/F1 (scoped + test-case-only filtering)
  • `summarize.py` — metrics → markdown
  • `results/metrics_fair.json` + `results/README.md`
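For the eventual article, the harness loop is simple. A minimal sketch of its shape (illustrative only — helper names, the failure parser, and the runner invocation are assumptions, not the real `scream_test.py`):

```python
import ast
import subprocess

def break_function(source: str, func_name: str) -> str:
    """Replace the body of `func_name` with a raise, so every genuine
    caller 'screams' when the test suite runs."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            node.body = [ast.parse("raise RuntimeError('mutated')").body[0]]
    return ast.unparse(tree)

def observed_blast_radius(django_root: str) -> set[str]:
    """Run the scoped suite and collect failing test ids."""
    proc = subprocess.run(
        ["python", "tests/runtests.py", "auth_tests", "--parallel", "1"],
        cwd=django_root, capture_output=True, text=True,
    )
    return parse_failures(proc.stdout + proc.stderr)  # hypothetical parser

# Per target: mutate -> run suite -> record reality -> restore the file.
# compare.py then scores reality against the /v1/analysis/impact output.
```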

Track B — agent comparison (blast radius context → refactor outcome)

Pilot complete; scope problem identified.

Setup: two Docker containers (`bench-br-naked`, `bench-br-supermodel`) built on Django 5.0.6 + Claude Code + Opus 4.7 (pinned via `--model claude-opus-4-7`). Task: make `authenticate(request=None, **credentials)` require `request` while keeping the tests green.
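For context, the signature under test is real Django (`django/contrib/auth/__init__.py`); the before/after below is just the shape of the change, with the migration strategy left to the agent:

```python
# Before (Django 5.0.6) — request is optional:
def authenticate(request=None, **credentials):
    ...

# After the refactor — request is required, so every call site doing
# authenticate(username=..., password=...) must be found and updated:
def authenticate(request, **credentials):
    ...
```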

Pilot run #1 (test scope = `auth_tests` only):

| condition | verdict | turns | tool calls | cost | duration | files |
|---|---|---|---|---|---|---|
| naked | PASS | 58 | 41 | $1.56 | 205s | 7 |
| supermodel | PASS | 59 | 43 | $1.77 | 203s | 7 |

Effectively tied. Opus 4.7 is capable enough to grep-and-iterate within a single subsystem, so the `BLAST_RADIUS.md` context added cost (bigger input) without changing the outcome.

Pilot run #2 in progress (test scope expanded to 13 Django subsystems, 3,293 tests — auth_tests, admin_views, admin_changelist, admin_utils, admin_inlines, sessions_tests, middleware_exceptions, generic_views, forms_tests, contenttypes_tests, handlers, view_tests, test_client_regress). Hypothesis: the naked agent will now miss callers outside `auth/`.
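Assuming the containers drive Django's stock runner, the expanded scope boils down to something like this (sketch; the real orchestration lives in `run-br.sh` / `entrypoint.br.sh`):

```python
import subprocess

SUBSYSTEMS = [
    "auth_tests", "admin_views", "admin_changelist", "admin_utils",
    "admin_inlines", "sessions_tests", "middleware_exceptions",
    "generic_views", "forms_tests", "contenttypes_tests",
    "handlers", "view_tests", "test_client_regress",
]

# --parallel 1 sidesteps the macOS pickle crash noted under known issues.
subprocess.run(
    ["python", "tests/runtests.py", *SUBSYSTEMS, "--parallel", "1"],
    check=True,
)
```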

Artifacts in `benchmark/`:

  • `Dockerfile.br-naked`, `Dockerfile.br-supermodel`
  • `entrypoint.br.sh`
  • `CLAUDE.br-naked.md`, `CLAUDE.br-supermodel.md`
  • `br_task.md` — refactor spec
  • `BLAST_RADIUS.md` — pre-computed from Track A's `predicted.json`
  • `run-br.sh` — orchestrator
  • `compare-br.sh` — results table
  • `results/br/`

Known issues to resolve

  • Expanded-scope pilot may still tie. If so, pivot to:
    • Measure cross-subsystem risk acknowledgement in agent summary (qualitative)
    • Pick a task with polymorphic / getattr-based callers (harder for grep)
    • Pin a weaker model (Sonnet 4.6) to widen the gap
  • `BLAST_RADIUS.md` contains obvious noise (migration files listed as 'callers'). Tighten filtering in the rendering script (sketch after this list).
  • Track A recall gaps on decorator targets — parser investigation.
  • Full Django test suite hits `TypeError: cannot pickle 'traceback' object` under `--parallel 4` on macOS. Workaround: `--parallel 1` and scoped subsystems. Worth filing upstream.
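For the `BLAST_RADIUS.md` noise, the rendering-script fix is probably a one-liner (sketch; the caller-record shape from `predicted.json` is an assumption):

```python
def filter_callers(callers: list[dict]) -> list[dict]:
    """Drop migration modules: they show up as 'callers' in the impact
    output but are noise when planning a refactor."""
    return [c for c in callers if "/migrations/" not in c.get("file", "")]
```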

Article plan

Deep-dive: "Your agent thinks it's editing 3 files. It's editing 145." — receipt-worthy "things that will break" number, Jenga/ripple analogy, push `/v1/analysis/impact` + `supermodel blast-radius`.

Standalone pullout (4th-grade brainrot): "Before you move one block, check who's standing on it."
