
Fix & harden scenario_design eval pipeline, strengthen colleague info-gating (+ criteria-injection experiment)#502

Open
neh8 wants to merge 3 commits into main from claude/scenario-eval-pipeline-fixes

Conversation

neh8 commented Jun 29, 2026

Summary

Fixes and hardens the experiment/scripts/scenario_design colleague-evaluation pipeline, strengthens the colleague's information-gating behavior, and adds a written-up A/B experiment. Three commits; the diff vs main is exactly this work.

Pipeline bug fixes — Fix and harden scenario_design eval pipeline

The eval scripts were failing silently on Windows checkouts:

  • Criteria never loaded (CRLF). loadCriteria() split criteria.md on \n, leaving a trailing \r that broke the title regex, so it returned 0 criteria — every judgment and probe passed vacuously. Now normalizes CRLF.
  • Probes passed on empty verdicts. A probe whose criteria didn't resolve counted as a pass. Now a probe fails unless every targeted criterion was judged.
  • Response Format Compliance was untestable. simulate.ts/probe.ts joined the colleague's JSON array into plaintext before judging, so criterion 8 could never pass. The colleague's verbatim output is now preserved (raw) and judged directly (seed messages included).
  • Judge crashed on pre-existing probe files. Log-file discovery didn't exclude _probes.json, so judging crashed mid-run (parsing an array as a conversation). Filter now excludes it, with a guard for non-log files.
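A minimal sketch of the first two fixes, for readers who want to see the failure mode concretely. It is not the code in judge.ts or probe.ts: the heading format of criteria.md, the `Criterion` and `Verdict` shapes, and the names `parseCriteria` and `probePasses` are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real types in judge.ts / probe.ts may differ.
interface Criterion { id: number; title: string; }
interface Verdict { criterionId: number; pass: boolean; }

// Assumed heading format: "## 3. Information Gating".
// In JavaScript regexes "." does not match "\r", and "$" (without the m flag)
// only matches at end of input, so a line that still ends in "\r" never matches.
const TITLE_RE = /^##\s+(\d+)\.\s+(.+)$/;

function parseCriteria(markdown: string): Criterion[] {
  // Normalize CRLF (and lone CR) to LF before splitting into lines.
  const lines = markdown.replace(/\r\n?/g, "\n").split("\n");
  const criteria: Criterion[] = [];
  for (const line of lines) {
    const m = TITLE_RE.exec(line);
    if (m) criteria.push({ id: Number(m[1]), title: (m[2] ?? "").trim() });
  }
  return criteria;
}

// A probe passes only if every targeted criterion has a verdict and that
// verdict is a pass. Note that [].every(...) is true, which is how an empty
// verdict list can pass vacuously if the check iterates over verdicts instead.
function probePasses(targetIds: number[], verdicts: Verdict[]): boolean {
  if (targetIds.length === 0) return false; // assumption: a probe with no targets is a failure
  const byId = new Map(verdicts.map(v => [v.criterionId, v.pass] as const));
  return targetIds.every(id => byId.get(id) === true);
}
```

Iterating over the targeted criteria rather than over the returned verdicts is the design choice that matters here: a criterion that was never judged then shows up as a missing entry and fails the probe.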

Colleague behavior — Strengthen colleague information gating against vague questions

Vague/over-broad questions ("anything else?", "what's going on?", "tell me everything") were making the colleague dump key facts — an Information Gating violation and a confound for the study's process measure. Both scenarios' prompts now deflect open-ended questions to specifics while still answering direct questions. Verified: the "Vague follow-up" probe and the vague/offloader archetypes now pass info-gating on both scenarios.

Experiment — Document criteria-injection A/B experiment

Tested whether putting the passing criteria directly into the colleague prompt (replacing the crafted persona rules with the 9 raw criteria.md rules) changes results. Result: a tie — both versions ~98% multi-turn / 100% probes across both scenarios. Failure modes differ (baseline's substantive miss is Refusal-to-Draft/content-coaching; the criteria-injected version's only misses are Response Format). Full writeup, prompts, and caveats in experiment/scripts/scenario_design/experiments/criteria_injection_experiment.md.

Verification

  • After fixes, both scenarios: probes 7/7 (within the 20s latency budget) and the vague/offloader archetypes 9/9 criteria.
  • npx tsc --noEmit clean for the scenario_design scripts.

Notes

  • The live study prompt (scenarios.json) is the info-gating version; no criteria-injected content was committed to it.
  • The criteria-injected variant scenarios and a scratch generator were used only for the experiment and are intentionally not committed (git-ignored outputs/ + a _-prefixed scratch file).

Out of scope / follow-ups

  • Content coaching: in both versions the colleague tends to advise how to phrase the email, skirting Refusal-to-Draft and partly defeating the measurement. Tabled for a follow-up.
  • The experiment is n=1 per archetype/version; pass rates aren't statistically distinguishable — repeated runs would give error bars.
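As a rough illustration of what repeated runs would buy, the sketch below computes a Wilson score interval for a pass rate. It is not part of the pipeline; the function name `wilsonInterval` is hypothetical, and it assumes each run is an independent pass/fail trial, which may not hold for criteria judged within the same conversation.

```typescript
// Wilson score interval for a binomial proportion.
// passes: number of passing runs, runs: total runs, z: normal quantile
// (1.96 corresponds to a two-sided 95% interval).
function wilsonInterval(passes: number, runs: number, z = 1.96): [number, number] {
  if (runs <= 0) throw new RangeError("runs must be positive");
  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = (p + z2 / (2 * runs)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs));
  // Clamp to [0, 1] to absorb floating-point drift at the boundaries.
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Example: 9 passes out of 9 independent trials.
const [lower, upper] = wilsonInterval(9, 9);
console.log(lower.toFixed(2), upper.toFixed(2));
```

Under those assumptions, 9 passes out of 9 still leaves a 95% lower bound of about 0.70, which is why two versions at roughly 98% cannot be separated from a single run each.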

🤖 Generated with Claude Code

neh8 and others added 3 commits June 29, 2026 16:24

Fix and harden scenario_design eval pipeline

- judge.ts: normalize CRLF when parsing criteria.md so loadCriteria works on
  Windows checkouts. Previously it returned 0 criteria, so every judgment and
  probe passed vacuously.
- judge.ts: exclude _probes.json from conversation-log discovery and skip files
  without a messages array. Previously judging crashed mid-run when a probe
  results file already existed.
- probe.ts: a probe now fails unless every targeted criterion was actually
  judged, so an unresolved criterion can no longer pass vacuously on an empty
  verdict list.
- simulate.ts / probe.ts / judge.ts: preserve the colleague's verbatim JSON
  output (`raw`) and judge Response Format Compliance against it instead of the
  harness's joined plaintext, which the criterion could never satisfy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
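
The log-discovery and verbatim-output changes described in this commit could look roughly like the sketch below. It is an illustration under assumptions, not the committed code: the helper names (`selectLogFiles`, `isConversationLog`, `recordTurn`, `isJsonStringArray`), the `ColleagueTurn` shape, and the idea that the colleague replies with a JSON array of strings are all hypothetical.

```typescript
// Hypothetical shapes; the real log and turn types may differ.
interface ConversationLog { messages: unknown[]; }
interface ColleagueTurn { raw: string; text: string; }

// Keep only candidate conversation logs: JSON files that are not probe results.
// Assumption: probe results are stored in a file whose name ends in "_probes.json".
function selectLogFiles(names: string[]): string[] {
  return names.filter(n => n.endsWith(".json") && !n.endsWith("_probes.json"));
}

// Guard for parsed JSON that is not a conversation log (for example, a bare array).
function isConversationLog(value: unknown): value is ConversationLog {
  return typeof value === "object" && value !== null && !Array.isArray(value) &&
    Array.isArray((value as { messages?: unknown }).messages);
}

// Store the verbatim output next to the joined plaintext so a format check
// can still see what the colleague actually emitted.
function recordTurn(raw: string): ColleagueTurn {
  let text = raw;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (Array.isArray(parsed)) text = parsed.map(String).join("\n");
  } catch {
    // Not JSON: keep the raw string as the display text.
  }
  return { raw, text };
}

// A stand-in format check run against `raw`; the joined `text` can never satisfy it.
function isJsonStringArray(raw: string): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) && parsed.every(item => typeof item === "string");
  } catch {
    return false;
  }
}
```

Keeping both fields means the harness can go on showing the joined text while the judge reads `raw` for the format criterion.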

Strengthen colleague information gating against vague questions

Add a rule to both scenarios' colleague system prompts so open-ended questions
("anything else?", "what's going on?", "tell me everything") get deflected to a
specific question instead of triggering an information dump. Specific questions
are still answered.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Document criteria-injection A/B experiment

Supervisor-facing writeup comparing crafted behavior rules vs. injecting the raw
criteria.md into the colleague prompt (a tie at ~98% pass rate), with method,
results, failure analysis, caveats, the colleague prompts used, and the pipeline
bugs found and fixed during the work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>