Fix & harden scenario_design eval pipeline, strengthen colleague info-gating (+ criteria-injection experiment) by neh8 · Pull Request #502 · AIToolsLab/writing-tools

neh8 · 2026-06-29T22:24:30Z

Summary

Fixes and hardens the experiment/scripts/scenario_design colleague-evaluation pipeline, strengthens the colleague's information-gating behavior, and adds a written-up A/B experiment. Three commits; the diff vs main is exactly this work.

Pipeline bug fixes — Fix and harden scenario_design eval pipeline

The eval scripts were silently not working on Windows checkouts:

Criteria never loaded (CRLF). loadCriteria() split criteria.md on \n, leaving a trailing \r that broke the title regex, so it returned 0 criteria — every judgment and probe passed vacuously. Now normalizes CRLF.
Probes passed on empty verdicts. A probe whose criteria didn't resolve counted as a pass. Now a probe fails unless every targeted criterion was judged.
Response Format Compliance was untestable. simulate.ts/probe.ts joined the colleague's JSON array into plaintext before judging, so criterion 8 could never pass. The colleague's verbatim output is now preserved (raw) and judged directly (seed messages included).
Judge crashed on pre-existing probe files. Log-file discovery didn't exclude _probes.json, so judging crashed mid-run (parsing an array as a conversation). Filter now excludes it, with a guard for non-log files.
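
The CRLF failure is easy to reproduce in isolation. The sketch below is not the code from judge.ts: the `## N. Title` heading format, the function names, and the sample text are all assumptions made for illustration. It shows why a trailing `\r` defeats an anchored title regex in JavaScript (`.` does not match `\r`, and `$` without the `m` flag only matches at end of input), and how normalizing line endings first avoids it.

```typescript
interface Criterion {
  id: number;
  title: string;
}

// Assumed heading format (hypothetical): "## 8. Response Format Compliance".
const TITLE_RE = /^## (\d+)\. (.+)$/;

// Collect every line that looks like a criterion heading.
function parseTitles(lines: string[]): Criterion[] {
  const out: Criterion[] = [];
  for (const line of lines) {
    const m = TITLE_RE.exec(line);
    if (m) out.push({ id: Number(m[1]), title: m[2] ?? "" });
  }
  return out;
}

// Failure mode as described in the PR: split on "\n" only,
// so lines from a CRLF file keep a trailing "\r" and never match.
function loadCriteriaNaive(markdown: string): Criterion[] {
  return parseTitles(markdown.split("\n"));
}

// Fixed behavior: normalize CRLF to LF before splitting.
function loadCriteriaNormalized(markdown: string): Criterion[] {
  return parseTitles(markdown.replace(/\r\n/g, "\n").split("\n"));
}

// Invented sample with Windows line endings.
const crlfSample =
  "## 1. First example criterion\r\nbody\r\n## 2. Second example criterion\r\nbody\r\n";

console.log(loadCriteriaNaive(crlfSample).length); // 0
console.log(loadCriteriaNormalized(crlfSample).length); // 2
```

Because an empty criteria list makes every downstream "all criteria passed" check vacuously true, a loader like this may also be worth guarding with an explicit error when it returns zero criteria.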

Colleague behavior — Strengthen colleague information gating against vague questions

Vague/over-broad questions ("anything else?", "what's going on?", "tell me everything") were making the colleague dump key facts — an Information Gating violation and a confound for the study's process measure. Both scenarios' prompts now deflect open-ended questions to specifics while still answering direct questions. Verified: the "Vague follow-up" probe and the vague/offloader archetypes now pass info-gating on both scenarios.

Experiment — Document criteria-injection A/B experiment

Tested whether putting the passing criteria directly into the colleague prompt (replacing the crafted persona rules with the 9 raw criteria.md rules) changes results. Result: a tie — both versions ~98% multi-turn / 100% probes across both scenarios. Failure modes differ (baseline's substantive miss is Refusal-to-Draft/content-coaching; the criteria-injected version's only misses are Response Format). Full writeup, prompts, and caveats in experiment/scripts/scenario_design/experiments/criteria_injection_experiment.md.

Verification

After the fixes, both scenarios pass 7/7 probes (within the 20s latency budget), and the vague/offloader archetypes pass 9/9 criteria.
npx tsc --noEmit is clean for the scenario_design scripts.

Notes

The live study prompt (scenarios.json) is the info-gating version; no criteria-injected content was committed to it.
The criteria-injected variant scenarios and a scratch generator were used only for the experiment and are intentionally not committed (git-ignored outputs/ + a _-prefixed scratch file).

Out of scope / follow-ups

Content coaching: in both versions the colleague tends to advise how to phrase the email, skirting Refusal-to-Draft and partly defeating the measurement. Tabled for a follow-up.
The experiment ran once (n=1) per archetype and version, so the pass rates aren't statistically distinguishable; repeated runs would give error bars.
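
One way to put error bars on a pass rate once repeated runs exist is a Wilson score interval. The sketch below is an illustration only, not part of the pipeline: the function name is hypothetical, it assumes each run is an independent pass/fail trial, and it defaults to z = 1.96 for an approximate 95% interval.

```typescript
interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a binomial proportion.
// Assumes `trials` independent pass/fail runs with `passes` successes.
function wilsonInterval(passes: number, trials: number, z = 1.96): Interval {
  if (trials <= 0 || passes < 0 || passes > trials) {
    throw new RangeError("need trials > 0 and 0 <= passes <= trials");
  }
  const p = passes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  // Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
  return {
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  };
}

// Hypothetical counts, not results from this PR.
console.log(wilsonInterval(10, 10));
console.log(wilsonInterval(5, 10));
```

Even a perfect score over a small number of runs leaves a wide interval, which is why two versions at similar pass rates cannot be separated without more repetitions.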

🤖 Generated with Claude Code

- judge.ts: normalize CRLF when parsing criteria.md so loadCriteria works on Windows checkouts. Previously it returned 0 criteria, so every judgment and probe passed vacuously.
- judge.ts: exclude _probes.json from conversation-log discovery and skip files without a messages array. Previously judging crashed mid-run when a probe results file already existed.
- probe.ts: a probe now fails unless every targeted criterion was actually judged, so an unresolved criterion can no longer pass vacuously on an empty verdict list.
- simulate.ts / probe.ts / judge.ts: preserve the colleague's verbatim JSON output (`raw`) and judge Response Format Compliance against it instead of the harness's joined plaintext, which the criterion could never satisfy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
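
The vacuous-pass problem in the probe.ts item comes down to how `Array.prototype.every` treats an empty array: it returns `true`. The sketch below contrasts that failure mode with a stricter rule. The `Verdict` and `Probe` shapes, the function names, and the criterion id are hypothetical, and failing a probe that targets no criteria is an added assumption rather than something the commit states.

```typescript
// Hypothetical shapes; the real probe.ts types are not shown in this PR.
interface Verdict {
  criterionId: number;
  pass: boolean;
}

interface Probe {
  name: string;
  targetCriteria: number[];
}

// Reconstruction of the failure mode: with no matching verdicts,
// every() runs over an empty array and returns true.
function naivePassed(probe: Probe, verdicts: Verdict[]): boolean {
  return verdicts
    .filter((v) => probe.targetCriteria.includes(v.criterionId))
    .every((v) => v.pass);
}

// Stricter rule: each targeted criterion must have a verdict, and it must pass.
function probePassed(probe: Probe, verdicts: Verdict[]): boolean {
  // Assumption: a probe with no targets demonstrates nothing, so it fails.
  if (probe.targetCriteria.length === 0) return false;
  const byId = new Map<number, Verdict>();
  for (const v of verdicts) byId.set(v.criterionId, v);
  return probe.targetCriteria.every((id) => byId.get(id)?.pass === true);
}

// Hypothetical probe targeting a made-up criterion id.
const probe: Probe = { name: "Vague follow-up", targetCriteria: [3] };

console.log(naivePassed(probe, [])); // true
console.log(probePassed(probe, [])); // false
console.log(probePassed(probe, [{ criterionId: 3, pass: true }])); // true
```

Iterating over the targeted criteria rather than over the verdicts is the design point: a missing verdict then shows up as a failure instead of disappearing from the check.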

Add a rule to both scenarios' colleague system prompts so open-ended questions ("anything else?", "what's going on?", "tell me everything") get deflected to a specific question instead of triggering an information dump. Specific questions are still answered.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Supervisor-facing writeup comparing crafted behavior rules vs. injecting the raw criteria.md into the colleague prompt (a tie at ~98% pass rate), with method, results, failure analysis, caveats, the colleague prompts used, and the pipeline bugs found and fixed during the work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

neh8 and others added 3 commits June 29, 2026 16:24

neh8 wants to merge 3 commits into main from claude/scenario-eval-pipeline-fixes
