Fix & harden scenario_design eval pipeline, strengthen colleague info-gating (+ criteria-injection experiment) #502
Open
neh8 wants to merge 3 commits into
- judge.ts: normalize CRLF when parsing criteria.md so loadCriteria works on Windows checkouts. Previously it returned 0 criteria, so every judgment and probe passed vacuously.
- judge.ts: exclude _probes.json from conversation-log discovery and skip files without a messages array. Previously judging crashed mid-run when a probe results file already existed.
- probe.ts: a probe now fails unless every targeted criterion was actually judged, so an unresolved criterion can no longer pass vacuously on an empty verdict list.
- simulate.ts / probe.ts / judge.ts: preserve the colleague's verbatim JSON output (`raw`) and judge Response Format Compliance against it instead of the harness's joined plaintext, which the criterion could never satisfy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
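The first three fixes in this commit message can be illustrated with a minimal sketch. It is not the code from judge.ts or probe.ts: the `Criterion` and `Verdict` shapes, the `## <title>` heading format, the `endsWith("_probes.json")` match, and the function names `parseCriteria`, `probePasses`, and `isConversationLog` are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real types in judge.ts / probe.ts are not shown in this PR text.
interface Criterion { title: string; body: string; }
interface Verdict { criterion: string; pass: boolean; }

// Assumes criteria.md marks each criterion with a "## <title>" heading.
// Without the CRLF normalization, a trailing "\r" stops /^## (.+)$/ from matching
// (in JavaScript, "." does not match "\r"), so zero criteria would be returned.
function parseCriteria(markdown: string): Criterion[] {
  const lines = markdown.replace(/\r\n?/g, "\n").split("\n");
  const criteria: Criterion[] = [];
  let current: Criterion | null = null;
  for (const line of lines) {
    const match = /^## (.+)$/.exec(line);
    if (match) {
      current = { title: match[1].trim(), body: "" };
      criteria.push(current);
    } else if (current) {
      current.body += line + "\n";
    }
  }
  return criteria;
}

// A probe passes only if every targeted criterion has a verdict and that verdict passed.
// A targeted criterion with no verdict counts as a failure instead of passing vacuously.
function probePasses(targeted: string[], verdicts: Verdict[]): boolean {
  return targeted.every((name) => {
    const verdict = verdicts.find((v) => v.criterion === name);
    return verdict !== undefined && verdict.pass;
  });
}

// Assumed discovery guard: skip probe result files and anything without a messages array.
function isConversationLog(fileName: string, parsed: unknown): boolean {
  if (fileName.endsWith("_probes.json")) return false;
  if (typeof parsed !== "object" || parsed === null) return false;
  return Array.isArray((parsed as { messages?: unknown }).messages);
}
```

Normalizing line endings once, before splitting, keeps the title regex unchanged and makes the parser independent of how the checkout was configured.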
Add a rule to both scenarios' colleague system prompts so open-ended questions
("anything else?", "what's going on?", "tell me everything") get deflected to a
specific question instead of triggering an information dump. Specific questions
are still answered.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Supervisor-facing writeup comparing crafted behavior rules vs. injecting the raw criteria.md into the colleague prompt (a tie at ~98% pass rate), with method, results, failure analysis, caveats, the colleague prompts used, and the pipeline bugs found and fixed during the work.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Fixes and hardens the `experiment/scripts/scenario_design` colleague-evaluation pipeline, strengthens the colleague's information-gating behavior, and adds a written-up A/B experiment. Three commits; the diff vs `main` is exactly this work.

Pipeline bug fixes: Fix and harden scenario_design eval pipeline
The eval scripts were silently not working on Windows checkouts:

- `loadCriteria()` split `criteria.md` on `\n`, leaving a trailing `\r` that broke the title regex, so it returned 0 criteria and every judgment and probe passed vacuously. It now normalizes CRLF.
- `simulate.ts` / `probe.ts` joined the colleague's JSON array into plaintext before judging, so criterion 8 (Response Format Compliance) could never pass. The colleague's verbatim output is now preserved (`raw`) and judged directly (seed messages included).
- Conversation-log discovery also picked up `_probes.json`, so judging crashed mid-run (parsing an array as a conversation). The filter now excludes it, with a guard for non-log files.

Colleague behavior: Strengthen colleague information gating against vague questions
Vague or over-broad questions ("anything else?", "what's going on?", "tell me everything") were making the colleague dump key facts, which is an Information Gating violation and a confound for the study's process measure. Both scenarios' prompts now deflect open-ended questions to specifics while still answering direct questions. Verified: the "Vague follow-up" probe and the `vague` / `offloader` archetypes now pass info-gating on both scenarios.

Experiment: Document criteria-injection A/B experiment
Tested whether putting the passing criteria directly into the colleague prompt (replacing the crafted persona rules with the 9 raw `criteria.md` rules) changes results. Result: a tie. Both versions score ~98% multi-turn / 100% probes across both scenarios. Failure modes differ: the baseline's substantive miss is Refusal-to-Draft/content-coaching, while the criteria-injected version's only misses are Response Format. Full writeup, prompts, and caveats are in `experiment/scripts/scenario_design/experiments/criteria_injection_experiment.md`.

Verification
- … `vague` / `offloader` archetypes 9/9 criteria.
- `npx tsc --noEmit` clean for the scenario_design scripts.

Notes

- … `scenarios.json`) is the info-gating version; no criteria-injected content was committed to it.
- … `outputs/` + a `_`-prefixed scratch file).

Out of scope / follow-ups
🤖 Generated with Claude Code