fix(agent): retry judge schema validation with prettified errors #37

Closed
caffeinum wants to merge 1 commit into webllm:main from caffeinum:fix/lenient-bool-undefined

Conversation

@caffeinum
Contributor

Problem

bu-2-0 occasionally omits required boolean fields (is_correct, verdict) from judge structured outputs entirely. Python upstream silently defaults them to false via pydantic's lax mode; zod hard-rejects and the structured-output parse throws — and operators never learn the model misbehaved.

Observed on a real eval run (codesandbox.com getting-started, 2026-05-07):

WARNING [browser_use.Agent] Simple judge failed with error: [
  {
    "expected": "boolean",
    "code": "invalid_type",
    "path": [ "is_correct" ],
    "message": "Invalid input: expected boolean, received undefined"
  }
]

The previous attempt in this branch defaulted missing booleans to false via a lenientBool helper. That hid the model bug from operators and made it impossible to distinguish "model said false" from "model said nothing."
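
The ambiguity can be shown with a small sketch. `lenientBool` here is a guess at what such a helper does (its real implementation is not shown in this PR), and `checkBool` is a hypothetical hand-written stand-in for zod's strict boolean check, used so the example has no dependencies.

```typescript
// Hypothetical reconstruction of a lenient helper: anything that is not a
// boolean, including a missing field, collapses to false.
function lenientBool(value: unknown): boolean {
  return typeof value === "boolean" ? value : false;
}

// Hypothetical strict check, hand-written so the sketch has no dependencies.
// Returns null when the value is a boolean, otherwise an error message.
function checkBool(value: unknown): string | null {
  return typeof value === "boolean"
    ? null
    : `Invalid input: expected boolean, received ${typeof value}`;
}

const saidFalse: Record<string, unknown> = { is_correct: false };
const saidNothing: Record<string, unknown> = {};

// Lenient: both outputs produce the same value, so the two cases
// cannot be told apart downstream.
const lenientResults = [
  lenientBool(saidFalse["is_correct"]),
  lenientBool(saidNothing["is_correct"]),
];

// Strict: only the missing field produces an error.
const strictErrors = [
  checkBool(saidFalse["is_correct"]),
  checkBool(saidNothing["is_correct"]),
];
```

With the lenient helper both entries of `lenientResults` are `false`; with the strict check only the second entry of `strictErrors` carries a message, which is the signal the retry approach relies on.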

Approach

Replace the silent default with retry-with-feedback, matching the pattern from PR #34 for action-emission retries:

  1. When SimpleJudgeSchema or JudgeSchema fails to parse, send the prettified zod errors back to the LLM and retry up to 2 times.
  2. If retries exhaust, surface the failure on the run's final ActionResult so harbor's failure_reason picks it up:
    • _run_simple_judge marks the run as failed with a [Judge schema invalid: ...] note.
    • _judge_trace synthesizes a verdict=false judgement with the schema error in failure_reason.
  3. Adds JudgeSchemaInvalidError (src/exceptions.ts) for the internal throw.
  4. A first-attempt shape check (any judge-related key present) preserves the prior graceful-skip path when the LLM returns a non-judge JSON shape entirely (e.g. an agent-step JSON in mocked component tests), so we don't regress wired-mock tests.
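
The shape check in item 4 might look roughly like the sketch below. The key list and the function name `looksLikeJudgeOutput` are assumptions for illustration: the PR only says "any judge-related key present" and does not enumerate the keys.

```typescript
// Assumed key list: the PR says "any judge-related key" without listing them.
const JUDGE_KEYS: string[] = ["is_correct", "verdict", "failure_reason"];

// True when the parsed JSON has at least one judge-related key, i.e. the
// model attempted a judgement and a retry with feedback is worthwhile.
function looksLikeJudgeOutput(raw: unknown): boolean {
  if (typeof raw !== "object" || raw === null) {
    return false;
  }
  const obj = raw as Record<string, unknown>;
  return JUDGE_KEYS.some((key) => key in obj);
}

// Hypothetical agent-step JSON, as a shared mock LLM might return it.
const agentStep = { action: [{ done: { success: true } }] };
// Judge-shaped output that is still invalid because `verdict` is missing.
const partialJudge = { failure_reason: "" };

const skipPath = !looksLikeJudgeOutput(agentStep);    // graceful skip
const retryPath = looksLikeJudgeOutput(partialJudge); // retry with feedback
```

Under these assumptions, only output that looks like an attempted judgement enters the retry loop; anything else falls through to the earlier skip behavior.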

This pairs strict zod with feedback-driven self-correction instead of papering over the model bug with a default.
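
A minimal sketch of the retry-with-feedback loop follows. It is not the code in src/agent/service.ts: `judgeWithRetry`, `AskModel`, `ParseResult`, `prettify`, and `parseSimpleJudge` are hypothetical names, the parser is hand-written in place of zod so the example runs without dependencies, and this `JudgeSchemaInvalidError` only borrows the name of the class the PR adds.

```typescript
// Illustrative types; the real schemas carry more fields.
interface Issue { path: string[]; message: string; }
type ParseResult<T> =
  | { status: "ok"; data: T }
  | { status: "invalid"; issues: Issue[] };
interface SimpleJudgement { is_correct: boolean; }

// Turn issues into one readable line that can be sent back to the model.
function prettify(issues: Issue[]): string {
  return issues.map((i) => `${i.path.join(".")}: ${i.message}`).join("; ");
}

// Same name as the PR's error class, but this body is an assumption.
class JudgeSchemaInvalidError extends Error {
  issues: Issue[];
  constructor(issues: Issue[]) {
    super(`Judge schema invalid: ${prettify(issues)}`);
    this.name = "JudgeSchemaInvalidError";
    this.issues = issues;
  }
}

// Hand-written stand-in for strict schema parsing of the simple judge output.
function parseSimpleJudge(raw: unknown): ParseResult<SimpleJudgement> {
  const obj: Record<string, unknown> =
    typeof raw === "object" && raw !== null ? (raw as Record<string, unknown>) : {};
  const value = obj["is_correct"];
  if (typeof value === "boolean") {
    return { status: "ok", data: { is_correct: value } };
  }
  return {
    status: "invalid",
    issues: [{
      path: ["is_correct"],
      message: `Invalid input: expected boolean, received ${typeof value}`,
    }],
  };
}

// The model call receives null on the first attempt and feedback on retries.
type AskModel = (feedback: string | null) => Promise<unknown>;

async function judgeWithRetry<T>(
  ask: AskModel,
  parse: (raw: unknown) => ParseResult<T>,
  maxRetries: number = 2,
): Promise<T> {
  let feedback: string | null = null;
  let lastIssues: Issue[] = [];
  // One initial attempt plus up to maxRetries retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = parse(await ask(feedback));
    if (result.status === "ok") {
      return result.data;
    }
    lastIssues = result.issues;
    feedback =
      `Your previous response failed schema validation: ${prettify(result.issues)}. ` +
      "Return the full JSON object again with every required field.";
  }
  // Retries exhausted: throw so the caller can mark the run as failed.
  throw new JudgeSchemaInvalidError(lastIssues);
}
```

A caller would wrap `judgeWithRetry(ask, parseSimpleJudge)` in a `try`/`catch` and, on `JudgeSchemaInvalidError`, record the message on the run's final result rather than substituting a default value.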

Tests

test/agent-judge-schema-retry.test.ts covers:

  1. bu-2-0 missing-is_correct retry-then-recover
  2. retries exhaust → run marked failed
  3. network errors stay swallowed (unchanged behavior)
  4. JudgeSchema verdict-missing exhaustion path
  5. verdict self-correction on retry

Diff scope

3 files, src + tests only:

  • src/agent/service.ts (retry loop + judge schemas)
  • src/exceptions.ts (new JudgeSchemaInvalidError)
  • test/agent-judge-schema-retry.test.ts

Supersedes the earlier closed iteration of this branch (PR #36).

caffeinum requested a review from unadlib as a code owner on May 7, 2026 21:24
…d throw on exhaust

Replaces the prior lenientBool default-false approach which silently
masked bu-2-0's tendency to emit undefined for `is_correct` and
`verdict` boolean fields. Defaulting to false hid the model bug from
operators and left the orchestrator unable to distinguish "model
emitted false" from "model emitted nothing".

Now: when SimpleJudgeSchema or JudgeSchema fails parse, we send the
prettified zod errors back to the LLM and retry up to 2 times,
matching PR webllm#34's pattern for action-emission retries. If retries
exhaust, surface the failure on the run's final ActionResult so
harbor's failure_reason picks it up — `_run_simple_judge` marks the
run as failed with a `[Judge schema invalid: ...]` note;
`_judge_trace` synthesizes a verdict=false judgement with the
schema error in `failure_reason`. Adds JudgeSchemaInvalidError
(src/exceptions.ts) for the internal throw.

A first-attempt shape check (any judge-related key present) preserves
the prior graceful-skip path when the LLM returns a non-judge JSON
shape entirely (e.g. an agent-step JSON in mocked tests), so we don't
regress component tests that wire one mock LLM for both agent and
judge calls.

This pairs strict zod with feedback-driven self-correction (per
reference_zod_pydantic_parity.md) instead of papering over the model
bug with a default. Adds test/agent-judge-schema-retry.test.ts
covering: (1) bu-2-0 missing-is_correct retry-then-recover, (2)
retries exhaust → run marked failed, (3) network errors stay
swallowed, (4) JudgeSchema verdict-missing exhaustion path, (5)
verdict self-correction on retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
caffeinum force-pushed the fix/lenient-bool-undefined branch from 9d788ca to 925339d on May 7, 2026 21:25
@caffeinum
Contributor Author

Closing: reverting the silent-skip-on-shape-mismatch logic. Will reopen with a cleaner version.

caffeinum closed this on May 7, 2026