fix(agent): retry judge schema validation with prettified errors by caffeinum · Pull Request #37 · webllm/browser-use

caffeinum · 2026-05-07T21:24:28Z

Problem

bu-2-0 occasionally omits required boolean fields (is_correct, verdict) from judge structured outputs entirely. Python upstream silently defaults them to false via pydantic's lax mode; zod hard-rejects, the structured-output parse throws, and operators never learn the model misbehaved.

Observed on a real eval run (codesandbox.com getting-started, 2026-05-07):

WARNING [browser_use.Agent] Simple judge failed with error: [
  {
    "expected": "boolean",
    "code": "invalid_type",
    "path": [ "is_correct" ],
    "message": "Invalid input: expected boolean, received undefined"
  }
]
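As a rough illustration of what "prettified" feedback could look like, the sketch below flattens an issue list like the one logged above into one line per issue. The SchemaIssue interface and formatIssuesForFeedback are hypothetical names modeled on the log output. They are not zod's own types or its prettifier, and the actual format used in the PR may differ.

```typescript
// Hypothetical issue shape, modeled on the warning above (not imported from zod).
interface SchemaIssue {
  code: string;
  path: Array<string | number>;
  message: string;
  expected?: string;
}

// Flatten issues into one readable line each, suitable for a retry prompt.
function formatIssuesForFeedback(issues: SchemaIssue[]): string {
  return issues
    .map((issue) => {
      const where = issue.path.length > 0 ? issue.path.join(".") : "(root)";
      return `- ${where}: ${issue.message}`;
    })
    .join("\n");
}

// The single issue from the eval run quoted above.
const loggedIssues: SchemaIssue[] = [
  {
    expected: "boolean",
    code: "invalid_type",
    path: ["is_correct"],
    message: "Invalid input: expected boolean, received undefined",
  },
];

const feedback = formatIssuesForFeedback(loggedIssues);
console.log(feedback);
```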

The previous attempt in this branch defaulted missing booleans to false via a lenientBool helper. That hid the model bug from operators and made it impossible to distinguish "model said false" from "model said nothing."
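The information loss can be seen in a small sketch. Everything here is illustrative: defaultFalse only approximates what a default-to-false helper does (it is not the branch's lenientBool), and the reasoning key in the sample payloads is an assumption.

```typescript
// A field is either present with a boolean value, or missing altogether.
type BoolField = { kind: "present"; value: boolean } | { kind: "missing" };

// Strict read: keeps "missing" distinct from "false".
function readBool(raw: Record<string, unknown>, key: string): BoolField {
  const v = raw[key];
  return typeof v === "boolean" ? { kind: "present", value: v } : { kind: "missing" };
}

// Approximation of a default-false helper: collapses "missing" into false.
function defaultFalse(raw: Record<string, unknown>, key: string): boolean {
  const field = readBool(raw, key);
  return field.kind === "present" ? field.value : false;
}

// Hypothetical judge payloads; only is_correct comes from the PR text.
const saidFalse: Record<string, unknown> = { is_correct: false, reasoning: "wrong page" };
const saidNothing: Record<string, unknown> = { reasoning: "wrong page" };

// Both payloads look identical once defaulted ...
console.log(defaultFalse(saidFalse, "is_correct"), defaultFalse(saidNothing, "is_correct"));
// ... but the strict read still tells them apart.
console.log(readBool(saidFalse, "is_correct").kind, readBool(saidNothing, "is_correct").kind);
```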

Approach

Replace the silent default with retry-with-feedback, matching the pattern from PR #34 for action-emission retries:

- When SimpleJudgeSchema or JudgeSchema fails to parse, send the prettified zod errors back to the LLM and retry up to 2 times.
- If retries exhaust, surface the failure on the run's final ActionResult so harbor's failure_reason picks it up:
  - _run_simple_judge marks the run as failed with a [Judge schema invalid: ...] note.
  - _judge_trace synthesizes a verdict=false judgement with the schema error in failure_reason.
- Adds JudgeSchemaInvalidError (src/exceptions.ts) for the internal throw.
- A first-attempt shape check (any judge-related key present) preserves the prior graceful-skip path when the LLM returns a non-judge JSON shape entirely (e.g. an agent-step JSON in mocked component tests), so we don't regress wired-mock tests.
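The first-attempt shape check could look something like the sketch below. The key list and the function name looksLikeJudgeOutput are assumptions made for illustration; the PR only says "any judge-related key present" and does not list the keys it checks.

```typescript
// Assumed set of judge-related keys; the real list in src/agent/service.ts may differ.
const JUDGE_KEYS = ["is_correct", "verdict", "failure_reason"] as const;

// True when the payload is a plain object carrying at least one judge-related key.
function looksLikeJudgeOutput(raw: unknown): boolean {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) return false;
  const obj: object = raw;
  return JUDGE_KEYS.some((key) => key in obj);
}

// A judge payload that is missing verdict still counts as judge-shaped,
// so it would go down the retry path rather than being skipped.
console.log(looksLikeJudgeOutput({ failure_reason: "page never loaded" }));

// A hypothetical agent-step payload has no judge keys and would be skipped.
console.log(looksLikeJudgeOutput({ action: [{ click: { index: 3 } }] }));
```

A check like this depends on the key list: a judge reply carrying none of the listed keys would be skipped silently instead of retried.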

This pairs strict zod with feedback-driven self-correction instead of papering over the model bug with a default.
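A minimal sketch of the retry-with-feedback loop described above, under stated assumptions: parseSimpleJudgement is a hand-rolled stand-in for the strict zod parse, AskLLM is a hypothetical callback type, and the feedback wording is invented. Only the retry limit of 2, the JudgeSchemaInvalidError name, and the verdict=false fallback come from the PR text.

```typescript
interface Issue { path: string[]; message: string }
type ParseResult<T> = { ok: true; value: T } | { ok: false; issues: Issue[] };
interface SimpleJudgement { is_correct: boolean }

// Stand-in for a strict schema parse (the PR uses zod; this is hand-rolled).
function parseSimpleJudgement(raw: unknown): ParseResult<SimpleJudgement> {
  const value = typeof raw === "object" && raw !== null
    ? (raw as Record<string, unknown>).is_correct
    : undefined;
  if (typeof value === "boolean") return { ok: true, value: { is_correct: value } };
  return {
    ok: false,
    issues: [{ path: ["is_correct"], message: `expected boolean, received ${typeof value}` }],
  };
}

function describe(issues: Issue[]): string {
  return issues.map((i) => `${i.path.join(".")}: ${i.message}`).join("; ");
}

// Same name as the error the PR adds; the body here is an assumption.
class JudgeSchemaInvalidError extends Error {
  readonly issues: Issue[];
  constructor(issues: Issue[]) {
    super(`Judge schema invalid: ${describe(issues)}`);
    this.name = "JudgeSchemaInvalidError";
    this.issues = issues;
  }
}

// Hypothetical LLM call: takes a prompt, resolves to parsed JSON.
type AskLLM = (prompt: string) => Promise<unknown>;

const MAX_SCHEMA_RETRIES = 2; // "retry up to 2 times", so 3 attempts in total

async function judgeWithRetry(ask: AskLLM, prompt: string): Promise<SimpleJudgement> {
  let currentPrompt = prompt;
  let lastIssues: Issue[] = [];
  for (let attempt = 0; attempt <= MAX_SCHEMA_RETRIES; attempt++) {
    const result = parseSimpleJudgement(await ask(currentPrompt));
    if (result.ok) return result.value;
    lastIssues = result.issues;
    // Feed the validation errors back so the model can correct itself.
    currentPrompt =
      `${prompt}\n\nYour previous answer failed validation: ${describe(result.issues)}\n` +
      "Return corrected JSON.";
  }
  throw new JudgeSchemaInvalidError(lastIssues);
}

// Hypothetical fallback on exhaustion: a verdict=false judgement carrying the schema error.
function judgementFromSchemaFailure(err: JudgeSchemaInvalidError): { verdict: false; failure_reason: string } {
  return { verdict: false, failure_reason: err.message };
}
```

A caller would wrap judgeWithRetry in a try/catch, convert a JudgeSchemaInvalidError with judgementFromSchemaFailure, and let any other error (such as a network failure) follow its existing handling.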

Tests

test/agent-judge-schema-retry.test.ts covers:

- bu-2-0 missing-is_correct retry-then-recover
- retries exhaust → run marked failed
- network errors stay swallowed (unchanged behavior)
- JudgeSchema verdict-missing exhaustion path
- verdict self-correction on retry

Diff scope

3 files, src + tests only:

- src/agent/service.ts (retry loop + judge schemas)
- src/exceptions.ts (new JudgeSchemaInvalidError)
- test/agent-judge-schema-retry.test.ts

Supersedes the earlier closed iteration of this branch (PR #36).

…d throw on exhaust

Replaces the prior lenientBool default-false approach which silently masked bu-2-0's tendency to emit undefined for `is_correct` and `verdict` boolean fields. Defaulting to false hid the model bug from operators and left the orchestrator unable to distinguish "model emitted false" from "model emitted nothing".

Now: when SimpleJudgeSchema or JudgeSchema fails parse, we send the prettified zod errors back to the LLM and retry up to 2 times, matching PR webllm#34's pattern for action-emission retries. If retries exhaust, surface the failure on the run's final ActionResult so harbor's failure_reason picks it up: `_run_simple_judge` marks the run as failed with a `[Judge schema invalid: ...]` note; `_judge_trace` synthesizes a verdict=false judgement with the schema error in `failure_reason`.

Adds JudgeSchemaInvalidError (src/exceptions.ts) for the internal throw.

A first-attempt shape check (any judge-related key present) preserves the prior graceful-skip path when the LLM returns a non-judge JSON shape entirely (e.g. an agent-step JSON in mocked tests), so we don't regress component tests that wire one mock LLM for both agent and judge calls.

This pairs strict zod with feedback-driven self-correction (per reference_zod_pydantic_parity.md) instead of papering over the model bug with a default.

Adds test/agent-judge-schema-retry.test.ts covering: (1) bu-2-0 missing-is_correct retry-then-recover, (2) retries exhaust → run marked failed, (3) network errors stay swallowed, (4) JudgeSchema verdict-missing exhaustion path, (5) verdict self-correction on retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

caffeinum · 2026-05-07T21:32:47Z

Closing: reverting the silent-skip-on-shape-mismatch logic. Will reopen with a cleaner version.

caffeinum requested a review from unadlib as a code owner May 7, 2026 21:24

caffeinum force-pushed the fix/lenient-bool-undefined branch from 9d788ca to 925339d Compare May 7, 2026 21:25

caffeinum closed this May 7, 2026

caffeinum wants to merge 1 commit into webllm:main from caffeinum:fix/lenient-bool-undefined
