fix(agent): coerce missing booleans in judge schemas (pydantic parity) by caffeinum · Pull Request #36 · webllm/browser-use

caffeinum · 2026-05-07T20:35:10Z

Problem

bu-2-0 occasionally omits required boolean fields from structured outputs entirely. Python upstream silently defaults them to false via pydantic's lax mode; zod hard-rejects and the structured-output parse throws.

Observed on a real eval run (codesandbox.com getting-started, 2026-05-07):

WARNING [browser_use.Agent] Simple judge failed with error: [
  {
    "expected": "boolean",
    "code": "invalid_type",
    "path": [ "is_correct" ],
    "message": "Invalid input: expected boolean, received undefined"
  }
]

This crashed _run_simple_judge and skipped the success-override gate. The same pattern can hit _judge_trace's verdict field.

Fix

Wrap the two strict required booleans (SimpleJudgeSchema.is_correct and JudgeSchema.verdict) with a lenientBool preprocessor:

- undefined / null → default false (judge declines to confirm pass; fail-safe)
- "true" / "false" strings → real booleans
- everything else passes through unchanged

All other booleans in the codebase already use .default() or .optional(), so they already accept undefined — only these two strict required fields need the helper.
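
The three rules above can be sketched as a plain preprocessing step. This is an illustrative sketch, not the PR's actual helper: the names `coerceLenientBool`, `parseStrictBool` and `parseLenientBool` are hypothetical, the sketch assumes only exact lowercase `"true"` / `"false"` strings are converted, and it assumes the real `lenientBool` hands a function like this to zod (for example through `z.preprocess`) ahead of a strict boolean check.

```typescript
// Hypothetical stand-in for the preprocessing step inside lenientBool.
function coerceLenientBool(value: unknown): unknown {
  // Missing or null: the judge did not commit, so fall back to false.
  if (value === undefined || value === null) return false;
  // Exact "true" / "false" strings become real booleans (assumed to be
  // case-sensitive in this sketch).
  if (value === "true") return true;
  if (value === "false") return false;
  // Anything else is returned untouched so the strict check below can
  // still reject it.
  return value;
}

// Minimal strict check standing in for the schema's boolean validation.
function parseStrictBool(value: unknown): boolean {
  if (typeof value !== "boolean") {
    throw new TypeError(`expected boolean, received ${typeof value}`);
  }
  return value;
}

// Coerce first, then validate strictly.
function parseLenientBool(value: unknown): boolean {
  return parseStrictBool(coerceLenientBool(value));
}
```

Under these assumptions a missing `is_correct` parses to `false`, while an unrelated value such as `"yes"` still fails validation instead of being silently accepted.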

Why default `false`

The simple-judge and trace-judge both gate success → failure overrides. When the model can't commit, treating the run as not-yet-correct is the conservative choice: an unconfirmed pass becomes a fail rather than letting a bogus success slip through.
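
To make the fail-safe concrete, here is a minimal sketch of a success-override gate. The function name `applyJudgeOverride` and the `RunResult` shape are hypothetical and chosen only for illustration; the actual gate behind `_run_simple_judge` may be structured differently.

```typescript
// Hypothetical result shape: what the agent itself reported.
interface RunResult {
  success: boolean;
}

// Hypothetical gate: the judge can only downgrade a reported success,
// it never upgrades a failure.
function applyJudgeOverride(run: RunResult, isCorrect: boolean): RunResult {
  if (run.success && !isCorrect) {
    return { ...run, success: false };
  }
  return run;
}
```

With a missing verdict defaulting to `false`, a run the agent reported as successful but the judge never confirmed is downgraded to a failure, which is the conservative outcome described above.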

Pairs with

- fix(agent): feed prettified zod issues + sent params back to LLM on retry #34 (zod-error-feedback, merged): feeds zod issues back to the LLM so it can self-correct on retry
- fix(controller): coerce booleans to ints for action index fields (pydantic parity) #35 (int coercion): same pydantic-parity pattern for action index fields

Notes

- Includes prebuilt dist/ per existing fork install pattern
- Typecheck clean, all 971 unit tests pass

…f dump

RegisteredAction.promptDescription previously serialized each property of the zod object schema by stringifying its private `_def` AST. For schemas with `.default()` wrappers (e.g. ScrollActionSchema's `down` and `num_pages`), the LLM would see something like:

"num_pages": {"type":"default","innerType":{"def":{"type":"number"},...},"defaultValue":1},
"down": {"type":"default","innerType":{"def":{"type":"boolean"},...},"defaultValue":true}

The model would plausibly copy the nearby `defaultValue: true` and emit a boolean for `num_pages`. The schema correctly rejected it, the same prompt was fed back, and the same mistake recurred until `max_failures=3` tripped.

Replace the `_def` walk with `z.toJSONSchema(schema, {unrepresentable:'any'})` (zod v4 native), strip the `$schema` dialect URL, and apply the existing skipKeys filter to both `properties` and `required`. The LLM now sees:

{"type":"object","properties":{"down":{"default":true,"type":"boolean"}, "num_pages":{"default":1,"type":"number"},...}, "required":[...], "additionalProperties":false}

This is a familiar, well-known JSON Schema shape with no zod-internal leakage. The surrounding `${description}: \n{${name}: ...}` envelope is unchanged, so the LLM sees the same outer layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
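
The post-processing described in this commit (strip `$schema`, filter skipKeys out of both `properties` and `required`) can be sketched on a plain JSON Schema object. This is a hedged illustration only: `cleanSchemaForPrompt` and the `JsonSchema` type are hypothetical names, and the input is assumed to be the object that `z.toJSONSchema` would return.

```typescript
// Loose JSON Schema shape, sufficient for this sketch.
type JsonSchema = {
  $schema?: string;
  properties?: Record<string, unknown>;
  required?: string[];
  [key: string]: unknown;
};

// Hypothetical helper: prepare a JSON Schema for inclusion in an LLM prompt.
function cleanSchemaForPrompt(schema: JsonSchema, skipKeys: string[]): JsonSchema {
  const skip = new Set(skipKeys);
  const cleaned: JsonSchema = { ...schema };

  // The dialect URL carries no information the model needs.
  delete cleaned.$schema;

  // Hide skipped keys from the property list...
  if (schema.properties) {
    const visible: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(schema.properties)) {
      if (!skip.has(key)) visible[key] = value;
    }
    cleaned.properties = visible;
  }

  // ...and from `required`, so the model is never told to supply a
  // field it cannot see.
  if (schema.required) {
    cleaned.required = schema.required.filter((key) => !skip.has(key));
  }

  return cleaned;
}
```

Filtering `required` as well as `properties` matters: a key left in `required` but hidden from `properties` would ask the model for a parameter it has no description of.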

…etry

When `_validateAndNormalizeActions` rejected an action's params via `actionInfo.paramSchema.safeParse(rawParams)`, the thrown error message was the raw `paramsResult.error.message`, i.e. zod v4's default JSON dump of the `issues` array (`[{"expected":"number","code":"invalid_type", "path":["num_pages"],"message":"Invalid input: expected number, received boolean"}]`). This noisy blob did flow into `state.last_result` and into the next `create_state_messages` turn, but it was hard for the model to parse and gave no corrective hint, so the model retried with the same mistake until `max_failures=3` tripped.

Use `z.prettifyError(paramsResult.error)` (zod v4 native) to render issues as readable lines (e.g. `✖ Invalid input: expected number, received boolean → at num_pages`), include the offending params verbatim so the model can diff against the schema, and tag the message with an explicit `Schema validation failed` prefix plus a `Please retry with parameters matching the action's schema exactly` instruction.

The existing pipeline does the rest: thrown Error → `_handle_step_error` → `state.last_result = [ActionResult({error: ...})]` → next step's `create_state_messages` injects it into the LLM context. No new injection mechanism, just a better-shaped payload going through the existing one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
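
A rough sketch of the retry payload this commit describes follows. It is illustrative only: `renderIssues` is a hypothetical stand-in for the step the commit delegates to `z.prettifyError`, `buildRetryFeedback` and `IssueLike` are invented names, and the exact line layout of the real message is an assumption.

```typescript
// Minimal shape of a validation issue, enough for this sketch.
interface IssueLike {
  message: string;
  path: Array<string | number>;
}

// Hypothetical formatter: one readable line per issue.
function renderIssues(issues: IssueLike[]): string {
  return issues
    .map((issue) => {
      const where = issue.path.length > 0 ? ` → at ${issue.path.join(".")}` : "";
      return `✖ ${issue.message}${where}`;
    })
    .join("\n");
}

// Hypothetical builder for the error text that is fed back to the LLM.
function buildRetryFeedback(
  actionName: string,
  rawParams: unknown,
  issues: IssueLike[],
): string {
  return [
    `Schema validation failed for action "${actionName}":`,
    renderIssues(issues),
    // The offending params are echoed verbatim so the model can diff them
    // against the schema.
    `Params sent: ${JSON.stringify(rawParams)}`,
    "Please retry with parameters matching the action's schema exactly.",
  ].join("\n");
}
```

Echoing the params alongside the issues gives the model both the wrong value and the expected type in one message.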

So consumers can `npm install github:caffeinum/browser-use` (no #ref) and get working code without needing a `prepare` script + devDeps + pnpm in their build environment. Source of truth is still src/; dist/ should be rebuilt and re-committed on top of any new src/ change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ports the equivalent of upstream Python PR #3793 to the TS in-page DOM walker. The Python fix targeted CDP DOMSnapshot omitting layout for shadow DOM nodes; the TS port uses live DOM and gates inclusion on isElementVisible (offsetWidth/Height > 0). Custom-styled shadow widgets (e.g. auth0 login) can still report 0 dimensions on the inner control while being functional. For input/button/select/textarea/a inside a shadow root, treat the element as visible+top so isInteractiveElement runs and it lands in selector_map.
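
The visibility exception can be sketched as a small predicate. This is a simplified illustration, not the walker's real code: `ElementLike`, `inShadowRoot` and `isEffectivelyVisible` are hypothetical names, and the sketch assumes the base visibility check requires both `offsetWidth` and `offsetHeight` to be positive.

```typescript
// Tags that are treated as visible inside a shadow root even at 0x0 size.
const SHADOW_FORM_TAGS = new Set(["input", "button", "select", "textarea", "a"]);

// Structural stand-in for a DOM element, so the sketch runs outside a browser.
interface ElementLike {
  tagName: string;
  offsetWidth: number;
  offsetHeight: number;
  inShadowRoot: boolean; // hypothetical flag computed by the walker
}

function isEffectivelyVisible(el: ElementLike): boolean {
  // Normal case: the element has a layout box.
  if (el.offsetWidth > 0 && el.offsetHeight > 0) return true;
  // Exception: form-like controls inside a shadow root may report zero
  // dimensions while still being functional.
  return el.inShadowRoot && SHADOW_FORM_TAGS.has(el.tagName.toLowerCase());
}
```

Limiting the exception to a short tag list inside shadow roots keeps ordinary zero-size elements out of the selector map.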

The browser-use provider auto-flips flash_mode (service.ts:599), which selects the 15-line minimal prompt. That prompt lacked any guidance for loop avoidance, retry strategy, or autocomplete handling — gaps the fine-tuned bu-2-0 model was supposed to internalize but doesn't always. Append a short <retry_strategy> block (5 generic rules, no URL or provider matching) covering: same-action-3x, stuck-URL, dead clicks, autocomplete value mismatch, and missing credentials. Keeps the prompt small (15 -> 23 lines) without re-inflating to the full 269-line variant.

…arity)

Empirical observation: bu-2-0 (browser-use cloud LLM) occasionally emits `{input_text: {index: true, text: "..."}}` (boolean) where the schema expects an integer. Python upstream silently coerces `True -> 1` / `False -> 0` via pydantic's default lax mode. zod (TS) hard-rejects with `expected number, received boolean`, the agent retries with the same broken output, and bails at max_failures. This was the observed bail mode in production for every auth0-style form fill (daytona, zeroentropy, kernel, browserbase).

This patch ports the lax-coercion behavior to TS at the validation boundary. A `lenientInt(min)` helper preprocesses booleans into numbers before delegating to `z.number().int().min(min)`. Same shape for `lenientNumber()` covering `num_pages` / `pages` floats.

The helper is applied only to LLM-emitted index/element-index/page-count fields where pydantic's silent coercion is documented behavior. Fields where bool->0/1 would be semantically wrong (timeout, delay, max_results, coordinate_x/y) are left strict to avoid masking a different model bug.

This is a graceful-degradation patch, not a fix to the model. bu-2-0 should not emit booleans for integer fields. With this patch the agent now progresses and relies on existing retry-feedback (PR webllm#34) to self-correct on a bad index choice rather than looping to max_failures.

Helpers + per-schema regression coverage in `test/coerce-boolean-to-int.test.ts` (23 tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
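
The coerce-then-validate order can be sketched without zod as follows. `parseLenientInt` is a hypothetical plain-TypeScript stand-in for `lenientInt(min)`; the real helper is described as a preprocessor in front of `z.number().int().min(min)`, and the error messages here are invented.

```typescript
// Hypothetical stand-in for lenientInt(min): coerce booleans, then validate.
function parseLenientInt(value: unknown, min: number): number {
  // pydantic-style lax coercion: true -> 1, false -> 0.
  const coerced = typeof value === "boolean" ? Number(value) : value;

  // Everything else is validated strictly; strings and floats are rejected.
  if (typeof coerced !== "number" || !Number.isInteger(coerced)) {
    throw new TypeError(`expected integer, received ${JSON.stringify(value)}`);
  }
  if (coerced < min) {
    throw new RangeError(`expected integer >= ${min}, received ${coerced}`);
  }
  return coerced;
}
```

Note that the bound is applied after coercion: with `min = 1`, a coerced `false` (0) is still rejected, so the minimum matters as much as the coercion itself.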

Hand-captured request/response pair from a real BA invocation against the daytona.io auth0 login page. The model emits {input_text: {index: true}} which silently passes pydantic in upstream python (True -> 1) but hard-fails zod in this TS port. Useful for upstream model-bug reporting and as test data for the boolean->int coercion fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nt coercion Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bu-2-0 emits {click: {index: false}} when targeting the first element on a page (index 0). lenientInt coerces false → 0, but the previous lenientInt(1) min check rejected 0. Result: BA bailed after 11 consecutive validation failures every time the target was index 0 (e.g. GitHub OAuth Continue button).

Index 0 is a valid DOM element index per the documented schema (`index: int >= 0` upstream). Lower the minimum from lenientInt(1) to lenientInt(0) for:

- ClickElementActionSchema
- ClickElementActionIndexOnlySchema
- DropdownOptionsActionSchema
- SelectDropdownActionSchema

Update tests in coerce-boolean-to-int.test.ts and controller.test.ts to assert acceptance of index=0 instead of the previous reject.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Most clicks don't trigger a file download. The unawaited page.waitForEvent('download', { timeout: 5000 }) rejects on timeout, escaping as an unhandledRejection that kills the whole watcher process.

Attach a no-op .catch() so the timeout is treated as "no download, proceed", matching the existing perform_click behavior at the second download-wait call site (parity, not a new heuristic).

Repro: any SaaS dashboard with nested Stripe iframes (e.g. browser-use.com/settings) where a side-nav click is heuristically flagged as download-capable but is actually a route nav. Crash visible in queue logs as "unhandledRejection: page.waitForEvent: Timeout 5000ms exceeded".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
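
The failure mode and the fix can be reproduced with a plain promise. In this sketch `waitForDownload` is a hypothetical stand-in for the Playwright call (it always times out), and `startDownloadWatch` is an invented name; only the `.catch()` pattern itself is taken from the commit.

```typescript
// Stand-in for page.waitForEvent('download', { timeout }): in this sketch no
// download ever starts, so the promise always rejects after the timeout.
function waitForDownload(timeoutMs: number): Promise<string> {
  return new Promise<string>((_resolve, reject) => {
    setTimeout(() => reject(new Error(`Timeout ${timeoutMs}ms exceeded`)), timeoutMs);
  });
}

// Hypothetical wrapper: start the wait without awaiting it right away.
function startDownloadWatch(timeoutMs: number): Promise<string | null> {
  // Attaching .catch() marks the rejection as handled, so a timeout
  // resolves to null ("no download, proceed") instead of surfacing as an
  // unhandledRejection.
  return waitForDownload(timeoutMs).catch(() => null);
}
```

Without the `.catch()`, a promise that rejects while nothing is awaiting it is reported by Node.js as an unhandled rejection, which by default terminates the process in current versions.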

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bu-2-0 occasionally omits required boolean fields entirely. Python upstream silently defaults them to false via pydantic's lax mode; zod hard-rejects and the structured-output parse throws.

Observed on codesandbox.com run 2026-05-07__20-17-53__codesandbox-com-getting-started__f70cb0:

Invalid input: expected boolean, received undefined (path: is_correct)

which crashed _run_simple_judge twice and skipped the success-override gate.

Wrap the two strict required booleans (SimpleJudgeSchema.is_correct and JudgeSchema.verdict) with the lenientBool helper introduced for bu-ts. undefined/null -> false (judge declines to confirm pass), "true"/"false" strings -> real bools. All other booleans in the codebase already use .default() or .optional(), so undefined was already handled there.

Pairs with PR webllm#34 (zod-error-feedback) so the LLM can self-correct on retry.

Refs: webllm/browser-use parity for SimpleJudgeSchema / JudgeSchema

caffeinum and others added 14 commits May 5, 2026 17:56

chore(dist): rebuild dist after merging shadow DOM + flash prompt + i…

d13dc00

…nt coercion Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(dist): rebuild dist for click index=0 fix

4713d6a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(dist): rebuild dist for click download timeout fix

c351fb0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(dist): rebuild dist for lenient-bool judge schema fix

df0ceda

caffeinum closed this May 7, 2026

caffeinum mentioned this pull request May 7, 2026

fix(agent): retry judge schema validation with prettified errors #37

Closed

caffeinum wants to merge 14 commits into webllm:main from caffeinum:fix/lenient-bool-undefined
