fix(agent): coerce missing booleans in judge schemas (pydantic parity) #36

Closed
caffeinum wants to merge 14 commits into webllm:main from caffeinum:fix/lenient-bool-undefined

@caffeinum
Contributor

Problem

bu-2-0 occasionally omits required boolean fields from structured outputs entirely. Python upstream silently defaults them to false via pydantic's lax mode; zod hard-rejects and the structured-output parse throws.

Observed on a real eval run (codesandbox.com getting-started, 2026-05-07):

WARNING [browser_use.Agent] Simple judge failed with error: [
  {
    "expected": "boolean",
    "code": "invalid_type",
    "path": [ "is_correct" ],
    "message": "Invalid input: expected boolean, received undefined"
  }
]

This crashed _run_simple_judge and skipped the success-override gate. The same pattern can hit _judge_trace's verdict field.

Fix

Wrap the two strict required booleans (SimpleJudgeSchema.is_correct and JudgeSchema.verdict) with a lenientBool preprocessor:

  • undefined / null → default false (judge declines to confirm pass — fail-safe)
  • "true" / "false" strings → real booleans
  • everything else passes through unchanged

All other booleans in the codebase already use .default() or .optional(), so they already accept undefined — only these two strict required fields need the helper.
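
The three coercion rules above can be sketched as a plain preprocessing function. This is a minimal illustration, not the PR's actual helper: `coerceLenientBool` and `parseLenientBool` are hypothetical names, and the strict check is written by hand so the sketch runs without dependencies. With zod, the first function would presumably be passed to `z.preprocess(coerceLenientBool, z.boolean())`.

```typescript
// Hypothetical preprocessing step for a lenient boolean field (assumed name).
// With zod this would typically be wired as:
//   z.preprocess(coerceLenientBool, z.boolean())
function coerceLenientBool(value: unknown): unknown {
  // Missing field: fail-safe default, the judge has not confirmed a pass.
  if (value === undefined || value === null) return false;
  // String-encoded booleans become real booleans.
  if (value === "true") return true;
  if (value === "false") return false;
  // Anything else is returned unchanged so the strict check can still reject it.
  return value;
}

// Hand-written stand-in for the strict boolean check that runs after preprocessing.
function parseLenientBool(value: unknown): boolean {
  const coerced = coerceLenientBool(value);
  if (typeof coerced !== "boolean") {
    throw new TypeError(`expected boolean, received ${typeof coerced}`);
  }
  return coerced;
}
```

Because unrecognized values pass through unchanged, an input such as `"yes"` or `1` is still rejected by the strict check; only the missing-field and string-boolean cases are relaxed.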

Why default false

The simple-judge and trace-judge both gate success → failure overrides. When the model can't commit, treating the run as not-yet-correct is the conservative choice: an unconfirmed pass becomes a fail rather than letting a bogus success slip through.

Pairs with

Notes

  • Includes prebuilt dist/ per existing fork install pattern
  • Typecheck clean, all 971 unit tests pass

caffeinum and others added 14 commits May 5, 2026 17:56
…f dump

RegisteredAction.promptDescription previously serialized each property of
the zod object schema by stringifying its private `_def` AST. For schemas
with `.default()` wrappers (e.g. ScrollActionSchema's `down` and
`num_pages`), the LLM would see something like:

  "num_pages": {"type":"default","innerType":{"def":{"type":"number"},...},"defaultValue":1},
  "down":      {"type":"default","innerType":{"def":{"type":"boolean"},...},"defaultValue":true}

The model would plausibly copy the nearby `defaultValue: true` and emit a
boolean for `num_pages`. The schema correctly rejected, the same prompt
was fed back, and the same mistake recurred until `max_failures=3` tripped.

Replace the `_def` walk with `z.toJSONSchema(schema, {unrepresentable:'any'})`
(zod v4 native), strip the `$schema` dialect URL, and apply the existing
skipKeys filter to both `properties` and `required`. The LLM now sees:

  {"type":"object","properties":{"down":{"default":true,"type":"boolean"},
   "num_pages":{"default":1,"type":"number"},...}, "required":[...],
   "additionalProperties":false}

— a familiar, well-known JSON Schema shape with no zod-internal leakage.

Surrounding `${description}: \n{${name}: ...}` envelope is unchanged so the
LLM sees the same outer layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
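
The post-processing described in the commit message above (strip the `$schema` dialect URL, then apply the skipKeys filter to both `properties` and `required`) might look roughly like the sketch below. It operates on an already-exported JSON Schema object, so it does not call zod itself; `cleanSchemaForPrompt` and the `JsonSchemaObject` type are hypothetical names introduced for illustration, not identifiers from the repository.

```typescript
// Minimal shape of an exported object schema (assumed; real exports carry more keys).
type JsonSchemaObject = {
  $schema?: string;
  type?: string;
  properties?: Record<string, unknown>;
  required?: string[];
  [key: string]: unknown;
};

// Hypothetical helper: prepares a JSON Schema export for inclusion in a prompt.
function cleanSchemaForPrompt(schema: JsonSchemaObject, skipKeys: string[]): JsonSchemaObject {
  const skip = new Set(skipKeys);
  const cleaned: JsonSchemaObject = { ...schema };
  // The dialect URL carries no information the model needs.
  delete cleaned.$schema;
  if (schema.properties) {
    const kept: Record<string, unknown> = {};
    for (const key of Object.keys(schema.properties)) {
      if (!skip.has(key)) kept[key] = schema.properties[key];
    }
    cleaned.properties = kept;
  }
  if (schema.required) {
    // Filter `required` too, so it never names a property that was removed.
    cleaned.required = schema.required.filter((key) => !skip.has(key));
  }
  return cleaned;
}
```

Filtering `required` alongside `properties` matters: a schema that requires a key it no longer describes would be self-contradictory from the model's point of view.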
…etry

When `_validateAndNormalizeActions` rejected an action's params via
`actionInfo.paramSchema.safeParse(rawParams)`, the thrown error message
was the raw `paramsResult.error.message` — i.e. zod v4's default JSON
dump of the `issues` array (`[{"expected":"number","code":"invalid_type",
"path":["num_pages"],"message":"Invalid input: expected number, received
boolean"}]`). This noisy blob did flow into `state.last_result` and into
the next `create_state_messages` turn, but it was hard for the model to
parse and gave no corrective hint, so the model retried with the same
mistake until `max_failures=3` tripped.

Use `z.prettifyError(paramsResult.error)` (zod v4 native) to render
issues as readable lines (e.g. `✖ Invalid input: expected number,
received boolean → at num_pages`), include the offending params verbatim
so the model can diff against the schema, and tag the message with an
explicit `Schema validation failed` prefix plus a `Please retry with
parameters matching the action's schema exactly` instruction.

The existing pipeline does the rest: thrown Error → `_handle_step_error`
→ `state.last_result = [ActionResult({error: ...})]` → next step's
`create_state_messages` injects it into the LLM context. No new
injection mechanism, just a better-shaped payload going through the
existing one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
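
The shape of the retry payload described in the commit message above could be sketched as follows. This is an illustration under assumptions: `renderIssues` is a hand-written stand-in for `z.prettifyError` so the sketch runs without zod, and `buildRetryMessage`, the `Issue` type, and the exact line layout are hypothetical rather than taken from the repository.

```typescript
// Reduced issue shape (assumed); zod issues carry additional fields.
type Issue = { path: Array<string | number>; message: string };

// Hypothetical stand-in for z.prettifyError: one readable line per issue.
function renderIssues(issues: Issue[]): string {
  return issues
    .map((issue) => {
      const where = issue.path.length > 0 ? ` → at ${issue.path.join(".")}` : "";
      return `✖ ${issue.message}${where}`;
    })
    .join("\n");
}

// Builds the message that is thrown and later fed back to the model.
function buildRetryMessage(actionName: string, rawParams: unknown, issues: Issue[]): string {
  return [
    `Schema validation failed for action "${actionName}":`,
    renderIssues(issues),
    // Offending params verbatim, so the model can diff them against the schema.
    `Received params: ${JSON.stringify(rawParams)}`,
    "Please retry with parameters matching the action's schema exactly.",
  ].join("\n");
}
```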
So consumers can `npm install github:caffeinum/browser-use` (no #ref) and get
working code without needing a `prepare` script + devDeps + pnpm in their
build environment. Source of truth is still src/; dist/ should be rebuilt
and re-committed on top of any new src/ change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports the equivalent of upstream Python PR #3793 to the TS in-page DOM
walker. The Python fix targeted CDP DOMSnapshot omitting layout for shadow
DOM nodes; the TS port uses live DOM and gates inclusion on isElementVisible
(offsetWidth/Height > 0). Custom-styled shadow widgets (e.g. auth0 login)
can still report 0 dimensions on the inner control while being functional.

For input/button/select/textarea/a inside a shadow root, treat the element
as visible+top so isInteractiveElement runs and it lands in selector_map.

The browser-use provider auto-flips flash_mode (service.ts:599), which
selects the 15-line minimal prompt. That prompt lacked any guidance for
loop avoidance, retry strategy, or autocomplete handling — gaps the
fine-tuned bu-2-0 model was supposed to internalize but doesn't always.

Append a short <retry_strategy> block (5 generic rules, no URL or
provider matching) covering: same-action-3x, stuck-URL, dead clicks,
autocomplete value mismatch, and missing credentials. Keeps the prompt
small (15 -> 23 lines) without re-inflating to the full 269-line variant.
…arity)

Empirical observation: bu-2-0 (browser-use cloud LLM) occasionally emits
`{input_text: {index: true, text: "..."}}` (boolean) where the schema expects
an integer. Python upstream silently coerces `True -> 1` / `False -> 0` via
pydantic's default lax mode. zod (TS) hard-rejects with
`expected number, received boolean`, the agent retries with the same broken
output, and bails at max_failures. This was the observed bail mode in
production for every auth0-style form fill (daytona, zeroentropy, kernel,
browserbase).

This patch ports the lax-coercion behavior to TS at the validation boundary.
A `lenientInt(min)` helper preprocesses booleans into numbers before
delegating to `z.number().int().min(min)`. Same shape for `lenientNumber()`
covering `num_pages` / `pages` floats. Helper is applied only to
LLM-emitted index/element-index/page-count fields where pydantic's silent
coercion is documented behavior. Fields where bool->0/1 would be
semantically wrong (timeout, delay, max_results, coordinate_x/y) are left
strict to avoid masking a different model bug.

This is a graceful-degradation patch, not a fix to the model. bu-2-0 should
not emit booleans for integer fields. With this patch the agent makes
progress and relies on the existing retry feedback (PR webllm#34) to
self-correct on a bad index choice rather than looping to max_failures.

Helpers + per-schema regression coverage in
`test/coerce-boolean-to-int.test.ts` (23 tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
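
The boolean-to-integer coercion described in the commit message above can be sketched without zod as a preprocessing step followed by a strict integer check. This is a hedged illustration: the real `lenientInt(min)` helper delegates to `z.number().int().min(min)`, whereas here the validation is hand-written so the example is self-contained, and `coerceBoolToNumber` is a hypothetical name.

```typescript
// Preprocessing step: map booleans to 0/1, leave everything else untouched.
function coerceBoolToNumber(value: unknown): unknown {
  if (value === true) return 1;
  if (value === false) return 0;
  return value;
}

// Hand-written stand-in for
//   z.preprocess(coerceBoolToNumber, z.number().int().min(min))
function lenientInt(min: number): (value: unknown) => number {
  return (value: unknown): number => {
    const coerced = coerceBoolToNumber(value);
    if (typeof coerced !== "number" || !Number.isInteger(coerced)) {
      throw new TypeError(`expected integer, received ${JSON.stringify(coerced)}`);
    }
    if (coerced < min) {
      throw new RangeError(`expected integer >= ${min}, received ${coerced}`);
    }
    return coerced;
  };
}
```

Only booleans are coerced; a string such as `"3"` is still rejected, which keeps the helper narrower than a general `z.coerce.number()`. Note that the minimum interacts with the coercion: with `lenientInt(1)`, `false` becomes 0 and then fails the range check.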
Hand-captured request/response pair from a real BA invocation against the
daytona.io auth0 login page. The model emits {input_text: {index: true}},
which silently passes pydantic in upstream Python (True -> 1) but hard-fails
zod in this TS port. Useful for upstream model-bug reporting and as test
data for the boolean->int coercion fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nt coercion

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bu-2-0 emits {click: {index: false}} when targeting the first element
on a page (index 0). lenientInt coerces false → 0, but the previous
lenientInt(1) min check rejected 0. Result: BA bailed in 11 consecutive
validation failures every time the target was index 0 (e.g. GitHub
OAuth Continue button).

Index 0 is a valid DOM element index per the documented schema
(`index: int >= 0` upstream). Lower the minimum from lenientInt(1) to
lenientInt(0) for:
- ClickElementActionSchema
- ClickElementActionIndexOnlySchema
- DropdownOptionsActionSchema
- SelectDropdownActionSchema

Update tests in coerce-boolean-to-int.test.ts and controller.test.ts
to assert acceptance of index=0 instead of the previous reject.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Most clicks don't trigger a file download. The unawaited
page.waitForEvent('download', { timeout: 5000 }) rejects on timeout,
escaping as an unhandledRejection that kills the whole watcher
process. Attach a no-op .catch() so the timeout is treated as
"no download, proceed" — matching the existing perform_click
behavior at the second download-wait call site (parity, not a new
heuristic).

Repro: any SaaS dashboard with nested Stripe iframes (e.g.
browser-use.com/settings) where a side-nav click is heuristically
flagged as download-capable but is actually a route nav. Crash
visible in queue logs as
"unhandledRejection: page.waitForEvent: Timeout 5000ms exceeded".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bu-2-0 occasionally omits required boolean fields entirely. Python upstream
silently defaults them to false via pydantic's lax mode; zod hard-rejects
and the structured-output parse throws.

Observed on codesandbox.com run 2026-05-07__20-17-53__codesandbox-com-getting-started__f70cb0:
  Invalid input: expected boolean, received undefined  (path: is_correct)
which crashed _run_simple_judge twice and skipped the success-override gate.

Wrap the two strict required booleans (SimpleJudgeSchema.is_correct and
JudgeSchema.verdict) with the lenientBool helper introduced for bu-ts.
undefined/null -> false (judge declines to confirm pass), "true"/"false"
strings -> real bools. All other booleans in the codebase already use
.default() or .optional() so undefined was already handled there.

Pairs with PR webllm#34 (zod-error-feedback) so the LLM can self-correct on retry.

Refs: webllm/browser-use parity for SimpleJudgeSchema / JudgeSchema