fix(agent): coerce missing booleans in judge schemas (pydantic parity) #36
Closed
caffeinum wants to merge 14 commits into
…f dump
RegisteredAction.promptDescription previously serialized each property of
the zod object schema by stringifying its private `_def` AST. For schemas
with `.default()` wrappers (e.g. ScrollActionSchema's `down` and
`num_pages`), the LLM would see something like:
"num_pages": {"type":"default","innerType":{"def":{"type":"number"},...},"defaultValue":1},
"down": {"type":"default","innerType":{"def":{"type":"boolean"},...},"defaultValue":true}
The model would plausibly copy the nearby `defaultValue: true` and emit a
boolean for `num_pages`. The schema correctly rejected that value, the same
prompt was fed back, and the same mistake recurred until `max_failures=3`
tripped.
Replace the `_def` walk with `z.toJSONSchema(schema, {unrepresentable:'any'})`
(zod v4 native), strip the `$schema` dialect URL, and apply the existing
skipKeys filter to both `properties` and `required`. The LLM now sees:
{"type":"object","properties":{"down":{"default":true,"type":"boolean"},
"num_pages":{"default":1,"type":"number"},...}, "required":[...],
"additionalProperties":false}
— a familiar, well-known JSON Schema shape with no zod-internal leakage.
Surrounding `${description}: \n{${name}: ...}` envelope is unchanged so the
LLM sees the same outer layout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
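The post-processing step described in this commit can be sketched without zod, assuming the input is the plain object that `z.toJSONSchema` returns. The helper name `toPromptSchema`, the `ObjectJsonSchema` type, the extra `title` property, and the sample skip list are all hypothetical; only the stripping of `$schema` and the filtering of both `properties` and `required` come from the commit message.

```typescript
// Minimal shape of the JSON Schema object we care about (an assumed type,
// not zod's own return type).
interface ObjectJsonSchema {
  $schema?: string;
  type?: string;
  properties?: Record<string, unknown>;
  required?: string[];
  [key: string]: unknown;
}

// Hypothetical helper: drop the dialect URL and hide skipped keys from the LLM.
function toPromptSchema(
  jsonSchema: ObjectJsonSchema,
  skipKeys: readonly string[] = [],
): ObjectJsonSchema {
  const skip = new Set(skipKeys);
  const result: ObjectJsonSchema = { ...jsonSchema }; // shallow copy, input untouched
  delete result.$schema;

  if (jsonSchema.properties) {
    const kept: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(jsonSchema.properties)) {
      if (!skip.has(key)) kept[key] = value;
    }
    result.properties = kept;
  }
  if (jsonSchema.required) {
    result.required = jsonSchema.required.filter((key) => !skip.has(key));
  }
  return result;
}

// Example input, shaped like the ScrollActionSchema output quoted above,
// plus an illustrative `title` property that we want hidden.
const scrollJsonSchema: ObjectJsonSchema = {
  $schema: "https://json-schema.org/draft/2020-12/schema",
  type: "object",
  properties: {
    down: { default: true, type: "boolean" },
    num_pages: { default: 1, type: "number" },
    title: { type: "string" },
  },
  required: ["down", "num_pages", "title"],
  additionalProperties: false,
};

const promptSchema = toPromptSchema(scrollJsonSchema, ["title"]);
console.log(JSON.stringify(promptSchema));
```

Filtering `required` together with `properties` matters: a key that is hidden from `properties` but still listed as required would tell the model to supply a field it cannot see.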
…etry
When `_validateAndNormalizeActions` rejected an action's params via
`actionInfo.paramSchema.safeParse(rawParams)`, the thrown error message
was the raw `paramsResult.error.message` — i.e. zod v4's default JSON
dump of the `issues` array (`[{"expected":"number","code":"invalid_type",
"path":["num_pages"],"message":"Invalid input: expected number, received
boolean"}]`). This noisy blob did flow into `state.last_result` and into
the next `create_state_messages` turn, but it was hard for the model to
parse and gave no corrective hint, so the model retried with the same
mistake until `max_failures=3` tripped.
Use `z.prettifyError(paramsResult.error)` (zod v4 native) to render
issues as readable lines (e.g. `✖ Invalid input: expected number,
received boolean → at num_pages`), include the offending params verbatim
so the model can diff against the schema, and tag the message with an
explicit `Schema validation failed` prefix plus a `Please retry with
parameters matching the action's schema exactly` instruction.
The existing pipeline does the rest: thrown Error → `_handle_step_error`
→ `state.last_result = [ActionResult({error: ...})]` → next step's
`create_state_messages` injects it into the LLM context. No new
injection mechanism, just a better-shaped payload going through the
existing one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
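A rough sketch of the message shape described in this commit follows. The function name `formatValidationFailure` and the exact line layout are assumptions; in the real code the `prettyIssues` string would come from `z.prettifyError(paramsResult.error)`, which is passed in here as plain text so the example stays dependency-free.

```typescript
// Hypothetical formatter for the error that is fed back to the model.
function formatValidationFailure(
  actionName: string,
  prettyIssues: string,
  rawParams: unknown,
): string {
  return [
    // Explicit prefix so the model can recognize the failure class.
    `Schema validation failed for action "${actionName}":`,
    // Human-readable issue lines (already prettified by the caller).
    prettyIssues,
    // Offending params, verbatim, so the model can diff against the schema.
    `Received params: ${JSON.stringify(rawParams)}`,
    "Please retry with parameters matching the action's schema exactly.",
  ].join("\n");
}

const retryMessage = formatValidationFailure(
  "scroll",
  "✖ Invalid input: expected number, received boolean\n  → at num_pages",
  { down: true, num_pages: true },
);
console.log(retryMessage);
```

Because the result is an ordinary string, it can be thrown as an `Error` message and travel through the existing `_handle_step_error` path unchanged.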
So consumers can `npm install github:caffeinum/browser-use` (no #ref) and get working code without needing a `prepare` script + devDeps + pnpm in their build environment. Source of truth is still src/; dist/ should be rebuilt and re-committed on top of any new src/ change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports the equivalent of upstream Python PR #3793 to the TS in-page DOM walker. The Python fix targeted CDP DOMSnapshot omitting layout for shadow DOM nodes; the TS port uses live DOM and gates inclusion on isElementVisible (offsetWidth/Height > 0). Custom-styled shadow widgets (e.g. auth0 login) can still report 0 dimensions on the inner control while being functional. For input/button/select/textarea/a inside a shadow root, treat the element as visible+top so isInteractiveElement runs and it lands in selector_map.
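The shadow-root exception above might look roughly like the following. This is a sketch over a minimal stand-in type rather than the real DOM walker: `ElementLike`, the `inShadowRoot` flag, and `shouldConsiderForSelectorMap` are hypothetical names, and the size check is assumed to require both dimensions to be positive.

```typescript
// Minimal stand-in for the DOM fields the check reads (not the real Element type).
interface ElementLike {
  tagName: string;
  offsetWidth: number;
  offsetHeight: number;
  // Assumed to be computed by the walker, e.g. by inspecting the root node.
  inShadowRoot: boolean;
}

// Tags named in the commit message: input/button/select/textarea/a.
const SHADOW_FORM_TAGS = new Set(["INPUT", "BUTTON", "SELECT", "TEXTAREA", "A"]);

// Size-based visibility check (assumption: both dimensions must be > 0).
function isElementVisible(el: ElementLike): boolean {
  return el.offsetWidth > 0 && el.offsetHeight > 0;
}

// Hypothetical gate: form controls inside a shadow root bypass the size check,
// so the interactivity test still runs on them.
function shouldConsiderForSelectorMap(el: ElementLike): boolean {
  if (el.inShadowRoot && SHADOW_FORM_TAGS.has(el.tagName.toUpperCase())) {
    return true;
  }
  return isElementVisible(el);
}

const shadowInput: ElementLike = {
  tagName: "input",
  offsetWidth: 0,
  offsetHeight: 0,
  inShadowRoot: true,
};
console.log(shouldConsiderForSelectorMap(shadowInput));
```

The exception is deliberately narrow: a zero-size `div` inside a shadow root, or a zero-size `input` in the light DOM, is still filtered out by the ordinary size check.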
The browser-use provider auto-flips flash_mode (service.ts:599), which selects the 15-line minimal prompt. That prompt lacked any guidance for loop avoidance, retry strategy, or autocomplete handling — gaps the fine-tuned bu-2-0 model was supposed to internalize but doesn't always. Append a short <retry_strategy> block (5 generic rules, no URL or provider matching) covering: same-action-3x, stuck-URL, dead clicks, autocomplete value mismatch, and missing credentials. Keeps the prompt small (15 -> 23 lines) without re-inflating to the full 269-line variant.
…arity)
Empirical observation: bu-2-0 (browser-use cloud LLM) occasionally emits
`{input_text: {index: true, text: "..."}}` (boolean) where the schema expects
an integer. Python upstream silently coerces `True -> 1` / `False -> 0` via
pydantic's default lax mode. zod (TS) hard-rejects with
`expected number, received boolean`, the agent retries with the same broken
output, and bails at max_failures. This was the observed bail mode in
production for every auth0-style form fill (daytona, zeroentropy, kernel,
browserbase).
This patch ports the lax-coercion behavior to TS at the validation boundary.
A `lenientInt(min)` helper preprocesses booleans into numbers before
delegating to `z.number().int().min(min)`. Same shape for `lenientNumber()`
covering `num_pages` / `pages` floats. Helper is applied only to
LLM-emitted index/element-index/page-count fields where pydantic's silent
coercion is documented behavior. Fields where bool->0/1 would be
semantically wrong (timeout, delay, max_results, coordinate_x/y) are left
strict to avoid masking a different model bug.
This is a graceful-degradation patch, not a fix to the model. bu-2-0 should
not emit booleans for integer fields. With this patch the agent now
progresses + relies on existing retry-feedback (PR webllm#34) to self-correct on
bad index choice rather than looping to max_failures.
Helpers + per-schema regression coverage in
`test/coerce-boolean-to-int.test.ts` (23 tests).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
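The coerce-then-validate idea behind `lenientInt(min)` can be illustrated with a standalone stand-in that does not use zod. In the patch the boolean-to-number step runs as a preprocessing stage in front of `z.number().int().min(min)`; here the strict check is hand-written, and the `ParseResult` type and error strings are hypothetical.

```typescript
type ParseResult =
  | { ok: true; value: number }
  | { ok: false; error: string };

// The preprocessing step: true -> 1, false -> 0, everything else untouched.
function booleanToNumber(value: unknown): unknown {
  return typeof value === "boolean" ? Number(value) : value;
}

// Standalone stand-in for lenientInt(min): coerce first, then validate strictly.
function lenientInt(min: number): (value: unknown) => ParseResult {
  return (value) => {
    const coerced = booleanToNumber(value);
    if (typeof coerced !== "number" || !Number.isInteger(coerced)) {
      return { ok: false, error: `expected integer, received ${typeof coerced}` };
    }
    if (coerced < min) {
      return { ok: false, error: `expected integer >= ${min}, received ${coerced}` };
    }
    return { ok: true, value: coerced };
  };
}

const parseIndex = lenientInt(0);
console.log(parseIndex(true));  // accepted as 1
console.log(parseIndex(false)); // accepted as 0
console.log(parseIndex("1"));   // still rejected: strings are not coerced
```

Only booleans are widened, so other malformed inputs keep failing loudly. Note that the `min` bound is applied after coercion: with `lenientInt(1)`, a `false` becomes 0 and is rejected.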
Hand-captured request/response pair from a real BA invocation against the
daytona.io auth0 login page. The model emits {input_text: {index: true}}
which silently passes pydantic in upstream python (True -> 1) but hard-fails
zod in this TS port. Useful for upstream model-bug reporting and as test
data for the boolean->int coercion fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nt coercion Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bu-2-0 emits {click: {index: false}} when targeting the first element
on a page (index 0). lenientInt coerces false → 0, but the previous
lenientInt(1) min check rejected 0. Result: BA bailed in 11 consecutive
validation failures every time the target was index 0 (e.g. GitHub
OAuth Continue button).
Index 0 is a valid DOM element index per the documented schema
(`index: int >= 0` upstream). Drop min=1 to lenientInt(0) for:
- ClickElementActionSchema
- ClickElementActionIndexOnlySchema
- DropdownOptionsActionSchema
- SelectDropdownActionSchema
Update tests in coerce-boolean-to-int.test.ts and controller.test.ts
to assert acceptance of index=0 instead of the previous reject.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Most clicks don't trigger a file download. The unawaited
page.waitForEvent('download', { timeout: 5000 }) rejects on timeout,
escaping as an unhandledRejection that kills the whole watcher
process. Attach a no-op .catch() so the timeout is treated as
"no download, proceed" — matching the existing perform_click
behavior at the second download-wait call site (parity, not a new
heuristic).
Repro: any SaaS dashboard with nested Stripe iframes (e.g.
browser-use.com/settings) where a side-nav click is heuristically
flagged as download-capable but is actually a route nav. Crash
visible in queue logs as
"unhandledRejection: page.waitForEvent: Timeout 5000ms exceeded".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
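The failure mode and the fix can be reproduced without Playwright. In the sketch below, `waitForDownload` is a stand-in for `page.waitForEvent('download', { timeout })` that always times out, and `clickAndMaybeDownload` is a hypothetical wrapper; mapping the rejection to `null` is one way to express "no download, proceed" and is an assumption about the shape, not the actual patch.

```typescript
// Stand-in for page.waitForEvent('download', { timeout }): rejects when no
// download starts in time. The real Playwright call is not used here.
function waitForDownload(timeoutMs: number): Promise<string> {
  return new Promise((_resolve, reject) => {
    setTimeout(
      () => reject(new Error(`Timeout ${timeoutMs}ms exceeded`)),
      timeoutMs,
    );
  });
}

// Hypothetical click wrapper: start listening, click, then see what happened.
async function clickAndMaybeDownload(
  click: () => Promise<void>,
  timeoutMs: number,
): Promise<string | null> {
  // The handler is attached immediately, so a later timeout cannot surface
  // as an unhandledRejection; it resolves to null ("no download") instead.
  const downloadOrNull: Promise<string | null> = waitForDownload(timeoutMs).catch(
    () => null,
  );
  await click();
  return downloadOrNull;
}

const downloadResult = clickAndMaybeDownload(async () => {}, 20);
downloadResult.then((download) => console.log("download:", download));
```

The key detail is that `.catch()` is chained when the promise is created, not when it is later awaited. A promise that rejects before any handler is attached is what triggers the process-level `unhandledRejection`.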
bu-2-0 occasionally omits required boolean fields entirely. python upstream
silently defaults them to false via pydantic's lax mode; zod hard-rejects
and the structured-output parse throws.

Observed on codesandbox.com run
2026-05-07__20-17-53__codesandbox-com-getting-started__f70cb0:

  Invalid input: expected boolean, received undefined (path: is_correct)

which crashed _run_simple_judge twice and skipped the success-override gate.

Wrap the two strict required booleans (SimpleJudgeSchema.is_correct and
JudgeSchema.verdict) with the lenientBool helper introduced for bu-ts.
undefined/null -> false (judge declines to confirm pass), "true"/"false"
strings -> real bools.

All other booleans in the codebase already use .default() or .optional() so
undefined was already handled there.

Pairs with PR webllm#34 (zod-error-feedback) so the LLM can self-correct on
retry.

Refs: webllm/browser-use parity for SimpleJudgeSchema / JudgeSchema
Problem

bu-2-0 occasionally omits required boolean fields from structured outputs entirely. Python upstream silently defaults them to `false` via pydantic's lax mode; zod hard-rejects and the structured-output parse throws.

Observed on a real eval run (codesandbox.com getting-started, 2026-05-07): `Invalid input: expected boolean, received undefined` (path: `is_correct`).

This crashed `_run_simple_judge` and skipped the success-override gate. The same pattern can hit `_judge_trace`'s `verdict` field.

Fix

Wrap the two strict required booleans (`SimpleJudgeSchema.is_correct` and `JudgeSchema.verdict`) with a `lenientBool` preprocessor:

- `undefined` / `null` → default `false` (judge declines to confirm pass; fail-safe)
- `"true"` / `"false"` strings → real booleans

All other booleans in the codebase already use `.default()` or `.optional()`, so they already accept `undefined`; only these two strict required fields need the helper.

Why default `false`

The simple-judge and trace-judge both gate `success → failure` overrides. When the model can't commit, treating the run as not-yet-correct is the conservative choice: an unconfirmed pass becomes a fail rather than letting a bogus success slip through.

Pairs with

`index` fields

Notes

`dist/` per existing fork install pattern
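
The coercion rules listed under Fix can be sketched as a plain function followed by a strict boolean check. This is a standalone illustration, not the `lenientBool` helper itself: `coerceLenientBool` and `parseVerdict` are hypothetical names, and matching only the exact lowercase strings `"true"` and `"false"` is an assumption.

```typescript
// The preprocessing step described for lenientBool (standalone sketch).
function coerceLenientBool(value: unknown): unknown {
  if (value === undefined || value === null) return false; // fail-safe default
  if (value === "true") return true;
  if (value === "false") return false;
  return value; // anything else is left for the strict boolean check
}

// Hypothetical strict check applied after coercion, standing in for the
// boolean schema that the preprocessor wraps.
function parseVerdict(raw: { is_correct?: unknown }): boolean {
  const coerced = coerceLenientBool(raw.is_correct);
  if (typeof coerced !== "boolean") {
    throw new Error(`expected boolean, received ${typeof coerced}`);
  }
  return coerced;
}

console.log(parseVerdict({}));                     // false
console.log(parseVerdict({ is_correct: "true" })); // true
```

A missing field resolves to `false`, so a judge response without a verdict can never be read as a confirmed pass, while values such as numbers still fail the strict check instead of being guessed at.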