Add ClawBench to GUI Agent benchmarks by reacher-z · Pull Request #1 · dataanswer/awesome-agent-benchmarks

reacher-z · 2026-05-20T22:39:08Z

Adds ClawBench to the GUI Agent table.

ClawBench evaluates browser agents on live production websites (the real Uber Eats, Indeed, Craigslist, etc., not Docker mocks). Scoring has two stages: a deterministic check that the intercepted HTTP request matches the per-task URL/method schema, then an LLM judge on the intercepted payload. An agent that hits the right endpoint but submits the wrong thing therefore fails.
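For reviewers unfamiliar with this style of scoring, here is a minimal sketch of how a two-stage check of this kind could be wired up. It is not ClawBench's actual code: `InterceptedRequest`, `TaskSchema`, `stage_one`, `score`, the example URL, and the stub judge are all hypothetical names made up for illustration, and the LLM judge is replaced by a plain Python callable.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InterceptedRequest:
    """Hypothetical record of one HTTP request captured during an agent run."""
    method: str
    url: str
    payload: dict


@dataclass
class TaskSchema:
    """Hypothetical per-task schema: expected HTTP method and a URL regex."""
    method: str
    url_pattern: str


def stage_one(requests: list[InterceptedRequest],
              schema: TaskSchema) -> Optional[InterceptedRequest]:
    """Deterministic stage: return the first request matching method and URL."""
    for req in requests:
        same_method = req.method.upper() == schema.method.upper()
        if same_method and re.fullmatch(schema.url_pattern, req.url):
            return req
    return None


def score(requests: list[InterceptedRequest],
          schema: TaskSchema,
          judge: Callable[[dict], bool]) -> bool:
    """Pass only if stage one finds a match AND the judge accepts its payload."""
    matched = stage_one(requests, schema)
    if matched is None:
        return False  # wrong endpoint or method: the judge is never consulted
    return judge(matched.payload)


# Assumed example task: order a margherita pizza (URL and fields are invented).
schema = TaskSchema(method="POST", url_pattern=r"https://example\.com/api/orders/?")


def stub_judge(payload: dict) -> bool:
    """Stand-in for the LLM judge; a real judge would call a model."""
    return payload.get("item") == "margherita pizza"


right = [InterceptedRequest("POST", "https://example.com/api/orders",
                            {"item": "margherita pizza"})]
wrong_payload = [InterceptedRequest("POST", "https://example.com/api/orders",
                                    {"item": "pepperoni pizza"})]
wrong_endpoint = [InterceptedRequest("GET", "https://example.com/api/menu", {})]

print(score(right, schema, stub_judge))           # True
print(score(wrong_payload, schema, stub_judge))   # False
print(score(wrong_endpoint, schema, stub_judge))  # False
```

The second case is the one the description calls out: the request reaches the expected endpoint, so the deterministic stage passes, but the payload check rejects it.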

Paper: arXiv:2604.08523
Tasks: V1 = 153 across 144 sites · V2 = 130 across 63 platforms · 15 life categories
Code: https://github.com/reacher-z/ClawBench · Live leaderboard: https://claw-bench.com
HF datasets: tasks · traces V1 · traces V2

Sits next to WebArena / Mind2Web / WorkArena in the table. It is distinguished by live (not Docker-hosted) sites and payload-correctness scoring (not just element/action match).

Affiliation disclosure: I'm one of the maintainers; happy to adjust columns/copy or drop entirely.

Add ClawBench to GUI Agent benchmarks

49b4704

reacher-z wants to merge 1 commit into dataanswer:main from reacher-z:add-clawbench
