Two ways to solve the same admin task, measured head-to-head:
- Browser-use agent — an LLM driving a real browser through the UI.
- Structured-API agent — an LLM calling typed HTTP tools that back the same app.
Both target a Reflex port of the react-admin "Posters Galore" demo, so the comparison is one app, two interfaces — not two different apps.
"A customer named Smith has complained about a recent order. Find the Smith with the most orders, accept all their pending reviews, and mark their most-recent ordered order as delivered."
Validated against expected_outcome.json:
| Expected | Value |
|---|---|
| Customer | Gary Smith (ID 421) |
| Order to mark delivered | #98 (ref 5WUJSYV5) |
| Reviews to accept | IDs 0, 49, 292, 293 |
Data is pinned in seed.json (900 customers, 600 orders, 324 reviews).
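For orientation, the ground truth is re-derivable from the pinned data. A sketch, assuming seed.json keeps the upstream react-admin shape under `customers`/`orders` keys with `first_name`/`last_name`, `customer_id`, `status`, and ISO `date` fields — data.py is the authority on the actual schema:

```python
import json
from collections import Counter

with open("seed.json") as f:
    seed = json.load(f)

# The Smith with the most orders (field names assumed from the upstream demo).
smith_ids = {c["id"] for c in seed["customers"] if c["last_name"] == "Smith"}
orders_per_smith = Counter(o["customer_id"] for o in seed["orders"]
                           if o["customer_id"] in smith_ids)
top_smith, n_orders = orders_per_smith.most_common(1)[0]

# Their most recent order still in "ordered" status — the one to mark delivered.
to_deliver = max(
    (o for o in seed["orders"]
     if o["customer_id"] == top_smith and o["status"] == "ordered"),
    key=lambda o: o["date"],  # ISO dates sort lexicographically
)
print(top_smith, to_deliver["id"])  # should match expected_outcome.json: 421, 98
```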
Cells show mean ± sample standard deviation (n−1) across n trials; min–max ranges for the wider groups are noted in the caveat below. Per-trial JSONs are in results/ (numbered files for the multi-trial groups, plus the untouched original n=1 file from the first commit). See Limitations and caveats before drawing conclusions.
| Run | Model | Vision | n | Time (s) | Reasoning units | Input tokens | Output tokens | Cache read | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| API agent | Sonnet 4 | n/a | 5 | 19.7 ± 2.8 | 8 ± 0 tool calls (~14 HTTP) | 12,151 ± 27 | 934 ± 41 | n/a | 5/5 ✅ |
| API agent | Haiku 4.5 | n/a | 5 | 7.7 ± 0.5 | 8 ± 0 tool calls (~14 HTTP) | 9,478 ± 809 | 819 ± 52 | n/a | 5/5 ✅ |
| Browser agent | Sonnet 4 | yes | 3 | 1003 ± 254 | 53 ± 13 LLM cycles | 550,976 ± 178,849 | 37,962 ± 10,850 | 529,176 ± 127,502 | 3/3 ✅ |
| Browser agent | Haiku 4.5 | yes | 1 | 87.75 | 1 LLM cycle | 2,390 | 614 | 0 | ❌ no final result |
| Browser agent | Haiku 4.5 | no | 1 | 92.96 | 3 LLM cycles | 66 | 2,290 | 30,732 | ❌ no final result |
Reasoning units are not directly comparable across rows. An API tool call is one Anthropic request — but each tool maps to 1–3 HTTP requests against the Reflex backend (see run_api_agent.py). A browser-agent "LLM cycle" is one screenshot/DOM-reason-act loop in browser-use. For dollar-cost comparison, look at input/output tokens.
The browser-Sonnet group has wide spread: the longest run was 1,296 s / 68 cycles / 751k input tokens; the shortest was 853 s / 43 cycles / 407k input — nearly a 2× range across just 3 trials. The large standard deviations on every browser metric reflect this. API runs were tight by comparison: API Sonnet hit 8 tool calls every trial and varied ±27 input tokens.
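The per-trial files make these aggregates easy to recompute. A sketch, assuming the multi-trial files follow a numbered naming pattern (the exact filenames are whatever run_matrix.sh wrote) and carry the fields listed under "What gets measured" below:

```python
import json
import statistics
from pathlib import Path

def summarize(pattern: str, field: str) -> str:
    """Mean ± sample stdev (n−1) and min–max for one metric across trials."""
    values = [json.loads(p.read_text())[field]
              for p in sorted(Path("results").glob(pattern))]
    mean = statistics.mean(values)
    # statistics.stdev is the sample (n−1) standard deviation, as in the table.
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    return f"{mean:,.1f} ± {sd:,.1f} (min {min(values):,}, max {max(values):,})"

print(summarize("api_sonnet_*.json", "input_tokens"))
print(summarize("browser_sonnet_*.json", "elapsed_seconds"))
```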
```
benchmark/
├── seed.json                 pinned dataset (shared)
├── expected_outcome.json     task success criteria
├── reflex-admin/             Reflex port of Posters Galore
│   ├── reflex_admin/
│   │   ├── reflex_admin.py   app + page routes
│   │   ├── state.py          Reflex state — UI handlers (also serve as API endpoints via the plugin)
│   │   ├── pages/            customers, orders, reviews
│   │   └── data.py           in-memory datastore over seed.json
│   ├── run_api_agent.py      API agent runner (tool-use)
│   ├── rxconfig.py           rxe.Config + EventHandlerAPIPlugin
│   └── requirements.txt
├── browser-use-agent/
│   ├── run_browser_agent.py  browser-use runner w/ token counting
│   ├── pyproject.toml
│   └── .env.example          copy to .env, set ANTHROPIC_API_KEY
└── results/                  benchmark output JSONs
```
rxe.EventHandlerAPIPlugin (configured in rxconfig.py) auto-generates an HTTP endpoint for every event handler on the State — set_customers_query, load_order, accept_review, etc. There is no API-specific code in reflex_admin/state.py: the same handlers that drive the UI also serve the API. This is the point of the benchmark — measuring an agent against Reflex's "free" API surface, not a hand-shaped REST layer.
Responses stream as NDJSON state deltas, including recomputed dependent computed vars (customer_rows, order_rows, selected_order, etc.). The agent's REST-shaped tool surface (list_customers, update_order, etc.) is mapped in run_api_agent.py to handler sequences — e.g. update_order is load_order → set_order_status_draft → save_order_status, mirroring the order detail page's UI flow.
The plugin is part of reflex-enterprise and requires rxe.App() / rxe.Config().
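For a feel of the wire protocol, here is a sketch of one handler call. Assumptions: `requests` is installed, the handler name appends directly to the /_reflex/event/ base, and the JSON body is the handler's keyword arguments — the generated openapi.yaml is the authority on the real paths and schemas:

```python
import json
import requests

BASE = "http://localhost:8001/_reflex/event"

def call_handler(name: str, **kwargs) -> list[dict]:
    """POST one event handler and collect its streamed NDJSON state deltas."""
    deltas = []
    with requests.post(f"{BASE}/{name}", json=kwargs, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                deltas.append(json.loads(line))  # one state delta per line
    return deltas

# update_order as the three-handler sequence from run_api_agent.py,
# mirroring the order detail page's UI flow:
call_handler("load_order", order_id=98)
call_handler("set_order_status_draft", status="delivered")
call_handler("save_order_status")
```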
- Python 3.12
- Node.js + `bun` (`npm install -g bun`) — Reflex needs `bun` for its dev server
- `ANTHROPIC_API_KEY` in `browser-use-agent/.env` (copy from `.env.example`)
```bash
cd reflex-admin
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
reflex login   # one-time, for reflex-enterprise
reflex init    # only first time
reflex run     # frontend :3001, backend :8001
```

This serves:

- UI → http://localhost:3001 (for the browser-use agent)
- API → http://localhost:8001/_reflex/event/... (for the API agent)
- OpenAPI spec → http://localhost:8001/_reflex/events/openapi.yaml
API agent:

```bash
cd reflex-admin
source .venv/bin/activate
export ANTHROPIC_API_KEY=...
python run_api_agent.py --model claude-sonnet-4-20250514 --out ../results/api_sonnet.json
python run_api_agent.py --model claude-haiku-4-5-20251001 --out ../results/api_haiku.json
```

Browser-use agent:

```bash
cd browser-use-agent
uv sync   # or: pip install -r requirements.txt
uv run python run_browser_agent.py --model claude-sonnet-4-20250514 --vision --out ../results/browser_sonnet.json
uv run python run_browser_agent.py --model claude-haiku-4-5-20251001 --vision --out ../results/browser_haiku_vision.json
uv run python run_browser_agent.py --model claude-haiku-4-5-20251001 --no-vision --out ../results/browser_haiku_no_vision.json
```

Both Haiku browser runs in results/ failed to produce a final answer (`final_result: null`) without raising an exception (`error: null`) — browser-use's agent loop exited cleanly after 1 cycle (vision) or 3 cycles (no-vision) without completing the task. The runner catches and records exceptions in the `error` field, but in these runs there was no exception to catch. Re-running may produce different outcomes; the current results are kept as-is rather than retried until success.
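That makes three distinguishable outcomes per result file — crashed, clean-but-empty, and answered. A small sketch that classifies them, using the fields listed under "What gets measured" below:

```python
import json
from pathlib import Path

for path in sorted(Path("results").glob("*.json")):
    r = json.loads(path.read_text())
    answer = r.get("final_answer") or r.get("final_result")
    if r.get("error"):
        status = "crashed"          # exception caught and recorded by the runner
    elif answer is None:
        status = "no final result"  # clean exit, task not completed (the Haiku runs)
    else:
        status = "answered"         # still needs checking against expected_outcome.json
    print(f"{path.name}: {status}")
```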
```bash
./run_matrix.sh   # starts Reflex fresh per run (seed reload), writes to ./results/
```

The script restarts Reflex before every run — the in-memory datastore mutates during a run (reviews accepted, orders updated), so without a restart the next agent sees dirty state. It also verifies clean state before each run (order 98 "ordered", reviews 0/49/292/293 pending) and aborts that run if the state is dirty — roughly the check sketched below.
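A rough Python equivalent of that pre-flight check. Handler name, payload, and delta shape are all assumptions carried over from the earlier wire-protocol sketch; run_matrix.sh is the authority:

```python
import json
import requests

# Load order 98 and read its status out of the streamed deltas; each delta is
# assumed to be one JSON object of {state_name: {var: value}} per line.
with requests.post("http://localhost:8001/_reflex/event/load_order",
                   json={"order_id": 98}, stream=True) as resp:
    resp.raise_for_status()
    deltas = [json.loads(line) for line in resp.iter_lines() if line]

order = next(
    (vars_["selected_order"] for delta in deltas for vars_ in delta.values()
     if isinstance(vars_, dict) and "selected_order" in vars_),
    None,
)
assert order and order["status"] == "ordered", "dirty state — restart Reflex first"
```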
Per run, the result JSON contains:
- `model`
- `elapsed_seconds` — wall time
- `input_tokens`, `output_tokens`, `total_tokens`
- `cache_read_tokens`, `cache_creation_tokens` (browser-use)
- `tool_calls` (API) or `llm_calls` (browser-use)
- `final_answer` / `final_result`
- `error` (browser-use only — records failures)
`total_tokens` is `input_tokens` + `output_tokens` only — it does not include `cache_read_tokens`. One browser-use Sonnet run, for example, reports `total_tokens: 323,287` alongside `cache_read_tokens: 332,032`; the model actually processed ~631k tokens of input once cached prefixes are counted. For raw cost-of-prompt comparisons, sum `input_tokens` + `cache_read_tokens` instead.
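As a formula over one result file (filename taken from the run commands above; the multi-trial groups use numbered variants):

```python
import json

with open("results/browser_sonnet.json") as f:
    r = json.load(f)

reported = r["total_tokens"]                              # input_tokens + output_tokens only
prompt_volume = r["input_tokens"] + r.get("cache_read_tokens", 0)
print(f"total_tokens: {reported:,}  input incl. cache reads: {prompt_volume:,}")
```

Note that cache reads are billed at a discounted rate, so this is a volume comparison, not a dollar one.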
The browser-use script counts tokens by subclassing `ChatAnthropic`, overriding `ainvoke`, and accumulating each call's `ChatInvokeCompletion.usage` — browser-use's own `AgentHistoryList` does not surface usage.
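A minimal sketch of that pattern, assuming browser-use exposes `ChatAnthropic` under `browser_use.llm` and that the usage object carries `prompt_tokens`-style fields (both assumptions — check the installed release; the real implementation is in run_browser_agent.py):

```python
from browser_use.llm import ChatAnthropic  # import path is an assumption; varies by release

totals = {"input": 0, "output": 0, "cache_read": 0, "llm_calls": 0}

class CountingChatAnthropic(ChatAnthropic):
    """Accumulates ChatInvokeCompletion.usage per ainvoke call."""

    async def ainvoke(self, *args, **kwargs):
        completion = await super().ainvoke(*args, **kwargs)
        usage = completion.usage
        if usage is not None:
            # Field names are assumptions about browser-use's usage object.
            totals["input"] += getattr(usage, "prompt_tokens", 0) or 0
            totals["output"] += getattr(usage, "completion_tokens", 0) or 0
            totals["cache_read"] += getattr(usage, "prompt_cached_tokens", 0) or 0
        totals["llm_calls"] += 1
        return completion
```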
Read these before drawing conclusions from the numbers.
- Sample sizes are small. API rows are n=5; browser-Sonnet is n=3; the two browser-Haiku rows are still n=1 (they did not produce a final result on the first try and were not retried). Means ± standard deviations and min–max ranges are reported, but with this few trials the ranges are descriptive, not statistically meaningful.
- The API agent's tool surface is hand-written. `EventHandlerAPIPlugin` auto-generates the HTTP endpoints with no work on the app side — that part is genuinely zero-overhead. But `run_api_agent.py` defines a REST-shaped tool surface (`list_customers`, `update_order`, ...) and maps each tool to a sequence of raw-handler POSTs (e.g. `update_order` → `load_order` + `set_order_status_draft` + `save_order_status`). That mapping is human-authored. It could plausibly be auto-generated from the plugin's OpenAPI spec, or skipped entirely by exposing the raw handlers to the agent — neither variant is implemented here.
- Tool calls and LLM cycles are different units. The API agent's `tool_calls: 8` is 8 Anthropic requests; on the wire, those expand to ~14 HTTP requests against Reflex (the multi-step handler sequences above). The browser agent's `llm_calls` counts screenshot-reason-act loops in browser-use. Don't compare these counts directly; token totals are the more honest cost metric.
- Single task. "Find the Smith with the most orders, accept their pending reviews, mark their latest ordered order as delivered" is one workflow on one app. Conclusions about agent strategies in general should not lean on this dataset.
- Cache reads are not in `total_tokens`. See "What gets measured" for the formula. A side-by-side comparison of `total_tokens` substantially understates browser-agent input volume.
- Both Haiku browser runs failed. Both Haiku browser variants returned `final_result: null` with no exception. Counting them as data points (not errors) is a deliberate choice — the runs completed without crashing — but neither row should be read as "Haiku can't do this task," only as "Haiku didn't do this task in this single attempt."
- Reflex defaults (frontend :3000, backend :8000) are overridden to :3001/:8001 so the upstream react-admin demo (if you set it up) can run on :8000 side-by-side.
- An `ANTHROPIC_API_KEY` with a 10k input-tokens/min rate limit will make browser-use runs slow due to retries — expected.
- The Reflex app's data lives in memory and is not persisted between restarts; restarting reloads seed.json, which is how state is reset between runs.
The Reflex app in this repo is a port of marmelab/react-admin's "Posters Galore" demo (examples/demo). The benchmark itself does not need react-admin — it targets the Reflex port at :3001/:8001. To stand the original demo up against the same pinned dataset, see react-admin-setup/README.md: clone upstream, drop in seed.json, apply data-generator.patch, run make run-demo.