An evaluation and benchmarking framework for computer-use capabilities across AI models. Tests click accuracy, screen selection, visual reasoning, and multi-monitor coordination in controlled Docker environments.
On the 5-test ablation suite, both models achieve near-perfect accuracy with the universal harness:
| Model | Vision | Color | Location | Click | Decoy | Score |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | PASS | PASS | 0px | 0px | 0px | 5/5 |
| GPT-5.4 | PASS | PASS | 2px | 2px | 1px | 5/5 |
Two Firefox windows, one per screen. Tests screen selection and cross-screen clicking:
| Model | S1 Clicks | S2 Clicks | Disambiguation | Multi-click | Score |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 0-3px | 1-4px | 0px (Save S1 vs S2) | S1→S2 2px | 10/10 |
| GPT-5.4 | 1px | 0-8px | 2px | S1→S2 2px | 9/10 |
Single Firefox window dragged across two asymmetric monitors. This is the hardest scenario — the monitors differ in resolution and scaling, and the model must determine which screen each element falls on:
| Model | S1 Clicks (1280x720 scaled) | S2 Clicks (1024x768 native) | Screen Selection | Score |
|---|---|---|---|---|
| Claude Opus 4.6 | 1-3px | 1-3px | 100% correct | 9/10 |
| GPT-5.4 | 1-17px | 0-5px | 100% correct | 9/10 |
- Both models click with 0-3px accuracy when screenshots are passed correctly
- The harness matters more than the model — GPT-5.4 went from 0% to 100% by fixing how screenshots are passed (user-message images, not just tool results)
- Multi-screen works — per-screen screenshots with burned "SCREEN N" labels give near-perfect screen selection with 0-5px coordinate accuracy
- Asymmetric resolutions work — 1920x1080 + 1024x768 with screen scaling (1.5x) doesn't degrade accuracy
- Self-calibration is essential — the HTML pages report their own element positions via document.title, so the eval works regardless of Firefox chrome, window position, or screen layout
- Page state must be reset between models — without refreshing, accumulated clicks/scrolls from model A corrupt calibration for model B
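The eval side of that self-calibration handshake can be sketched as a title parser: the page encodes element positions in document.title and the runner reads them back. The title format below is hypothetical (the real pages' encoding isn't shown here); it only illustrates that expected click targets come from the page itself, never from hard-coded window geometry.

```typescript
// Hypothetical title format, e.g. "CAL|save=812,430|cancel=910,431".
// The actual pages may encode positions differently; this is a sketch.
type Point = { x: number; y: number };

function parseCalibrationTitle(title: string): Record<string, Point> {
  const targets: Record<string, Point> = {};
  // Skip the leading marker, then parse each "name=x,y" segment.
  for (const part of title.split("|").slice(1)) {
    const [name, coords] = part.split("=");
    const [x, y] = coords.split(",").map(Number);
    targets[name] = { x, y };
  }
  return targets;
}
```

Because the page recomputes these positions whenever it loads, refreshing between model runs (the reset rule above) is enough to restore a clean calibration baseline.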
┌─────────────────────────────────────────┐
│ LLMProvider (interface) │
│ Implement sendTurn() to add a provider │
├─────────────┬─────────────┬─────────────┤
│ OpenAI │ Bedrock │ Add yours: │
│ Provider │ Anthropic │ implement │
│ (Chat │ Provider │ sendTurn() │
│ Completions)│ (InvokeModel│ ~50 lines │
│ │ API) │ │
└──────┬──────┴──────┬──────┴──────┬──────┘
└─────────────┼─────────────┘
▼
┌─────────────────────────┐
│ UniversalCUHarness │
│ - Tool definitions │
│ - Agent loop │
│ - Tool execution │
│ - Step tracking │
│ - Multi-screen support │
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ CUAPlatform │
│ LinuxPlatform (Docker) │
│ - xte for mouse/kbd │
│ - scrot for screenshots│
│ - ImageMagick for crop │
│ - Per-screen capture │
└─────────────────────────┘
▼
┌─────────────────────────┐
│ Docker Container │
│ Ubuntu + openbox + VNC │
│ Asymmetric dual-screen │
│ Self-calibrating HTML │
└─────────────────────────┘
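The platform layer in the diagram can be approximated as a small interface. The method names below are illustrative assumptions (the source only names CUAPlatform, ScreenInfo, and LinuxPlatform); the point is that because the harness talks to an interface, the Docker/xte backend and a test double are interchangeable.

```typescript
// Sketch only -- the SDK's real CUAPlatform/ScreenInfo signatures may differ.
interface ScreenInfo { index: number; width: number; height: number; xOffset: number }

interface CUAPlatform {
  screens(): ScreenInfo[];
  click(x: number, y: number, screen?: number): void;
  screenshot(screen?: number): string; // e.g. a file path or base64 PNG
}

// A test double that records actions instead of driving xte/scrot.
class MockPlatform implements CUAPlatform {
  actions: string[] = [];
  constructor(private layout: ScreenInfo[]) {}
  screens() { return this.layout; }
  click(x: number, y: number, screen = 0) {
    this.actions.push(`click ${x},${y} on screen ${screen}`);
  }
  screenshot(screen = 0) {
    this.actions.push(`screenshot screen ${screen}`);
    return `screen-${screen}.png`;
  }
}
```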
Implement the LLMProvider interface:
import type { LLMProvider, ProviderResponse, ToolDefinition, ToolResultContent } from '@computer-use-sdk/harness';
class MyProvider implements LLMProvider {
  constructor(private options: { model: string; apiKey: string }) {}

  reset(): void { /* clear conversation state */ }

  async sendTurn(options: {
    systemPrompt: string;
    userMessage?: string;
    toolResults?: Array<{ callId: string; content: ToolResultContent[] }>;
    tools: ToolDefinition[];
  }): Promise<ProviderResponse> {
    // Send to your LLM, parse response, return { toolCalls, text, done }
  }
}

Then use it:
const harness = new UniversalCUHarness({
provider: new MyProvider({ model: 'my-model', apiKey: '...' }),
platform: new LinuxPlatform(),
displayWidth: 1024,
displayHeight: 768,
multiScreen: true, // enables per-screen screenshots + screen param on tools
});

A 10-test suite across two monitors. Tests clicking targets on both screens, disambiguating identical elements across screens (e.g., "Save" on Screen 1 vs Screen 2), navigating between screens, and handling decoys. Supports three configurations:
- Two separate windows — one Firefox per screen (most realistic)
- Spanning Firefox — single window dragged across both monitors
- Asymmetric screens — different resolutions (1920x1080 + 1024x768)
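A key finding above is that screenshots must reach the model as user-message images, not just tool results. With a Chat Completions-style provider, a per-screen user message might be assembled like this (the function name and text labels are illustrative, not the SDK's actual message builder):

```typescript
// Sketch of the "user-message images" fix: attach per-screen screenshots
// to a user message in Chat Completions content-part format, instead of
// returning them only inside tool results.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

function buildScreenshotMessage(pngBase64PerScreen: string[]) {
  const content: ContentPart[] = [];
  pngBase64PerScreen.forEach((png, i) => {
    // Label each image so the model can tell screens apart, echoing the
    // burned-in "SCREEN N" labels on the screenshots themselves.
    content.push({ type: "text", text: `SCREEN ${i + 1}:` });
    content.push({
      type: "image_url",
      image_url: { url: `data:image/png;base64,${png}` },
    });
  });
  return { role: "user" as const, content };
}
```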
Self-calibrating: HTML pages report element positions via document.title using mozInnerScreenX/Y, so expected coordinates adapt automatically to any window layout.
48-button grid at known pixel positions. Measures coordinate accuracy and hit rate.
10 targets across 3 difficulty levels with decoy elements:
- Easy: Large colored buttons (RED, BLUE, GREEN, STAR)
- Medium: Text buttons to distinguish (Submit, Cancel, Delete)
- Hard: Tiny elements with decoys (OK vs ok vs Ok, case-sensitive checkboxes)
5-test suite that isolates WHERE a model fails:
- Vision — can it see/describe the screenshot?
- Color ID — can it identify visual properties?
- Location — can it estimate pixel coordinates?
- Click — can it click accurately on a target?
- Decoy — can it distinguish similar elements?
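The pixel offsets reported in the result tables amount to a Euclidean distance between the target center and the model's click. The SDK's exact scoring rule isn't shown here, so the pass threshold below is an assumed example:

```typescript
// Click accuracy as Euclidean pixel distance, rounded to whole pixels.
// The 10px pass threshold is illustrative, not the SDK's actual rule.
type Point = { x: number; y: number };

function clickErrorPx(target: Point, click: Point): number {
  return Math.round(Math.hypot(target.x - click.x, target.y - click.y));
}

function passes(target: Point, click: Point, thresholdPx = 10): boolean {
  return clickErrorPx(target, click) <= thresholdPx;
}
```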
- Node.js 22+
- Docker Desktop
git clone <repo-url>
cd computer-use-sdk
npm install
# Build all packages
npx tsc -p packages/core/tsconfig.json
npx tsc -p packages/harness/tsconfig.json
npx tsc -p packages/eval/tsconfig.json
# Build Docker image
docker build -t computer-use-sdk -f docker/Dockerfile docker/

# Asymmetric dual-screen: 1920x1080 primary + 1024x768 secondary
docker run -d --name cua-test -p 5901:5901 \
-e DUAL_MONITOR=true \
-e RESOLUTION=1920x1080 \
-e SECOND_RESOLUTION=1024x768 \
-v "$(pwd)/packages:/opt/computer-use-sdk/packages" \
-v "$(pwd)/results:/opt/computer-use-sdk/results" \
-v "$(pwd)/node_modules:/opt/computer-use-sdk/node_modules" \
-v "$(pwd)/package.json:/opt/computer-use-sdk/package.json" \
-v "$(pwd)/docker/targets:/opt/targets" \
--shm-size=256m computer-use-sdk
# View the desktop via noVNC (browser-based)
pip3 install websockify
cd /tmp && git clone --depth 1 https://github.com/novnc/noVNC.git
websockify --web /tmp/noVNC 6080 127.0.0.1:5901 &
open http://localhost:6080/vnc.html?autoconnect=true
# Launch spanning Firefox across both screens
docker exec -e DISPLAY=:1 cua-test bash -c '
firefox-esr --no-remote file:///opt/targets/cross-screen-span.html &'
sleep 8
docker exec -e DISPLAY=:1 cua-test bash -c '
WID=$(xdotool search --class Firefox | tail -1)
xdotool windowmove "$WID" 20 20
xdotool windowsize "$WID" 2900 700'
# Or two separate windows (one per screen)
docker exec -e DISPLAY=:1 cua-test bash -c '
firefox-esr --no-remote file:///opt/targets/cross-screen-left.html &'
sleep 6
docker exec -e DISPLAY=:1 cua-test bash -c '
mkdir -p /tmp/ff-profile2
firefox-esr --no-remote -profile /tmp/ff-profile2 \
file:///opt/targets/cross-screen-right.html &'

# Run the cross-screen eval
docker exec -e DISPLAY=:1 \
-e SCREEN_CONFIG="1920x1080+0+0,1024x768+1920+0" \
-e AWS_ACCESS_KEY_ID=... \
-e AWS_SECRET_ACCESS_KEY=... \
-e AWS_REGION=us-east-1 \
-e OPENAI_API_KEY=sk-... \
cua-test bash -c \
'cd /opt/computer-use-sdk && npx tsx packages/eval/src/run-cross-screen.ts'

Results are saved to results/ as ATIF trajectories and can be uploaded to trajectories.sh:
npx trajectories-sh auth login --api-key YOUR_KEY
npx trajectories-sh upload trajectory ./results/cross-screen-... \
--visibility public --slug "my-eval"

All eval results are uploaded to trajectories.sh in ATIF format with per-screen screenshots:
- Cross-Screen Span (asymmetric) — 1920x1080 + 1024x768, spanning Firefox
- Cross-Screen Two Windows — separate Firefox per screen
- Cross-Screen with screen field — ATIF includes a screen field in tool calls
- Multi-Screen Ablation — per-screen screenshots in every step
packages/
core/ # CUA primitives (screenshot, click, display mapping)
src/
types.ts # CUAPlatform, ScreenInfo, MultiScreenshot interfaces
display.ts # Coordinate mapping (model space ↔ screen space)
platform-linux.ts # xte/scrot implementation, multi-screen capture
recorder.ts # Screenshot-based MP4 recording
harness/ # LLM integration layer
src/
providers/
types.ts # LLMProvider interface
openai.ts # OpenAI Chat Completions
bedrock-anthropic.ts # Bedrock InvokeModel
universal-cu.ts # Model-agnostic harness (recommended)
invoke-model-native.ts # Anthropic native CU via Bedrock
openai-native-cu.ts # OpenAI Responses API native CU
eval/ # Evaluation framework
src/
tasks/
click-calibration.ts # Grid-based coordinate accuracy
visual-reasoning.ts # Multi-difficulty visual targets
ablation.ts # Diagnostic test suite
run-cross-screen.ts # Cross-screen eval (self-calibrating)
run-atif-multiscreen.ts # Multi-screen ATIF comparison
run-ablation.ts # Ablation runner
run-visual.ts # Visual reasoning runner
atif.ts # ATIF trajectory writer
docker/ # Simulation environment
Dockerfile # Ubuntu + openbox + VNC + Firefox ESR
scripts/start-vnc.sh # Single/dual/asymmetric monitor startup
targets/
cross-screen-span.html # Single Firefox spanning both screens
cross-screen-left.html # Screen 1 targets (separate window mode)
cross-screen-right.html # Screen 2 targets (separate window mode)
calibration-grid.html # 48-button click target grid
visual-reasoning.html # Difficulty-tiered visual targets
results/ # Eval output (ATIF trajectories, JSON, MP4 videos)
This SDK was built to solve coordinate accuracy problems in the Refresh Editor, a VS Code fork with a built-in computer-use agent. The editor's dual-monitor setup (4480x1440 scaled to 1280x411) caused consistent click drift.
Investigation in this SDK proved the issue was aspect ratio compression, not a model limitation. At standard resolutions, both Claude and GPT-5.4 click with 0-3px accuracy. The solution for multi-monitor: per-screen screenshots with independent coordinate spaces, self-calibrating HTML targets, and page refresh between model runs.
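That drift follows directly from the coordinate mapping: when the model reasons in a downscaled space, every pixel of model-space error inflates by the scale factor on the real display. A minimal sketch of the mapping (an illustration, not the actual display.ts implementation):

```typescript
// Map a click from the model's downscaled coordinate space back to the
// physical screen. With a 4480x1440 desktop downscaled to 1280x411, the
// scale factor is 3.5, so a 1px model-space miss becomes a ~3.5px
// screen-space miss -- the "click drift" described above.
function modelToScreen(
  x: number, y: number,
  model: { w: number; h: number },
  screen: { w: number; h: number },
): { x: number; y: number } {
  return {
    x: Math.round((x * screen.w) / model.w),
    y: Math.round((y * screen.h) / model.h),
  };
}
```

Per-screen screenshots sidestep this: each screen gets its own full-resolution (or mildly scaled) coordinate space, keeping the error multiplier near 1.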