An evaluation and benchmarking framework for computer-use capabilities across AI models. Tests click accuracy, screen selection, visual reasoning, and multi-monitor coordination in controlled Docker environments.
On the 5-test ablation suite, both models achieve near-perfect accuracy with the universal harness:
| Model | Vision | Color | Location | Click | Decoy | Score |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | PASS | PASS | 0px | 0px | 0px | 5/5 |
| GPT-5.4 | PASS | PASS | 2px | 2px | 1px | 5/5 |
Two Firefox windows, one per screen. Tests screen selection and cross-screen clicking:
| Model | S1 Clicks | S2 Clicks | Disambiguation | Multi-click | Score |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 0-3px | 1-4px | 0px (Save S1 vs S2) | S1→S2 2px | 10/10 |
| GPT-5.4 | 1px | 0-8px | 2px | S1→S2 2px | 9/10 |
Single Firefox window dragged across two asymmetric monitors. This is the hardest scenario — the monitors differ in resolution and scaling, and the model must determine which screen each element falls on:
| Model | S1 Clicks (1280x720 scaled) | S2 Clicks (1024x768 native) | Screen Selection | Score |
|---|---|---|---|---|
| Claude Opus 4.6 | 1-3px | 1-3px | 100% correct | 9/10 |
| GPT-5.4 | 1-17px | 0-5px | 100% correct | 9/10 |
- Both models click with 0-3px accuracy when screenshots are passed correctly
- The harness matters more than the model — GPT-5.4 went from 0% to 100% by fixing how screenshots are passed (user-message images, not just tool results)
- Multi-screen works — per-screen screenshots with burned "SCREEN N" labels give near-perfect screen selection with 0-5px coordinate accuracy
- Asymmetric resolutions work — 1920x1080 + 1024x768 with screen scaling (1.5x) doesn't degrade accuracy
- Self-calibration is essential — the HTML pages report their own element positions via document.title, so the eval works regardless of Firefox chrome, window position, or screen layout
- Page state must be reset between models — without refreshing, accumulated clicks/scrolls from model A corrupt calibration for model B
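The eval side of that self-calibration handshake can be sketched as a title parser: the page encodes element positions in document.title and the runner reads them back. The title format below is hypothetical (the real pages' encoding isn't shown here); it only illustrates that expected click targets come from the page itself, never from hard-coded window geometry.

```typescript
// Hypothetical title format, e.g. "CAL|save=812,430|cancel=910,431".
// The actual pages may encode positions differently; this is a sketch.
type Point = { x: number; y: number };

function parseCalibrationTitle(title: string): Record<string, Point> {
  const targets: Record<string, Point> = {};
  // Skip the leading marker, then parse each "name=x,y" segment.
  for (const part of title.split("|").slice(1)) {
    const [name, coords] = part.split("=");
    const [x, y] = coords.split(",").map(Number);
    targets[name] = { x, y };
  }
  return targets;
}
```

Because the page recomputes these positions whenever it loads, refreshing between model runs (the reset rule above) is enough to restore a clean calibration baseline.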
┌─────────────────────────────────────────┐
│ LLMProvider (interface) │
│ Implement sendTurn() to add a provider │
├─────────────┬─────────────┬─────────────┤
│ OpenAI │ Bedrock │ Add yours: │
│ Provider │ Anthropic │ implement │
│ (Chat │ Provider │ sendTurn() │
│ Completions)│ (InvokeModel│ ~50 lines │
│ │ API) │ │
└──────┬──────┴──────┬──────┴──────┬──────┘
└─────────────┼─────────────┘
▼
┌─────────────────────────┐
│ UniversalCUHarness │
│ - Tool definitions │
│ - Agent loop │
│ - Tool execution │
│ - Step tracking │
│ - Multi-screen support │
└────────────┬────────────┘
▼
┌─────────────────────────┐
│ CUAPlatform │
│ LinuxPlatform (Docker) │
│ - xte for mouse/kbd │
│ - scrot for screenshots│
│ - ImageMagick for crop │
│ - Per-screen capture │
└─────────────────────────┘
▼
┌─────────────────────────┐
│ Docker Container │
│ Ubuntu + openbox + VNC │
│ Asymmetric dual-screen │
│ Self-calibrating HTML │
└─────────────────────────┘
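The platform layer in the diagram can be approximated as a small interface. The method names below are illustrative assumptions (the source only names CUAPlatform, ScreenInfo, and LinuxPlatform); the point is that because the harness talks to an interface, the Docker/xte backend and a test double are interchangeable.

```typescript
// Sketch only -- the SDK's real CUAPlatform/ScreenInfo signatures may differ.
interface ScreenInfo { index: number; width: number; height: number; xOffset: number }

interface CUAPlatform {
  screens(): ScreenInfo[];
  click(x: number, y: number, screen?: number): void;
  screenshot(screen?: number): string; // e.g. a file path or base64 PNG
}

// A test double that records actions instead of driving xte/scrot.
class MockPlatform implements CUAPlatform {
  actions: string[] = [];
  constructor(private layout: ScreenInfo[]) {}
  screens() { return this.layout; }
  click(x: number, y: number, screen = 0) {
    this.actions.push(`click ${x},${y} on screen ${screen}`);
  }
  screenshot(screen = 0) {
    this.actions.push(`screenshot screen ${screen}`);
    return `screen-${screen}.png`;
  }
}
```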
Implement the LLMProvider interface:
import type { LLMProvider, ProviderResponse, ToolDefinition, ToolResultContent } from '@computer-use-sdk/harness';
class MyProvider implements LLMProvider {
  constructor(private options: { model: string; apiKey: string }) {}

  reset(): void { /* clear conversation state */ }

  async sendTurn(options: {
    systemPrompt: string;
    userMessage?: string;
    toolResults?: Array<{ callId: string; content: ToolResultContent[] }>;
    tools: ToolDefinition[];
  }): Promise<ProviderResponse> {
    // Send to your LLM, parse response, return { toolCalls, text, done }
  }
}

Then use it:
const harness = new UniversalCUHarness({
provider: new MyProvider({ model: 'my-model', apiKey: '...' }),
platform: new LinuxPlatform(),
displayWidth: 1024,
displayHeight: 768,
multiScreen: true, // enables per-screen screenshots + screen param on tools
});

A 10-test suite across two monitors. Tests clicking targets on both screens, disambiguating identical elements across screens (e.g., "Save" on Screen 1 vs Screen 2), navigating between screens, and handling decoys. Supports three configurations:
- Two separate windows — one Firefox per screen (most realistic)
- Spanning Firefox — single window dragged across both monitors
- Asymmetric screens — different resolutions (1920x1080 + 1024x768)
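A key finding above is that screenshots must reach the model as user-message images, not just tool results. With a Chat Completions-style provider, a per-screen user message might be assembled like this (the function name and text labels are illustrative, not the SDK's actual message builder):

```typescript
// Sketch of the "user-message images" fix: attach per-screen screenshots
// to a user message in Chat Completions content-part format, instead of
// returning them only inside tool results.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

function buildScreenshotMessage(pngBase64PerScreen: string[]) {
  const content: ContentPart[] = [];
  pngBase64PerScreen.forEach((png, i) => {
    // Label each image so the model can tell screens apart, echoing the
    // burned-in "SCREEN N" labels on the screenshots themselves.
    content.push({ type: "text", text: `SCREEN ${i + 1}:` });
    content.push({
      type: "image_url",
      image_url: { url: `data:image/png;base64,${png}` },
    });
  });
  return { role: "user" as const, content };
}
```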
Self-calibrating: HTML pages report element positions via document.title using mozInnerScreenX/Y, so expected coordinates adapt automatically to any window layout.
48-button grid at known pixel positions. Measures coordinate accuracy and hit rate.
10 targets across 3 difficulty levels with decoy elements:
- Easy: Large colored buttons (RED, BLUE, GREEN, STAR)
- Medium: Text buttons to distinguish (Submit, Cancel, Delete)
- Hard: Tiny elements with decoys (OK vs ok vs Ok, case-sensitive checkboxes)
5-test suite that isolates WHERE a model fails:
- Vision — can it see/describe the screenshot?
- Color ID — can it identify visual properties?
- Location — can it estimate pixel coordinates?
- Click — can it click accurately on a target?
- Decoy — can it distinguish similar elements?
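The pixel offsets reported in the result tables amount to a Euclidean distance between the target center and the model's click. The SDK's exact scoring rule isn't shown here, so the pass threshold below is an assumed example:

```typescript
// Click accuracy as Euclidean pixel distance, rounded to whole pixels.
// The 10px pass threshold is illustrative, not the SDK's actual rule.
type Point = { x: number; y: number };

function clickErrorPx(target: Point, click: Point): number {
  return Math.round(Math.hypot(target.x - click.x, target.y - click.y));
}

function passes(target: Point, click: Point, thresholdPx = 10): boolean {
  return clickErrorPx(target, click) <= thresholdPx;
}
```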
- Node.js 22+
- Docker Desktop
git clone <repo-url>
cd computer-use-sdk
npm install
# Build all packages
npx tsc -p packages/core/tsconfig.json
npx tsc -p packages/harness/tsconfig.json
npx tsc -p packages/eval/tsconfig.json
# Build Docker image
docker build -t computer-use-sdk -f docker/Dockerfile docker/

# Asymmetric dual-screen: 1920x1080 primary + 1024x768 secondary
docker run -d --name cua-test -p 5901:5901 \
-e DUAL_MONITOR=true \
-e RESOLUTION=1920x1080 \
-e SECOND_RESOLUTION=1024x768 \
-v "$(pwd)/packages:/opt/computer-use-sdk/packages" \
-v "$(pwd)/results:/opt/computer-use-sdk/results" \
-v "$(pwd)/node_modules:/opt/computer-use-sdk/node_modules" \
-v "$(pwd)/package.json:/opt/computer-use-sdk/package.json" \
-v "$(pwd)/docker/targets:/opt/targets" \
--shm-size=256m computer-use-sdk
# View the desktop via noVNC (browser-based)
pip3 install websockify
cd /tmp && git clone --depth 1 https://github.com/novnc/noVNC.git
websockify --web /tmp/noVNC 6080 127.0.0.1:5901 &
open http://localhost:6080/vnc.html?autoconnect=true
# Launch spanning Firefox across both screens
docker exec -e DISPLAY=:1 cua-test bash -c '
firefox-esr --no-remote file:///opt/targets/cross-screen-span.html &'
sleep 8
docker exec -e DISPLAY=:1 cua-test bash -c '
WID=$(xdotool search --class Firefox | tail -1)
xdotool windowmove "$WID" 20 20
xdotool windowsize "$WID" 2900 700'
# Or two separate windows (one per screen)
docker exec -e DISPLAY=:1 cua-test bash -c '
firefox-esr --no-remote file:///opt/targets/cross-screen-left.html &'
sleep 6
docker exec -e DISPLAY=:1 cua-test bash -c '
mkdir -p /tmp/ff-profile2
firefox-esr --no-remote -profile /tmp/ff-profile2 \
file:///opt/targets/cross-screen-right.html &'

# Run the cross-screen eval
docker exec -e DISPLAY=:1 \
-e SCREEN_CONFIG="1920x1080+0+0,1024x768+1920+0" \
-e AWS_ACCESS_KEY_ID=... \
-e AWS_SECRET_ACCESS_KEY=... \
-e AWS_REGION=us-east-1 \
-e OPENAI_API_KEY=sk-... \
cua-test bash -c \
'cd /opt/computer-use-sdk && npx tsx packages/eval/src/run-cross-screen.ts'

Results are saved to results/ as ATIF trajectories and can be uploaded to trajectories.sh:
npx trajectories-sh auth login --api-key YOUR_KEY
npx trajectories-sh upload trajectory ./results/cross-screen-... \
--visibility public --slug "my-eval"

All eval results are uploaded to trajectories.sh in ATIF format with per-screen screenshots:
- Cross-Screen Span (asymmetric) — 1920x1080 + 1024x768, spanning Firefox
- Cross-Screen Two Windows — separate Firefox per screen
- Cross-Screen with screen field — ATIF includes a screen field in tool calls
- Multi-Screen Ablation — per-screen screenshots in every step
packages/
core/ # CUA primitives (screenshot, click, display mapping)
src/
types.ts # CUAPlatform, ScreenInfo, MultiScreenshot interfaces
display.ts # Coordinate mapping (model space ↔ screen space)
platform-linux.ts # xte/scrot implementation, multi-screen capture
recorder.ts # Screenshot-based MP4 recording
harness/ # LLM integration layer
src/
providers/
types.ts # LLMProvider interface
openai.ts # OpenAI Chat Completions
bedrock-anthropic.ts # Bedrock InvokeModel
universal-cu.ts # Model-agnostic harness (recommended)
invoke-model-native.ts # Anthropic native CU via Bedrock
openai-native-cu.ts # OpenAI Responses API native CU
eval/ # Evaluation framework
src/
tasks/
click-calibration.ts # Grid-based coordinate accuracy
visual-reasoning.ts # Multi-difficulty visual targets
ablation.ts # Diagnostic test suite
run-cross-screen.ts # Cross-screen eval (self-calibrating)
run-atif-multiscreen.ts # Multi-screen ATIF comparison
run-ablation.ts # Ablation runner
run-visual.ts # Visual reasoning runner
atif.ts # ATIF trajectory writer
docker/ # Simulation environment
Dockerfile # Ubuntu + openbox + VNC + Firefox ESR
scripts/start-vnc.sh # Single/dual/asymmetric monitor startup
targets/
cross-screen-span.html # Single Firefox spanning both screens
cross-screen-left.html # Screen 1 targets (separate window mode)
cross-screen-right.html # Screen 2 targets (separate window mode)
calibration-grid.html # 48-button click target grid
visual-reasoning.html # Difficulty-tiered visual targets
results/ # Eval output (ATIF trajectories, JSON, MP4 videos)
This SDK was built to solve coordinate accuracy problems in the Refresh Editor, a VS Code fork with a built-in computer-use agent. The editor's dual-monitor setup (4480x1440 scaled to 1280x411) caused consistent click drift.
Investigation in this SDK proved the issue was aspect ratio compression, not a model limitation. At standard resolutions, both Claude and GPT-5.4 click with 0-3px accuracy. The solution for multi-monitor: per-screen screenshots with independent coordinate spaces, self-calibrating HTML targets, and page refresh between model runs.
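That drift follows directly from the coordinate mapping: when the model reasons in a downscaled space, every pixel of model-space error inflates by the scale factor on the real display. A minimal sketch of the mapping (an illustration, not the actual display.ts implementation):

```typescript
// Map a click from the model's downscaled coordinate space back to the
// physical screen. With a 4480x1440 desktop downscaled to 1280x411, the
// scale factor is 3.5, so a 1px model-space miss becomes a ~3.5px
// screen-space miss -- the "click drift" described above.
function modelToScreen(
  x: number, y: number,
  model: { w: number; h: number },
  screen: { w: number; h: number },
): { x: number; y: number } {
  return {
    x: Math.round((x * screen.w) / model.w),
    y: Math.round((y * screen.h) / model.h),
  };
}
```

Per-screen screenshots sidestep this: each screen gets its own full-resolution (or mildly scaled) coordinate space, keeping the error multiplier near 1.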