refreshdotdev/computer-use-sdk-for-editor
Computer Use SDK

An evaluation and benchmarking framework for computer-use capabilities across AI models. Tests click accuracy, screen selection, visual reasoning, and multi-monitor coordination in controlled Docker environments.

Results

Single Screen (1024x768)

Both models achieve near-perfect accuracy with the universal harness:

| Model | Vision | Color | Location | Click | Decoy | Score |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | PASS | PASS | 0px | 0px | 0px | 5/5 |
| GPT-5.4 | PASS | PASS | 2px | 2px | 1px | 5/5 |

Multi-Screen — Two Separate Windows (1024x768 + 1024x768)

Two Firefox windows, one per screen. Tests screen selection and cross-screen clicking:

| Model | S1 Clicks | S2 Clicks | Disambiguation | Multi-click | Score |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 0-3px | 1-4px | 0px (Save S1 vs S2) | S1→S2 2px | 10/10 |
| GPT-5.4 | 1px | 0-8px | 2px | S1→S2 2px | 9/10 |

Multi-Screen — Spanning Firefox (1920x1080 + 1024x768)

Single Firefox window dragged across two asymmetric monitors. This is the hardest scenario: the resolutions differ, screen scaling applies, and the model must determine which screen each element falls on:

| Model | S1 Clicks (1280x720 scaled) | S2 Clicks (1024x768 native) | Screen Selection | Score |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 1-3px | 1-3px | 100% correct | 9/10 |
| GPT-5.4 | 1-17px | 0-5px | 100% correct | 9/10 |

Key Findings

  • Both models click with 0-3px accuracy when screenshots are passed correctly
  • The harness matters more than the model — GPT-5.4 went from 0% to 100% by fixing how screenshots are passed (user-message images, not just tool results)
  • Multi-screen works — per-screen screenshots with burned "SCREEN N" labels give near-perfect screen selection with 0-5px coordinate accuracy
  • Asymmetric resolutions work — 1920x1080 + 1024x768 with screen scaling (1.5x) doesn't degrade accuracy
  • Self-calibration is essential — the HTML pages report their own element positions via document.title, so the eval works regardless of Firefox chrome, window position, or screen layout
  • Page state must be reset between models — without refreshing, accumulated clicks/scrolls from model A corrupt calibration for model B
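The screen-selection finding above boils down to a point-in-rectangle lookup: given each screen's position in a shared global coordinate space, decide which screen a click lands on. This is an illustrative sketch, not the SDK's actual `display.ts` implementation; the names `ScreenRect` and `screenForPoint` are hypothetical.

```typescript
// Hypothetical sketch of screen selection: given screen rectangles in a
// shared global coordinate space, find the screen containing a point.
interface ScreenRect {
  index: number;  // 1-based, matching the burned-in "SCREEN N" labels
  x: number;      // origin of this screen in global coordinates
  y: number;
  width: number;
  height: number;
}

function screenForPoint(screens: ScreenRect[], px: number, py: number): ScreenRect | undefined {
  return screens.find(
    (s) => px >= s.x && px < s.x + s.width && py >= s.y && py < s.y + s.height,
  );
}

// The asymmetric layout from the evals: 1920x1080 primary, 1024x768 to its right.
const layout: ScreenRect[] = [
  { index: 1, x: 0, y: 0, width: 1920, height: 1080 },
  { index: 2, x: 1920, y: 0, width: 1024, height: 768 },
];

console.log(screenForPoint(layout, 2000, 100)?.index); // 2
```

Note that with asymmetric screens a point can fall on neither rectangle (e.g. below the shorter secondary monitor), which is exactly where spanning-window layouts get tricky.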

Architecture

┌─────────────────────────────────────────┐
│  LLMProvider (interface)                │
│  Implement sendTurn() to add a provider │
├─────────────┬─────────────┬─────────────┤
│ OpenAI      │ Bedrock     │ Add yours:  │
│ Provider    │ Anthropic   │ implement   │
│ (Chat       │ Provider    │ sendTurn()  │
│ Completions)│ (InvokeModel│ ~50 lines   │
│             │ API)        │             │
└──────┬──────┴──────┬──────┴──────┬──────┘
       └─────────────┼─────────────┘
                     ▼
       ┌─────────────────────────┐
       │  UniversalCUHarness     │
       │  - Tool definitions     │
       │  - Agent loop           │
       │  - Tool execution       │
       │  - Step tracking        │
       │  - Multi-screen support │
       └────────────┬────────────┘
                    ▼
       ┌─────────────────────────┐
       │  CUAPlatform            │
       │  LinuxPlatform (Docker) │
        │  - xte for mouse/kbd    │
       │  - scrot for screenshots│
       │  - ImageMagick for crop │
       │  - Per-screen capture   │
       └─────────────────────────┘
                    ▼
       ┌─────────────────────────┐
       │  Docker Container       │
       │  Ubuntu + openbox + VNC │
       │  Asymmetric dual-screen │
       │  Self-calibrating HTML  │
       └─────────────────────────┘
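The `CUAPlatform` seam in the diagram is what makes the harness testable without Docker or X11. The real interface lives in `packages/core/src/types.ts` and may differ; the shape below is guessed from the diagram, so treat it as a sketch.

```typescript
// Hypothetical sketch of the CUAPlatform seam; the real interface in
// packages/core/src/types.ts may have different names and signatures.
interface ScreenInfo { index: number; width: number; height: number; }

interface CUAPlatform {
  screenshot(screen?: number): Promise<Uint8Array>; // per-screen capture
  click(x: number, y: number, screen?: number): Promise<void>;
  listScreens(): Promise<ScreenInfo[]>;
}

// A recording mock is enough to exercise a harness end-to-end in unit tests.
class MockPlatform implements CUAPlatform {
  clicks: Array<{ x: number; y: number; screen?: number }> = [];
  async screenshot(): Promise<Uint8Array> { return new Uint8Array(0); }
  async click(x: number, y: number, screen?: number): Promise<void> {
    this.clicks.push({ x, y, screen });
  }
  async listScreens(): Promise<ScreenInfo[]> {
    return [
      { index: 1, width: 1920, height: 1080 },
      { index: 2, width: 1024, height: 768 },
    ];
  }
}

const mock = new MockPlatform();
mock.click(10, 20, 2);
console.log(mock.clicks.length); // 1
```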

Adding a new provider

Implement the LLMProvider interface:

import type { LLMProvider, ProviderResponse, ToolDefinition, ToolResultContent } from '@computer-use-sdk/harness';

class MyProvider implements LLMProvider {
  reset(): void { /* clear conversation state */ }

  async sendTurn(options: {
    systemPrompt: string;
    userMessage?: string;
    toolResults?: Array<{ callId: string; content: ToolResultContent[] }>;
    tools: ToolDefinition[];
  }): Promise<ProviderResponse> {
    // Send the turn to your LLM, parse the response, then return
    // { toolCalls, text, done }. Placeholder until implemented:
    throw new Error('not implemented');
  }
}

Then use it:

const harness = new UniversalCUHarness({
  provider: new MyProvider({ model: 'my-model', apiKey: '...' }),
  platform: new LinuxPlatform(),
  displayWidth: 1024,
  displayHeight: 768,
  multiScreen: true,  // enables per-screen screenshots + screen param on tools
});

Evals

Cross-Screen Interaction (new)

10-test suite across two monitors. Tests clicking targets on both screens, disambiguating identical elements across screens (e.g., "Save" on Screen 1 vs Screen 2), navigating between screens, and handling decoys. Supports three configurations:

  • Two separate windows — one Firefox per screen (most realistic)
  • Spanning Firefox — single window dragged across both monitors
  • Asymmetric screens — different resolutions (1920x1080 + 1024x768)

Self-calibrating: HTML pages report element positions via document.title using mozInnerScreenX/Y, so expected coordinates adapt automatically to any window layout.
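On the harness side, the title string has to be parsed back into expected coordinates. The README does not show the actual encoding, so the sketch below assumes a hypothetical `name:x,y;name:x,y` format produced by the HTML pages from `mozInnerScreenX`/`mozInnerScreenY` plus each element's bounding box.

```typescript
// Assumed (hypothetical) title encoding: "name:x,y;name:x,y". The real
// cross-screen pages may use a different format.
function parseCalibrationTitle(title: string): Map<string, { x: number; y: number }> {
  const targets = new Map<string, { x: number; y: number }>();
  for (const entry of title.split(';')) {
    const match = entry.trim().match(/^([^:]+):(-?\d+),(-?\d+)$/);
    if (match) targets.set(match[1], { x: Number(match[2]), y: Number(match[3]) });
  }
  return targets;
}

const expected = parseCalibrationTitle('save-s1:312,540;save-s2:2210,400');
console.log(expected.get('save-s2')); // { x: 2210, y: 400 }
```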

Click Calibration

48-button grid at known pixel positions. Measures coordinate accuracy and hit rate.

Visual Reasoning

10 targets across 3 difficulty levels with decoy elements:

  • Easy: Large colored buttons (RED, BLUE, GREEN, STAR)
  • Medium: Text buttons to distinguish (Submit, Cancel, Delete)
  • Hard: Tiny elements with decoys (OK vs ok vs Ok, case-sensitive checkboxes)

Ablation Diagnostic

5-test suite that isolates *where* a model fails:

  1. Vision — can it see/describe the screenshot?
  2. Color ID — can it identify visual properties?
  3. Location — can it estimate pixel coordinates?
  4. Click — can it click accurately on a target?
  5. Decoy — can it distinguish similar elements?
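The value of running the stages in this order is that the first failure tells you where the pipeline breaks. One way to score such a run (a sketch only; `run-ablation.ts` may score differently):

```typescript
// Illustrative scoring for the ablation suite: report the score and the
// first failing stage, which is where the model's pipeline breaks down.
const ABLATION_STAGES = ['vision', 'color', 'location', 'click', 'decoy'] as const;
type Stage = (typeof ABLATION_STAGES)[number];

function scoreAblation(results: Record<Stage, boolean>): { score: string; firstFailure?: Stage } {
  const passed = ABLATION_STAGES.filter((s) => results[s]);
  const firstFailure = ABLATION_STAGES.find((s) => !results[s]);
  return { score: `${passed.length}/${ABLATION_STAGES.length}`, firstFailure };
}

const run = scoreAblation({ vision: true, color: true, location: true, click: false, decoy: true });
console.log(run); // { score: '4/5', firstFailure: 'click' }
```

A model that passes Vision, Color, and Location but fails Click points at a harness or coordinate-mapping bug rather than a perception problem, which matches the "harness matters more than the model" finding above.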

Getting Started

Prerequisites

  • Node.js 22+
  • Docker Desktop

Setup

git clone <repo-url>
cd computer-use-sdk
npm install

# Build all packages
npx tsc -p packages/core/tsconfig.json
npx tsc -p packages/harness/tsconfig.json
npx tsc -p packages/eval/tsconfig.json

# Build Docker image
docker build -t computer-use-sdk -f docker/Dockerfile docker/

Run dual-monitor simulation

# Asymmetric dual-screen: 1920x1080 primary + 1024x768 secondary
docker run -d --name cua-test -p 5901:5901 \
  -e DUAL_MONITOR=true \
  -e RESOLUTION=1920x1080 \
  -e SECOND_RESOLUTION=1024x768 \
  -v "$(pwd)/packages:/opt/computer-use-sdk/packages" \
  -v "$(pwd)/results:/opt/computer-use-sdk/results" \
  -v "$(pwd)/node_modules:/opt/computer-use-sdk/node_modules" \
  -v "$(pwd)/package.json:/opt/computer-use-sdk/package.json" \
  -v "$(pwd)/docker/targets:/opt/targets" \
  --shm-size=256m computer-use-sdk

# View the desktop via noVNC (browser-based)
pip3 install websockify
cd /tmp && git clone --depth 1 https://github.com/novnc/noVNC.git
websockify --web /tmp/noVNC 6080 127.0.0.1:5901 &
open http://localhost:6080/vnc.html?autoconnect=true

# Launch spanning Firefox across both screens
docker exec -e DISPLAY=:1 cua-test bash -c '
  firefox-esr --no-remote file:///opt/targets/cross-screen-span.html &'
sleep 8
docker exec -e DISPLAY=:1 cua-test bash -c '
  WID=$(xdotool search --class Firefox | tail -1)
  xdotool windowmove "$WID" 20 20
  xdotool windowsize "$WID" 2900 700'

# Or two separate windows (one per screen)
docker exec -e DISPLAY=:1 cua-test bash -c '
  firefox-esr --no-remote file:///opt/targets/cross-screen-left.html &'
sleep 6
docker exec -e DISPLAY=:1 cua-test bash -c '
  mkdir -p /tmp/ff-profile2
  firefox-esr --no-remote -profile /tmp/ff-profile2 \
    file:///opt/targets/cross-screen-right.html &'

Run the cross-screen eval

docker exec -e DISPLAY=:1 \
  -e SCREEN_CONFIG="1920x1080+0+0,1024x768+1920+0" \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  -e AWS_REGION=us-east-1 \
  -e OPENAI_API_KEY=sk-... \
  cua-test bash -c \
  'cd /opt/computer-use-sdk && npx tsx packages/eval/src/run-cross-screen.ts'
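The `SCREEN_CONFIG` value uses X11-style geometry strings (`WxH+X+Y`, comma-separated). A parser for it might look like the following; this is inferred from the example value above, not the SDK's own implementation.

```typescript
// Illustrative parser for X11-style geometry lists like
// "1920x1080+0+0,1024x768+1920+0" (format inferred, not the SDK's code).
interface ScreenGeometry { width: number; height: number; offsetX: number; offsetY: number; }

function parseScreenConfig(config: string): ScreenGeometry[] {
  return config.split(',').map((geom) => {
    const m = geom.trim().match(/^(\d+)x(\d+)\+(\d+)\+(\d+)$/);
    if (!m) throw new Error(`bad geometry: ${geom}`);
    return { width: +m[1], height: +m[2], offsetX: +m[3], offsetY: +m[4] };
  });
}

const screens = parseScreenConfig('1920x1080+0+0,1024x768+1920+0');
console.log(screens[1]); // { width: 1024, height: 768, offsetX: 1920, offsetY: 0 }
```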

Results are saved to results/ as ATIF trajectories and can be uploaded to trajectories.sh:

npx trajectories-sh auth login --api-key YOUR_KEY
npx trajectories-sh upload trajectory ./results/cross-screen-... \
  --visibility public --slug "my-eval"

Trajectories

All eval results are uploaded to trajectories.sh in ATIF format with per-screen screenshots.

Project Structure

packages/
  core/           # CUA primitives (screenshot, click, display mapping)
    src/
      types.ts           # CUAPlatform, ScreenInfo, MultiScreenshot interfaces
      display.ts         # Coordinate mapping (model space ↔ screen space)
      platform-linux.ts  # xte/scrot implementation, multi-screen capture
      recorder.ts        # Screenshot-based MP4 recording
  harness/        # LLM integration layer
    src/
      providers/
        types.ts              # LLMProvider interface
        openai.ts             # OpenAI Chat Completions
        bedrock-anthropic.ts  # Bedrock InvokeModel
      universal-cu.ts         # Model-agnostic harness (recommended)
      invoke-model-native.ts  # Anthropic native CU via Bedrock
      openai-native-cu.ts     # OpenAI Responses API native CU
  eval/           # Evaluation framework
    src/
      tasks/
        click-calibration.ts   # Grid-based coordinate accuracy
        visual-reasoning.ts    # Multi-difficulty visual targets
        ablation.ts            # Diagnostic test suite
      run-cross-screen.ts     # Cross-screen eval (self-calibrating)
      run-atif-multiscreen.ts # Multi-screen ATIF comparison
      run-ablation.ts         # Ablation runner
      run-visual.ts           # Visual reasoning runner
      atif.ts                 # ATIF trajectory writer
docker/           # Simulation environment
  Dockerfile           # Ubuntu + openbox + VNC + Firefox ESR
  scripts/start-vnc.sh # Single/dual/asymmetric monitor startup
  targets/
    cross-screen-span.html    # Single Firefox spanning both screens
    cross-screen-left.html    # Screen 1 targets (separate window mode)
    cross-screen-right.html   # Screen 2 targets (separate window mode)
    calibration-grid.html     # 48-button click target grid
    visual-reasoning.html     # Difficulty-tiered visual targets
results/          # Eval output (ATIF trajectories, JSON, MP4 videos)

Background

This SDK was built to solve coordinate accuracy problems in the Refresh Editor, a VS Code fork with a built-in computer-use agent. The editor's dual-monitor setup (4480x1440 scaled to 1280x411) caused consistent click drift.

Investigation in this SDK proved the issue was aspect ratio compression, not a model limitation. At standard resolutions, both Claude and GPT-5.4 click with 0-3px accuracy. The solution for multi-monitor: per-screen screenshots with independent coordinate spaces, self-calibrating HTML targets, and page refresh between model runs.
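The drift mechanism is easy to see in numbers. Mapping a model-space click back to native pixels multiplies any model error by the downscale ratio; the sketch below illustrates the kind of transform `display.ts` performs (names here are illustrative).

```typescript
// Minimal sketch of model-space -> native-space click mapping.
function modelToNative(
  click: { x: number; y: number },
  model: { width: number; height: number },
  native: { width: number; height: number },
): { x: number; y: number } {
  return {
    x: Math.round(click.x * (native.width / model.width)),
    y: Math.round(click.y * (native.height / model.height)),
  };
}

// The problematic whole-desktop case: 4480x1440 downscaled to 1280x411 means
// every model pixel covers ~3.5 native pixels, so a 1px model error becomes a
// multi-pixel native error. Per-screen screenshots keep this ratio near 1.
const hit = modelToNative({ x: 640, y: 205 }, { width: 1280, height: 411 }, { width: 4480, height: 1440 });
console.log(hit); // { x: 2240, y: 718 }
```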
