Merged
1 change: 1 addition & 0 deletions .skillsrc
Original file line number Diff line number Diff line change
@@ -4,6 +4,7 @@
droid-control/skills/agent-browser
droid-control/skills/capture
droid-control/skills/compose
+droid-control/skills/desktop-control
droid-control/skills/droid-cli
droid-control/skills/droid-control
droid-control/skills/pty-capture
2 changes: 1 addition & 1 deletion plugins/droid-control/.factory-plugin/plugin.json
@@ -1,5 +1,5 @@
{
"name": "droid-control",
-"description": "Terminal and browser automation for testing, demos, QA, and computer-use tasks",
+"description": "Terminal, browser, and native desktop automation for testing, demos, QA, and computer-use tasks",
"version": "1.0.0"
}
9 changes: 6 additions & 3 deletions plugins/droid-control/ARCHITECTURE.md
@@ -36,7 +36,7 @@ This is the first guardrail against agent drift. The droid does not start with "

| Route | Question | Examples |
|---|---|---|
-| **Target** | What are we driving? | Droid CLI, other terminal TUI, web/Electron app, raw PTY bytes |
+| **Target** | What are we driving? | Droid CLI, other terminal TUI, web/Electron app, native desktop app, raw PTY bytes |
| **Stage** | What does the workflow need? | capture, compose, verify |
| **Artifact** | Does compose need polish tools? | showcase presets, effects, keystroke overlays |

@@ -48,7 +48,7 @@ Each atom skill is a self-contained surface the droid reads at a specific point

| Atom type | Skills | Responsibility |
|---|---|---|
-| Driver atoms | `tuistory`, `true-input`, `agent-browser` | How to drive a class of environment. |
+| Driver atoms | `tuistory`, `true-input`, `agent-browser`, `desktop-control` | How to drive a class of environment. |
| Target atoms | `droid-cli`, `pty-capture` | Target-specific shortcuts, launch rules, and byte-capture patterns. |
| Stage atoms | `capture`, `compose`, `verify` | Lifecycle phases with explicit inputs and outputs. |
| Polish atom | `showcase` | Visual presets and cinematic layer guidance. |
@@ -119,7 +119,7 @@ Terminal workflows use `bin/tctl` as the only launch/control boundary. It hides

`tctl` also enforces Droid CLI launch invariants. `droid-dev` sessions must provide `--repo-root`, which lets `tctl` set `DROID_DEV_REPO_ROOT` and record provenance for the captured branch and commit.

-Browser and Electron workflows intentionally do **not** go through `tctl`; they use `agent-browser`, whose persistent Playwright-backed daemon is the right control boundary for DOM snapshots, screenshots, and CDP-connected apps.
+Browser/Electron and native-desktop workflows intentionally do **not** go through `tctl`. They have their own control boundaries: `agent-browser`'s persistent Playwright daemon for DOM snapshots, screenshots, and CDP-connected apps; `cua-driver`'s daemon for accessibility trees and per-`(pid, window_id)` element caches on desktop GUIs.

## Video composition

@@ -161,6 +161,9 @@ skills/true-input/platforms/macos.md
skills/pty-capture/platforms/linux.md
skills/pty-capture/platforms/windows.md
skills/pty-capture/platforms/macos.md
+skills/desktop-control/platforms/linux.md
+skills/desktop-control/platforms/windows.md
+skills/desktop-control/platforms/macos.md
```

A Linux droid reads Linux Wayland instructions. A Windows VM byte-capture task reads Windows KVM instructions. The system does not rely on the droid to skim irrelevant sections correctly.
4 changes: 3 additions & 1 deletion plugins/droid-control/README.md
@@ -89,6 +89,7 @@ The `render-showcase.sh` helper owns the full pipeline: `.cast` conversion via `
| true-input | Windows (KVM) | `libvirt`, `qemu`, KVM VM with SSH |
| true-input | macOS (QEMU) | `qemu`, `socat`, macOS VM with SSH |
| agent-browser | All | `agent-browser` |
+| desktop-control | All | `cua-driver` |
| compose | All | `ffmpeg`, `ffprobe`, `agg` |
| showcase | All | Node.js (>= 18), Chrome/Chromium |

@@ -98,7 +99,8 @@ pip install asciinema # terminal recording
cargo install --git https://github.com/asciinema/agg # .cast -> .gif converter
sudo apt-get install -y ffmpeg # video processing
agent-browser install # browser automation (downloads Chromium)
+curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh | bash # native desktop GUI automation
cd plugins/droid-control/remotion && npm install # Remotion video rendering
```

-Only install what you need for your use case. Terminal demos need tuistory, asciinema, agg, and ffmpeg. Web/Electron automation just needs agent-browser.
+Only install what you need for your use case. Terminal demos need tuistory, asciinema, agg, and ffmpeg. Web/Electron automation just needs agent-browser. Native desktop GUI automation just needs cua-driver.
7 changes: 4 additions & 3 deletions plugins/droid-control/skills/capture/SKILL.md
@@ -8,7 +8,7 @@ user-invocable: false

The orchestrator routed you here. This atom owns the full recording lifecycle: launch a target, execute an interaction script, collect raw outputs.

-You should already have a **driver atom** loaded (tuistory, true-input, or agent-browser) and optionally a **target atom** (droid-cli). This atom layers the recording discipline on top.
+You should already have a **driver atom** loaded (tuistory, true-input, agent-browser, or desktop-control) and optionally a **target atom** (droid-cli). This atom layers the recording discipline on top.

## Inputs

@@ -28,7 +28,7 @@ Before recording anything:
- Terminal size is consistent across all sessions (`--cols 120 --rows 36`)
- **Browser viewport size matches the composition layout** (see "Browser viewport sizing" below) — mismatched aspects letterbox in the final video
- Branch/worktree paths and env vars are correct
-- Recording format matches the driver: `.cast` for tuistory, `.mp4` for true-input, screenshots for agent-browser
+- Recording format matches the driver: `.cast` for tuistory, `.mp4` for true-input, screenshots for agent-browser, window PNGs / `recording.mp4` for desktop-control
- If comparing branches, both sessions use identical terminal / viewport dimensions and launch parameters
- For `droid-dev` captures, `--repo-root` is **mandatory** — `tctl` will refuse to launch without it
- **Color env vars are set** (see below)
@@ -137,6 +137,7 @@ Before handing off, confirm every expected output file exists and is non-empty:
| Visual rendering | Screenshots: `$TCTL -s <name> screenshot -o /tmp/proof-N.png` |
| Keyboard encoding | PTY bytes: `${DROID_PLUGIN_ROOT}/scripts/capture-terminal-bytes.py --backend <terminal> --combo <keys>` |
| Web/Electron | Screenshots: `agent-browser screenshot --annotate /tmp/proof-N.png` |
+| Native desktop GUI | Window screenshots + AX trees: `cua-driver get_window_state '{...}' --screenshot-out-file ${RUN_DIR}/proof-N.png`; video via `cua-driver recording start/stop` |
| Before/after | Run the same sequence on both branches at the same capture points |

## Outputs
@@ -148,7 +149,7 @@ Hand these to the **compose** stage:
- clips: [/tmp/before.cast, /tmp/after.cast]
- screenshots: [/tmp/proof-1.png, /tmp/proof-2.png]
- keys: /tmp/keys.tsv (if keystroke logging was requested)
-- driver: tuistory | true-input | agent-browser
+- driver: tuistory | true-input | agent-browser | desktop-control
- terminal_size: 120x36 # for tuistory / true-input
- viewport: 960x1000 # for agent-browser; report so compose knows the clip aspect
```
127 changes: 127 additions & 0 deletions plugins/droid-control/skills/desktop-control/SKILL.md
@@ -0,0 +1,127 @@
---
name: desktop-control
description: Background knowledge for droid-control workflows -- not invoked directly. Desktop-control driver mechanics for native GUI app automation via trycua cua-driver.
user-invocable: false
---

# Desktop-Control Driver

The orchestrator routed you here. Use these mechanics to execute your plan.

Drive native desktop GUI apps through upstream [trycua/cua](https://github.com/trycua/cua) `cua-driver`: enumerate apps and windows, snapshot accessibility trees, click/type/scroll by `element_index` or pixel coordinates, and verify by re-snapshot -- all without bringing the target to the foreground.

## When to use

- Automating a native desktop app (Finder, Notepad, System Settings, native editors)
- Driving native dialogs and security/permission sheets that no DOM or PTY can reach
- Visual QA of native UI: per-window screenshots, accessibility-tree assertions

If the target is a terminal TUI, use **tuistory** or **true-input**. If it is a web page or an Electron app, use **agent-browser** -- CDP beats accessibility trees for anything Chromium-based.

## Platform support

| Platform | Upstream tier | Read |
|---|---|---|
| macOS | Production | [platforms/macos.md](platforms/macos.md) |
| Windows | Production | [platforms/windows.md](platforms/windows.md) |
| Linux | Pre-release (real caveats) | [platforms/linux.md](platforms/linux.md) |

**Read the platform file for your target OS.** Each contains permissions, daemon launch, and platform-specific patterns and failure modes.

## Prerequisites

```bash
# one-time install: per-user, no sudo/admin
curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh | bash
# Windows (PowerShell):
# irm https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.ps1 | iex

cua-driver doctor # platform probes: permissions, daemon, accessibility plumbing
cua-driver skills install # fetch the upstream skill pack to ~/.cua-driver/skills/cua-driver
```

The upstream pack (`~/.cua-driver/skills/cua-driver/SKILL.md` + your platform's doc) is the deep reference -- full tool surface, window-state behavior matrix, forbidden-command lists -- and it updates with the binary. **Read it before any nontrivial workflow.** This atom owns the droid-control integration: routing, run isolation, delegation, evidence handoff.

## Daemon lifecycle

`element_index` workflows **require the daemon**. Without it each CLI invocation is a fresh process and the per-`(pid, window_id)` element cache dies between calls.

```bash
cua-driver serve # start the daemon (macOS needs the LaunchServices form -- see platforms/macos.md)
cua-driver status # daemon + socket health
cua-driver stop
```

Permissions are checked and granted through the driver, not by hand-editing system settings (macOS-only gate; a no-op surface on Windows/Linux):

```bash
cua-driver permissions status # read-only; answers via the running daemon
cua-driver permissions grant # attributed prompt flow -- the correct way to grant
```

## Core loop

Tool names are `snake_case` and invoked directly: `cua-driver <tool> '<json>'`. (`cua-driver call <tool>` is legacy; do not use it.) `cua-driver list-tools` for the inventory, `cua-driver describe <tool>` for any schema.

Every workflow is Discover -> Observe -> Act -> Verify against an explicit `(pid, window_id)`:

```bash
cua-driver launch_app '{"name":"TextEdit"}'
# -> {pid: 844, windows: [{window_id: 10725, ...}]} # list_windows only needed for long-lived pids
cua-driver get_window_state '{"pid":844,"window_id":10725}' --screenshot-out-file "${RUN_DIR}/before.png"
cua-driver click '{"pid":844,"window_id":10725,"element_index":14,"session":"'"${RUN_ID}"'-desktop"}'
cua-driver get_window_state '{"pid":844,"window_id":10725}' --screenshot-out-file "${RUN_DIR}/after.png"
```

**Snapshot before AND after every action.** The pre-action `get_window_state` resolves the `element_index` you are about to use -- indices are per-snapshot, per `(pid, window_id)`, and stale ones fail with `No cached AX state`. The post-action snapshot is the evidence the action landed; without it a silent no-op looks like success.
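
The verify half of that rule can be sketched as a small comparison step. This is a hypothetical helper, not part of `cua-driver`: it assumes you redirected the tool response of each `get_window_state` call to a file (for example `> "${RUN_DIR}/before.txt"`), and that the response carries the tree text. A byte-level comparison is deliberately crude; any volatile fields in the response would make it report a change even when the UI did not move, so treat "changed" as a prompt to read the after-snapshot, not as proof.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of cua-driver.
# Assumes each get_window_state response was saved to a file beforehand.

# action_landed <before-file> <after-file>
# Exit 0 if the snapshots differ, 1 if they are identical,
# 2 if either file is missing or empty.
action_landed() {
  local before="$1" after="$2"
  if [ ! -s "$before" ] || [ ! -s "$after" ]; then
    echo "missing or empty snapshot: $before / $after" >&2
    return 2
  fi
  if cmp -s "$before" "$after"; then
    # Identical snapshots: report it as a finding instead of assuming success.
    echo "UNCHANGED: action may have been a silent no-op"
    return 1
  fi
  echo "CHANGED: read the after-snapshot to confirm it is the intended change"
}
```

Calling `action_landed "${RUN_DIR}/before.txt" "${RUN_DIR}/after.txt"` after each action gives the worker an exit code to branch on and a line to include in its report.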

Addressing-mode preference:

1. **`element_index`** (default) -- semantic, works on hidden and backgrounded windows, no foreground change.
2. **Pixel** `click '{"pid":N,"window_id":W,"x":X,"y":Y}'` -- for surfaces the tree does not reach (canvases, custom-drawn controls). Coordinates are window-local screenshot pixels, top-left origin.
3. **Keyboard** (`press_key`, `hotkey`) and platform fallbacks -- last resort; see the platform files.

## Run isolation (ground rule 5 -> cua sessions)

cua sessions are the desktop equivalent of `tctl` session prefixes: a session owns its agent cursor, config overrides, and recording scope. Declare one per run, derived from the workflow's `RUN_ID`, and pass it on every action:

```bash
cua-driver start_session '{"session":"'"${RUN_ID}"'-desktop"}'
# ... every action carries "session":"${RUN_ID}-desktop" ...
cua-driver end_session '{"session":"'"${RUN_ID}"'-desktop"}'
```

Parallel workers each declare their **own session** and pass `creates_new_application_instance: true` to `launch_app` so each gets its own window. The element cache is keyed on `(pid, window_id)` and the cursor on `session`, so isolated workers cannot collide.
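
Because the session name is spliced into JSON by shell quoting, it helps to build it in one place. The following is a minimal sketch with hypothetical helper names (`session_name`, `session_payload`); the `-desktop` default suffix mirrors the example above, and the character whitelist is an assumption chosen so the value can never break the JSON string.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of cua-driver.

# session_name <run_id> [suffix]  ->  "<run_id>-<suffix>" (suffix defaults to "desktop")
session_name() {
  local run_id="$1" suffix="${2:-desktop}"
  if [ -z "$run_id" ]; then
    echo "session_name: run id is required" >&2
    return 1
  fi
  # Reject anything that could break out of a JSON string or a shell word.
  case "${run_id}${suffix}" in
    *[!A-Za-z0-9._-]*)
      echo "session_name: unsafe characters in '${run_id}-${suffix}'" >&2
      return 1 ;;
  esac
  printf '%s-%s\n' "$run_id" "$suffix"
}

# session_payload <run_id> [suffix]  ->  {"session":"<run_id>-<suffix>"}
session_payload() {
  local name
  name=$(session_name "$@") || return 1
  printf '{"session":"%s"}\n' "$name"
}
```

With these in scope, `cua-driver start_session "$(session_payload "$RUN_ID")"` and the matching `end_session` call use the same string, and a parallel worker can pass its own suffix (for example `notepad`) to stay distinct.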

## Delegation

`cua-driver` is on PATH -- workers need no `${DROID_PLUGIN_ROOT}` resolution. As with the other drivers, give capture workers **exact commands** with the parent's run scope baked in:

```
Task prompt for a desktop capture worker:
"Run these commands in order. Report screenshot paths and any errors.
1. cua-driver start_session '{"session":"1712345678-42-notepad"}'
2. cua-driver launch_app '{"name":"Notepad","creates_new_application_instance":true}'
-> note the returned pid and window_id
3. cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file /tmp/droid-run-1712345678-42-xxxx/before.png
4. cua-driver type_text '{"pid":<pid>,"window_id":<wid>,"element_index":<text-area>,"text":"hello","session":"1712345678-42-notepad"}'
5. cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file /tmp/droid-run-1712345678-42-xxxx/after.png
6. cua-driver end_session '{"session":"1712345678-42-notepad"}'"
```

## Evidence handoff

| Proof type | How to capture |
|---|---|
| Window state | `get_window_state ... --screenshot-out-file ${RUN_DIR}/proof-N.png` (also keeps the PNG out of the tool response) |
| Full display | `cua-driver screenshot '{"out_file":"'"${RUN_DIR}"'/screen.png"}'` |
| Semantic assertions | `tree_markdown` from `get_window_state` (filter with `"query":"..."`) |
| Video | `cua-driver recording start` / `recording stop` -> session-scoped `recording.mp4` |

Hand PNG/mp4 paths to **compose** / **verify** like any other driver output. Keep raw tool output alongside screenshots whenever GUI behavior is the thing under test.

## Critical rules

1. **Never change the user's frontmost app.** If a command says activate, foreground, raise, or make key -- stop; the per-pid event paths exist precisely so you do not need it. Platform forbidden-lists live in the upstream pack.
2. **Re-snapshot after every action and report what you observed**, not what you intended. An unchanged tree after an action is a finding, not a formality.
3. **Destructive actions need explicit user intent.** Do not delete files, send messages, or submit forms unless the workflow asked for exactly that.
57 changes: 57 additions & 0 deletions plugins/droid-control/skills/desktop-control/platforms/linux.md
@@ -0,0 +1,57 @@
# Desktop-Control: Linux

cua-driver on Linux enumerates windows via **X11**, walks semantic trees via **AT-SPI**, and injects input via **XSendEvent** (synthetic events targeted at a window XID -- no focus change, nothing leaks to the user's focused app). Upstream calls this tier pre-release, and it shows: the lifecycle (install, daemon, doctor, sessions, one-shot CLI), window discovery, and per-window screenshots are solid; Wayland-native enumeration, AT-SPI tree quality, and input delivery are not. Plan workflows around the reliable half.

## Install and daemon

Same installer and lifecycle as everywhere else (no sudo, `~/.cua-driver`):

```bash
cua-driver doctor # trustworthy probes: catches missing DISPLAY, verifies X11 + AT-SPI before you waste a run
cua-driver serve # required for element_index workflows
cua-driver status
```

`cua-driver permissions` is a no-op surface on Linux.

## The Wayland boundary

Window enumeration is **X11-only**. On a modern Plasma/GNOME Wayland desktop, native-Wayland windows are invisible to `list_windows`, and on such desktops that means most windows.

- Targets running under **Xwayland** (or a plain X11 session) enumerate and screenshot fine.
- To drive an app that defaults to native Wayland, force its X11 backend at launch where the toolkit allows it: `QT_QPA_PLATFORM=xcb` (Qt), `GDK_BACKEND=x11` (GTK), `--ozone-platform=x11` (Chromium/Electron).
- If the target cannot be put on X11, desktop-control cannot see it -- fall back to **agent-browser** (web/Electron) or **true-input** (terminal emulators).
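
The override table above can be captured in a small lookup so launch scripts do not hard-code it. This is a sketch with hypothetical function names; the three overrides are the ones listed above, and using `XDG_SESSION_TYPE` to detect a Wayland session is an assumption (it is normally set by the login manager, and an unset value is treated here as "not Wayland").

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of cua-driver.

# on_wayland_session: succeeds when the login session reports Wayland.
on_wayland_session() {
  [ "${XDG_SESSION_TYPE:-}" = "wayland" ]
}

# x11_override <toolkit>: print the setting that forces the toolkit's X11 backend.
# For qt and gtk this is an environment assignment; for chromium/electron
# it is a command-line flag to append to the app's own command.
x11_override() {
  case "$1" in
    qt)                echo "QT_QPA_PLATFORM=xcb" ;;
    gtk)               echo "GDK_BACKEND=x11" ;;
    chromium|electron) echo "--ozone-platform=x11" ;;
    *) echo "no known X11 override for toolkit: $1" >&2; return 1 ;;
  esac
}
```

A Qt target could then be started with `env "$(x11_override qt)" kcalc &` when `on_wayland_session` succeeds. Starting the app from the shell and discovering it afterwards with `list_windows` is an assumption of this sketch; the text above does not say whether `launch_app` can pass environment variables.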

## Semantic layer (AT-SPI) reliability

AT-SPI trees can collapse: the registry's `GetChildren` may time out, and Qt apps can render as a single root node even with `QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1`. When `get_window_state` returns a near-empty tree:

```bash
cua-driver config set capture_mode vision # screenshot-only snapshots
```

and work the pixel path (`click '{"pid":N,"window_id":W,"x":X,"y":Y}'`) against the returned PNG. Don't burn turns re-snapshotting hoping the tree fills in -- on this tier, pixel-first is a legitimate default.

## The toolkit boundary: synthetic input is silently dropped by Qt and GTK4

XSendEvent marks events with the `send_event` flag, and major toolkits **ignore flagged input entirely**. Verified on v0.5.1: Qt apps (kcalc) and GTK4 apps (zenity) no-op on *every* action -- pixel clicks, `press_key`, `type_text` -- while the driver reports success. There is no error to catch; only the post-action snapshot reveals it.

Practical consequence: the Act stage only works against apps that honor synthetic events (verified: winit-based apps like alacritty; generally simpler/older X11 toolkits). **Probe before committing to a workflow**: send one cheap keystroke, re-snapshot, and check it rendered. If the target ignores synthetic input, desktop-control cannot act on it on this tier -- Observe (screenshots, window enumeration) still works, but route the interaction through **agent-browser** (web/Electron) or **true-input** (terminal) instead.

## Text input is lossy even where it lands

In apps that do accept synthetic input, typing drops and mangles characters: shifted symbols can inject as their unshifted key (`*` arriving as `8`), trailing characters get dropped (verified: `type_text "echo ok42"` rendered `echo ok4`), and `type_text_chars` with generous per-char delays still loses keystrokes. `hotkey` chords (including paste shortcuts) and middle-click paste do **not** land reliably, so the clipboard is not a workaround here.

What works: short bursts plus verification. After every `type_text`, re-snapshot, compare the rendered text against what you sent, and repair the diff (`press_key` backspace, retype the missing tail). On Linux the post-action screenshot is not a formality -- it is the only way to know what actually arrived.
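
The "repair the diff" step reduces to finding the longest common prefix of what was sent and what rendered. A minimal sketch follows, using a hypothetical `repair_plan` helper; it assumes you have already extracted just the typed region from the post-action snapshot, so that both strings start at the same position.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of cua-driver.

# repair_plan <sent> <rendered>
# Prints "<backspaces><TAB><tail>": how many rendered characters to delete
# from the end, and which suffix of the sent text to retype afterwards.
repair_plan() {
  local sent="$1" rendered="$2" i=0
  # Advance i over the longest common prefix of the two strings.
  while [ "$i" -lt "${#sent}" ] && [ "$i" -lt "${#rendered}" ] \
        && [[ "${sent:i:1}" == "${rendered:i:1}" ]]; do
    i=$((i + 1))
  done
  # Everything rendered past the common prefix is wrong and must be deleted;
  # everything sent past the common prefix is still missing.
  printf '%d\t%s\n' "$(( ${#rendered} - i ))" "${sent:i}"
}
```

For the truncation case above, `repair_plan "echo ok42" "echo ok4"` yields zero backspaces and the tail `2`; for a mangled shifted symbol such as `2*3` rendering as `283`, it yields two backspaces and the tail `*3`. Issue that many backspace `press_key` calls, `type_text` the tail, and re-snapshot again; `cua-driver describe press_key` gives the payload schema.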

## Failure modes

| Symptom | Fix |
|---|---|
| Expected window missing from `list_windows` | Native-Wayland target -- relaunch it on the X11 backend (`QT_QPA_PLATFORM=xcb` / `GDK_BACKEND=x11` / `--ozone-platform=x11`) |
| Tree is a single root node / AT-SPI timeouts | `capture_mode vision` + pixel actions |
| Every action "succeeds" but nothing changes | Toolkit drops `send_event` input (Qt, GTK4) -- target is unreachable on this tier; use agent-browser or true-input for the interaction |
| Typed text arrives mangled or truncated | Verify-and-repair loop: re-snapshot, diff rendered text, backspace + retype the tail |
| `doctor` reports no DISPLAY | Run from the graphical session (or export the session's `DISPLAY`/`XAUTHORITY`), not a bare TTY/SSH context |

Deep mechanics live in the upstream pack: `~/.cua-driver/skills/cua-driver/LINUX.md`.