Factory-AI · factory-ain3sh · Jun 12, 2026 · Jun 10, 2026
diff --git a/.skillsrc b/.skillsrc
@@ -4,6 +4,7 @@
 droid-control/skills/agent-browser
 droid-control/skills/capture
 droid-control/skills/compose
+droid-control/skills/desktop-control
 droid-control/skills/droid-cli
 droid-control/skills/droid-control
 droid-control/skills/pty-capture

diff --git a/plugins/droid-control/.factory-plugin/plugin.json b/plugins/droid-control/.factory-plugin/plugin.json
@@ -1,5 +1,5 @@
 {
   "name": "droid-control",
-  "description": "Terminal and browser automation for testing, demos, QA, and computer-use tasks",
+  "description": "Terminal, browser, and native desktop automation for testing, demos, QA, and computer-use tasks",
   "version": "1.0.0"
 }
diff --git a/plugins/droid-control/ARCHITECTURE.md b/plugins/droid-control/ARCHITECTURE.md
@@ -36,7 +36,7 @@ This is the first guardrail against agent drift. The droid does not start with "
 
 | Route | Question | Examples |
 |---|---|---|
-| **Target** | What are we driving? | Droid CLI, other terminal TUI, web/Electron app, raw PTY bytes |
+| **Target** | What are we driving? | Droid CLI, other terminal TUI, web/Electron app, native desktop app, raw PTY bytes |
 | **Stage** | What does the workflow need? | capture, compose, verify |
 | **Artifact** | Does compose need polish tools? | showcase presets, effects, keystroke overlays |
 
@@ -48,7 +48,7 @@ Each atom skill is a self-contained surface the droid reads at a specific point
 
 | Atom type | Skills | Responsibility |
 |---|---|---|
-| Driver atoms | `tuistory`, `true-input`, `agent-browser` | How to drive a class of environment. |
+| Driver atoms | `tuistory`, `true-input`, `agent-browser`, `desktop-control` | How to drive a class of environment. |
 | Target atoms | `droid-cli`, `pty-capture` | Target-specific shortcuts, launch rules, and byte-capture patterns. |
 | Stage atoms | `capture`, `compose`, `verify` | Lifecycle phases with explicit inputs and outputs. |
 | Polish atom | `showcase` | Visual presets and cinematic layer guidance. |
@@ -119,7 +119,7 @@ Terminal workflows use `bin/tctl` as the only launch/control boundary. It hides
 
 `tctl` also enforces Droid CLI launch invariants. `droid-dev` sessions must provide `--repo-root`, which lets `tctl` set `DROID_DEV_REPO_ROOT` and record provenance for the captured branch and commit.
 
-Browser and Electron workflows intentionally do **not** go through `tctl`; they use `agent-browser`, whose persistent Playwright-backed daemon is the right control boundary for DOM snapshots, screenshots, and CDP-connected apps.
+Browser/Electron and native-desktop workflows intentionally do **not** go through `tctl`. They have their own control boundaries: `agent-browser`'s persistent Playwright daemon for DOM snapshots, screenshots, and CDP-connected apps; `cua-driver`'s daemon for accessibility trees and per-`(pid, window_id)` element caches on desktop GUIs.
 
 ## Video composition
 
@@ -161,6 +161,9 @@ skills/true-input/platforms/macos.md
 skills/pty-capture/platforms/linux.md
 skills/pty-capture/platforms/windows.md
 skills/pty-capture/platforms/macos.md
+skills/desktop-control/platforms/linux.md
+skills/desktop-control/platforms/windows.md
+skills/desktop-control/platforms/macos.md
 ```
 
 A Linux droid reads Linux Wayland instructions. A Windows VM byte-capture task reads Windows KVM instructions. The system does not rely on the droid to skim irrelevant sections correctly.
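
The per-platform file layout above implies a one-to-one mapping from host OS to platform file. A minimal sketch of that lookup for the new `desktop-control` files follows. `platform_file` is a hypothetical helper, not part of the plugin, and the `MINGW*`/`MSYS*`/`CYGWIN*` patterns assume a POSIX-style shell on Windows.

```bash
# Hypothetical helper: map a kernel name (as printed by `uname -s`)
# to the matching desktop-control platform file from this diff.
platform_file() {
  kernel="${1:-$(uname -s)}"
  case "$kernel" in
    Linux)                echo "skills/desktop-control/platforms/linux.md" ;;
    Darwin)               echo "skills/desktop-control/platforms/macos.md" ;;
    # Assumption: Windows is reached through Git Bash, MSYS2, or Cygwin.
    MINGW*|MSYS*|CYGWIN*) echo "skills/desktop-control/platforms/windows.md" ;;
    *) echo "unsupported platform: $kernel" >&2; return 1 ;;
  esac
}
```

Calling `platform_file` with no argument resolves the current host; passing a kernel name explicitly makes the mapping checkable without switching machines.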

diff --git a/plugins/droid-control/README.md b/plugins/droid-control/README.md
@@ -89,6 +89,7 @@ The `render-showcase.sh` helper owns the full pipeline: `.cast` conversion via `
 | true-input | Windows (KVM) | `libvirt`, `qemu`, KVM VM with SSH |
 | true-input | macOS (QEMU) | `qemu`, `socat`, macOS VM with SSH |
 | agent-browser | All | `agent-browser` |
+| desktop-control | All | `cua-driver` |
 | compose | All | `ffmpeg`, `ffprobe`, `agg` |
 | showcase | All | Node.js (>= 18), Chrome/Chromium |
 
@@ -98,7 +99,8 @@ pip install asciinema                                 # terminal recording
 cargo install --git https://github.com/asciinema/agg  # .cast -> .gif converter
 sudo apt-get install -y ffmpeg                        # video processing
 agent-browser install                                 # browser automation (downloads Chromium)
+curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh | bash  # native desktop GUI automation
 cd plugins/droid-control/remotion && npm install      # Remotion video rendering
 ```
 
-Only install what you need for your use case. Terminal demos need tuistory, asciinema, agg, and ffmpeg. Web/Electron automation just needs agent-browser.
+Only install what you need for your use case. Terminal demos need tuistory, asciinema, agg, and ffmpeg. Web/Electron automation just needs agent-browser. Native desktop GUI automation just needs cua-driver.
diff --git a/plugins/droid-control/skills/capture/SKILL.md b/plugins/droid-control/skills/capture/SKILL.md
@@ -8,7 +8,7 @@ user-invocable: false
 
 The orchestrator routed you here. This atom owns the full recording lifecycle: launch a target, execute an interaction script, collect raw outputs.
 
-You should already have a **driver atom** loaded (tuistory, true-input, or agent-browser) and optionally a **target atom** (droid-cli). This atom layers the recording discipline on top.
+You should already have a **driver atom** loaded (tuistory, true-input, agent-browser, or desktop-control) and optionally a **target atom** (droid-cli). This atom layers the recording discipline on top.
 
 ## Inputs
 
@@ -28,7 +28,7 @@ Before recording anything:
 - Terminal size is consistent across all sessions (`--cols 120 --rows 36`)
 - **Browser viewport size matches the composition layout** (see "Browser viewport sizing" below) — mismatched aspects letterbox in the final video
 - Branch/worktree paths and env vars are correct
-- Recording format matches the driver: `.cast` for tuistory, `.mp4` for true-input, screenshots for agent-browser
+- Recording format matches the driver: `.cast` for tuistory, `.mp4` for true-input, screenshots for agent-browser, window PNGs / `recording.mp4` for desktop-control
 - If comparing branches, both sessions use identical terminal / viewport dimensions and launch parameters
 - For `droid-dev` captures, `--repo-root` is **mandatory** — `tctl` will refuse to launch without it
 - **Color env vars are set** (see below)
@@ -137,6 +137,7 @@ Before handing off, confirm every expected output file exists and is non-empty:
 | Visual rendering | Screenshots: `$TCTL -s <name> screenshot -o /tmp/proof-N.png` |
 | Keyboard encoding | PTY bytes: `${DROID_PLUGIN_ROOT}/scripts/capture-terminal-bytes.py --backend <terminal> --combo <keys>` |
 | Web/Electron | Screenshots: `agent-browser screenshot --annotate /tmp/proof-N.png` |
+| Native desktop GUI | Window screenshots + AX trees: `cua-driver get_window_state '{...}' --screenshot-out-file ${RUN_DIR}/proof-N.png`; video via `cua-driver recording start/stop` |
 | Before/after | Run the same sequence on both branches at the same capture points |
 
 ## Outputs
@@ -148,7 +149,7 @@ Hand these to the **compose** stage:
 - clips: [/tmp/before.cast, /tmp/after.cast]
 - screenshots: [/tmp/proof-1.png, /tmp/proof-2.png]
 - keys: /tmp/keys.tsv (if keystroke logging was requested)
-- driver: tuistory | true-input | agent-browser
+- driver: tuistory | true-input | agent-browser | desktop-control
 - terminal_size: 120x36          # for tuistory / true-input
 - viewport: 960x1000             # for agent-browser; report so compose knows the clip aspect
 ```
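
The capture skill asks for every expected output file to exist and be non-empty before handoff, and desktop-control adds window PNGs and `recording.mp4` to that list. One way to sketch the check is below. `check_outputs` is a hypothetical helper, not something the plugin ships, and the paths in the usage line are placeholders.

```bash
# Hypothetical helper: succeed only if every listed file exists and is non-empty.
# The body runs in a subshell so its variables do not leak into the caller.
check_outputs() (
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then          # -s: file exists and has size > 0
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)

# Usage (placeholder paths):
#   check_outputs "${RUN_DIR}/before.png" "${RUN_DIR}/after.png" || exit 1
```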

diff --git a/plugins/droid-control/skills/desktop-control/SKILL.md b/plugins/droid-control/skills/desktop-control/SKILL.md
@@ -0,0 +1,127 @@
+---
+name: desktop-control
+description: Background knowledge for droid-control workflows -- not invoked directly. Desktop-control driver mechanics for native GUI app automation via trycua cua-driver.
+user-invocable: false
+---
+
+# Desktop-Control Driver
+
+The orchestrator routed you here. Use these mechanics to execute your plan.
+
+Drive native desktop GUI apps through upstream [trycua/cua](https://github.com/trycua/cua) `cua-driver`: enumerate apps and windows, snapshot accessibility trees, click/type/scroll by `element_index` or pixel coordinates, and verify by re-snapshot -- all without bringing the target to the foreground.
+
+## When to use
+
+- Automating a native desktop app (Finder, Notepad, System Settings, native editors)
+- Driving native dialogs and security/permission sheets that no DOM or PTY can reach
+- Visual QA of native UI: per-window screenshots, accessibility-tree assertions
+
+If the target is a terminal TUI, use **tuistory** or **true-input**. If it is a web page or an Electron app, use **agent-browser** -- CDP beats accessibility trees for anything Chromium-based.
+
+## Platform support
+
+| Platform | Upstream tier | Read |
+|---|---|---|
+| macOS | Production | [platforms/macos.md](platforms/macos.md) |
+| Windows | Production | [platforms/windows.md](platforms/windows.md) |
+| Linux | Pre-release (real caveats) | [platforms/linux.md](platforms/linux.md) |
+
+**Read the platform file for your target OS.** Each contains permissions, daemon launch, and platform-specific patterns and failure modes.
+
+## Prerequisites
+
+```bash
+# one-time install: per-user, no sudo/admin
+curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh | bash
+# Windows (PowerShell):
+#   irm https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.ps1 | iex
+
+cua-driver doctor           # platform probes: permissions, daemon, accessibility plumbing
+cua-driver skills install   # fetch the upstream skill pack to ~/.cua-driver/skills/cua-driver
+```
+
+The upstream pack (`~/.cua-driver/skills/cua-driver/SKILL.md` + your platform's doc) is the deep reference -- full tool surface, window-state behavior matrix, forbidden-command lists -- and it updates with the binary. **Read it before any nontrivial workflow.** This atom owns the droid-control integration: routing, run isolation, delegation, evidence handoff.
+
+## Daemon lifecycle
+
+`element_index` workflows **require the daemon**. Without it each CLI invocation is a fresh process and the per-`(pid, window_id)` element cache dies between calls.
+
+```bash
+cua-driver serve            # start the daemon (macOS needs the LaunchServices form -- see platforms/macos.md)
+cua-driver status           # daemon + socket health
+cua-driver stop
+```
+
+Permissions are checked and granted through the driver, not by hand-editing system settings (macOS-only gate; a no-op surface on Windows/Linux):
+
+```bash
+cua-driver permissions status   # read-only; answers via the running daemon
+cua-driver permissions grant    # attributed prompt flow -- the correct way to grant
+```
+
+## Core loop
+
+Tool names are `snake_case` and invoked directly: `cua-driver <tool> '<json>'`. (`cua-driver call <tool>` is legacy; do not use it.) Run `cua-driver list-tools` for the inventory and `cua-driver describe <tool>` for any tool's schema.
+
+Every workflow is Discover -> Observe -> Act -> Verify against an explicit `(pid, window_id)`:
+
+```bash
+cua-driver launch_app '{"name":"TextEdit"}'
+#  -> {pid: 844, windows: [{window_id: 10725, ...}]}   # list_windows only needed for long-lived pids
+cua-driver get_window_state '{"pid":844,"window_id":10725}' --screenshot-out-file "${RUN_DIR}/before.png"
+cua-driver click '{"pid":844,"window_id":10725,"element_index":14,"session":"'"${RUN_ID}"'-desktop"}'
+cua-driver get_window_state '{"pid":844,"window_id":10725}' --screenshot-out-file "${RUN_DIR}/after.png"
+```
+
+**Snapshot before AND after every action.** The pre-action `get_window_state` resolves the `element_index` you are about to use -- indices are per-snapshot, per `(pid, window_id)`, and stale ones fail with `No cached AX state`. The post-action snapshot is the evidence the action landed; without it a silent no-op looks like success.
+
+Addressing-mode preference:
+
+1. **`element_index`** (default) -- semantic, works on hidden and backgrounded windows, no foreground change.
+2. **Pixel** `click '{"pid":N,"window_id":W,"x":X,"y":Y}'` -- for surfaces the tree does not reach (canvases, custom-drawn controls). Coordinates are window-local screenshot pixels, top-left origin.
+3. **Keyboard** (`press_key`, `hotkey`) and platform fallbacks -- last resort; see the platform files.
+
+## Run isolation (ground rule 5 -> cua sessions)
+
+cua sessions are the desktop equivalent of `tctl` session prefixes: a session owns its agent cursor, config overrides, and recording scope. Declare one per run, derived from the workflow's `RUN_ID`, and pass it on every action:
+
+```bash
+cua-driver start_session '{"session":"'"${RUN_ID}"'-desktop"}'
+# ... every action carries "session":"${RUN_ID}-desktop" ...
+cua-driver end_session '{"session":"'"${RUN_ID}"'-desktop"}'
+```
+
+Parallel workers each declare their **own session** and pass `creates_new_application_instance: true` to `launch_app` so each gets its own window. The element cache is keyed on `(pid, window_id)` and the cursor on `session`, so isolated workers cannot collide.
+
+## Delegation
+
+`cua-driver` is on PATH -- workers need no `${DROID_PLUGIN_ROOT}` resolution. As with the other drivers, give capture workers **exact commands** with the parent's run scope baked in:
+
+```
+Task prompt for a desktop capture worker:
+  "Run these commands in order. Report screenshot paths and any errors.
+   1. cua-driver start_session '{"session":"1712345678-42-notepad"}'
+   2. cua-driver launch_app '{"name":"Notepad","creates_new_application_instance":true}'
+      -> note the returned pid and window_id
+   3. cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file /tmp/droid-run-1712345678-42-xxxx/before.png
+   4. cua-driver type_text '{"pid":<pid>,"window_id":<wid>,"element_index":<text-area>,"text":"hello","session":"1712345678-42-notepad"}'
+   5. cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file /tmp/droid-run-1712345678-42-xxxx/after.png
+   6. cua-driver end_session '{"session":"1712345678-42-notepad"}'"
+```
+
+## Evidence handoff
+
+| Proof type | How to capture |
+|---|---|
+| Window state | `get_window_state ... --screenshot-out-file ${RUN_DIR}/proof-N.png` (also keeps the PNG out of the tool response) |
+| Full display | `cua-driver screenshot '{"out_file":"'"${RUN_DIR}"'/screen.png"}'` |
+| Semantic assertions | `tree_markdown` from `get_window_state` (filter with `"query":"..."`) |
+| Video | `cua-driver recording start` / `recording stop` -> session-scoped `recording.mp4` |
+
+Hand PNG/mp4 paths to **compose** / **verify** like any other driver output. Keep raw tool output alongside screenshots whenever GUI behavior is the thing under test.
+
+## Critical rules
+
+1. **Never change the user's frontmost app.** If a command says activate, foreground, raise, or make key -- stop; the per-pid event paths exist precisely so you do not need it. Platform forbidden-lists live in the upstream pack.
+2. **Re-snapshot after every action and report what you observed**, not what you intended. An unchanged tree after an action is a finding, not a formality.
+3. **Destructive actions need explicit user intent.** Do not delete files, send messages, or submit forms unless the workflow asked for exactly that.
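
The new SKILL.md splices `${RUN_ID}` into single-quoted JSON with `'"${RUN_ID}"'`, which is easy to get wrong when copying commands into worker prompts. A small sketch of building the same payloads with `printf` instead is below. `session_name`, `session_payload`, and `click_payload` are hypothetical helpers, and the code assumes `RUN_ID` contains only characters that need no JSON escaping (digits and hyphens, as in the delegation example).

```bash
# Hypothetical helpers: build cua-driver JSON payloads without quote splicing.
# Assumption: RUN_ID needs no JSON escaping (e.g. 1712345678-42).
session_name() { printf '%s-desktop' "$1"; }

session_payload() {   # args: RUN_ID
  printf '{"session":"%s"}' "$(session_name "$1")"
}

click_payload() {     # args: PID WINDOW_ID ELEMENT_INDEX RUN_ID
  printf '{"pid":%s,"window_id":%s,"element_index":%s,"session":"%s"}' \
    "$1" "$2" "$3" "$(session_name "$4")"
}

# Usage, mirroring the commands in the skill:
#   cua-driver start_session "$(session_payload "$RUN_ID")"
#   cua-driver click "$(click_payload 844 10725 14 "$RUN_ID")"
#   cua-driver end_session "$(session_payload "$RUN_ID")"
```

Keeping the session name in one function means every action carries the same string, which is what the run-isolation rule depends on.
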
diff --git a/plugins/droid-control/skills/desktop-control/platforms/linux.md b/plugins/droid-control/skills/desktop-control/platforms/linux.md
@@ -0,0 +1,57 @@
+# Desktop-Control: Linux
+
+cua-driver on Linux enumerates windows via **X11**, walks semantic trees via **AT-SPI**, and injects input via **XSendEvent** (synthetic events targeted at a window XID -- no focus change, nothing leaks to the user's focused app). Upstream calls this tier pre-release, and it shows: the lifecycle (install, daemon, doctor, sessions, one-shot CLI), window discovery, and per-window screenshots are solid; Wayland-native enumeration, AT-SPI tree quality, and input delivery are not. Plan workflows around the reliable half.
+
+## Install and daemon
+
+Same installer and lifecycle as everywhere else (no sudo, `~/.cua-driver`):
+
+```bash
+cua-driver doctor    # trustworthy probes: catches missing DISPLAY, verifies X11 + AT-SPI before you waste a run
+cua-driver serve     # required for element_index workflows
+cua-driver status
+```
+
+`cua-driver permissions` is a no-op surface on Linux.
+
+## The Wayland boundary
+
+Window enumeration is **X11-only**. On a modern Plasma/GNOME Wayland desktop, most windows are native Wayland and therefore invisible to `list_windows`.
+
+- Targets running under **Xwayland** (or a plain X11 session) enumerate and screenshot fine.
+- To drive an app that defaults to native Wayland, force its X11 backend at launch where the toolkit allows it: `QT_QPA_PLATFORM=xcb` (Qt), `GDK_BACKEND=x11` (GTK), `--ozone-platform=x11` (Chromium/Electron).
+- If the target cannot be put on X11, desktop-control cannot see it -- fall back to **agent-browser** (web/Electron) or **true-input** (terminal emulators).
+
+## Semantic layer (AT-SPI) reliability
+
+AT-SPI trees can collapse: the registry's `GetChildren` may time out, and Qt apps can render as a single root node even with `QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1`. When `get_window_state` returns a near-empty tree:
+
+```bash
+cua-driver config set capture_mode vision   # screenshot-only snapshots
+```
+
+and work the pixel path (`click '{"pid":N,"window_id":W,"x":X,"y":Y}'`) against the returned PNG. Don't burn turns re-snapshotting hoping the tree fills in -- on this tier, pixel-first is a legitimate default.
+
+## The toolkit boundary: synthetic input is silently dropped by Qt and GTK4
+
+XSendEvent marks events with the `send_event` flag, and major toolkits **ignore flagged input entirely**. Verified on v0.5.1: Qt apps (kcalc) and GTK4 apps (zenity) no-op on *every* action -- pixel clicks, `press_key`, `type_text` -- while the driver reports success. There is no error to catch; only the post-action snapshot reveals it.
+
+Practical consequence: the Act stage only works against apps that honor synthetic events (verified: winit-based apps like alacritty; generally simpler/older X11 toolkits). **Probe before committing to a workflow**: send one cheap keystroke, re-snapshot, and check it rendered. If the target ignores synthetic input, desktop-control cannot act on it on this tier -- Observe (screenshots, window enumeration) still works, but route the interaction through **agent-browser** (web/Electron) or **true-input** (terminal) instead.
+
+## Text input is lossy even where it lands
+
+In apps that do accept synthetic input, typing drops and mangles characters: shifted symbols can inject as their unshifted key (`*` arriving as `8`), trailing characters get dropped (verified: `type_text "echo ok42"` rendered `echo ok4`), and `type_text_chars` with generous per-char delays still loses keystrokes. `hotkey` chords (including paste shortcuts) and middle-click paste do **not** land reliably, so the clipboard is not a workaround here.
+
+What works: short bursts plus verification. After every `type_text`, re-snapshot, compare the rendered text against what you sent, and repair the diff (`press_key` backspace, retype the missing tail). On Linux the post-action screenshot is not a formality -- it is the only way to know what actually arrived.
+
+## Failure modes
+
+| Symptom | Fix |
+|---|---|
+| Expected window missing from `list_windows` | Native-Wayland target -- relaunch it on the X11 backend (`QT_QPA_PLATFORM=xcb` / `GDK_BACKEND=x11` / `--ozone-platform=x11`) |
+| Tree is a single root node / AT-SPI timeouts | `capture_mode vision` + pixel actions |
+| Every action "succeeds" but nothing changes | Toolkit drops `send_event` input (Qt, GTK4) -- target is unreachable on this tier; use agent-browser or true-input for the interaction |
+| Typed text arrives mangled or truncated | Verify-and-repair loop: re-snapshot, diff rendered text, backspace + retype the tail |
+| `doctor` reports no DISPLAY | Run from the graphical session (or export the session's `DISPLAY`/`XAUTHORITY`), not a bare TTY/SSH context |
+
+Deep mechanics live in the upstream pack: `~/.cua-driver/skills/cua-driver/LINUX.md`.
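
The verify-and-repair loop in linux.md (re-snapshot, diff the rendered text, backspace, retype the tail) reduces to a common-prefix comparison. A minimal sketch of that diff step is below. `repair_plan` is a hypothetical helper, and it assumes ASCII text and a cursor sitting at the end of the rendered text. Extracting the rendered text from a snapshot is left out because that depends on the target app.

```bash
# Hypothetical helper: repair_plan SENT RENDERED
# Prints "<backspaces> <tail>": how many trailing rendered characters to delete,
# then the text to retype so the field ends up equal to SENT.
# The body runs in a subshell so its variables do not leak into the caller.
repair_plan() (
  sent=$1
  rendered=$2
  # Strip the longest common prefix, one character at a time.
  while [ -n "$sent" ] && [ -n "$rendered" ]; do
    s_head=${sent%"${sent#?}"}          # first character of what is left of SENT
    r_head=${rendered%"${rendered#?}"}  # first character of what is left of RENDERED
    [ "$s_head" = "$r_head" ] || break
    sent=${sent#?}
    rendered=${rendered#?}
  done
  # The rest of RENDERED is wrong and must be deleted; the rest of SENT must be retyped.
  printf '%s %s\n' "${#rendered}" "$sent"
)

repair_plan "echo ok42" "echo ok4"   # prints "0 2": nothing to delete, retype "2"
repair_plan "a*b" "a8b"              # prints "2 *b": two backspaces, retype "*b"
```

The first number maps to repeated `press_key` backspace calls and the tail to a follow-up `type_text`, each of which still needs its own post-action snapshot.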