diff --git a/.skillsrc b/.skillsrc index 3e16873..4e26c3d 100644 --- a/.skillsrc +++ b/.skillsrc @@ -4,6 +4,7 @@ droid-control/skills/agent-browser droid-control/skills/capture droid-control/skills/compose +droid-control/skills/desktop-control droid-control/skills/droid-cli droid-control/skills/droid-control droid-control/skills/pty-capture diff --git a/plugins/droid-control/.factory-plugin/plugin.json b/plugins/droid-control/.factory-plugin/plugin.json index 16e1250..8f6c249 100644 --- a/plugins/droid-control/.factory-plugin/plugin.json +++ b/plugins/droid-control/.factory-plugin/plugin.json @@ -1,5 +1,5 @@ { "name": "droid-control", - "description": "Terminal and browser automation for testing, demos, QA, and computer-use tasks", + "description": "Terminal, browser, and native desktop automation for testing, demos, QA, and computer-use tasks", "version": "1.0.0" } diff --git a/plugins/droid-control/ARCHITECTURE.md b/plugins/droid-control/ARCHITECTURE.md index ba9409d..6dc160b 100644 --- a/plugins/droid-control/ARCHITECTURE.md +++ b/plugins/droid-control/ARCHITECTURE.md @@ -36,7 +36,7 @@ This is the first guardrail against agent drift. The droid does not start with " | Route | Question | Examples | |---|---|---| -| **Target** | What are we driving? | Droid CLI, other terminal TUI, web/Electron app, raw PTY bytes | +| **Target** | What are we driving? | Droid CLI, other terminal TUI, web/Electron app, native desktop app, raw PTY bytes | | **Stage** | What does the workflow need? | capture, compose, verify | | **Artifact** | Does compose need polish tools? | showcase presets, effects, keystroke overlays | @@ -48,7 +48,7 @@ Each atom skill is a self-contained surface the droid reads at a specific point | Atom type | Skills | Responsibility | |---|---|---| -| Driver atoms | `tuistory`, `true-input`, `agent-browser` | How to drive a class of environment. 
| +| Driver atoms | `tuistory`, `true-input`, `agent-browser`, `desktop-control` | How to drive a class of environment. | | Target atoms | `droid-cli`, `pty-capture` | Target-specific shortcuts, launch rules, and byte-capture patterns. | | Stage atoms | `capture`, `compose`, `verify` | Lifecycle phases with explicit inputs and outputs. | | Polish atom | `showcase` | Visual presets and cinematic layer guidance. | @@ -119,7 +119,7 @@ Terminal workflows use `bin/tctl` as the only launch/control boundary. It hides `tctl` also enforces Droid CLI launch invariants. `droid-dev` sessions must provide `--repo-root`, which lets `tctl` set `DROID_DEV_REPO_ROOT` and record provenance for the captured branch and commit. -Browser and Electron workflows intentionally do **not** go through `tctl`; they use `agent-browser`, whose persistent Playwright-backed daemon is the right control boundary for DOM snapshots, screenshots, and CDP-connected apps. +Browser/Electron and native-desktop workflows intentionally do **not** go through `tctl`. They have their own control boundaries: `agent-browser`'s persistent Playwright daemon for DOM snapshots, screenshots, and CDP-connected apps; `cua-driver`'s daemon for accessibility trees and per-`(pid, window_id)` element caches on desktop GUIs. ## Video composition @@ -161,6 +161,9 @@ skills/true-input/platforms/macos.md skills/pty-capture/platforms/linux.md skills/pty-capture/platforms/windows.md skills/pty-capture/platforms/macos.md +skills/desktop-control/platforms/linux.md +skills/desktop-control/platforms/windows.md +skills/desktop-control/platforms/macos.md ``` A Linux droid reads Linux Wayland instructions. A Windows VM byte-capture task reads Windows KVM instructions. The system does not rely on the droid to skim irrelevant sections correctly. 
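
The per-platform layout listed above can be sketched as a lookup. This is a minimal illustration, assuming a hypothetical helper named `platform_doc` that is not part of the plugin; the only facts taken from the listing are the `skills/<skill>/platforms/<os>.md` path shape and the three OS names.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not shipped by the plugin):
# resolve the single platform file a droid should read for a skill and target OS.
platform_doc() {
  local skill="$1" target_os="$2"
  case "$target_os" in
    linux|windows|macos)
      # Path shape mirrors the listing: skills/<skill>/platforms/<os>.md
      printf 'skills/%s/platforms/%s.md\n' "$skill" "$target_os"
      ;;
    *)
      printf 'unknown target OS: %s\n' "$target_os" >&2
      return 1
      ;;
  esac
}
```

For example, `platform_doc desktop-control linux` prints the Linux file only, so instructions for the other platforms never enter the droid's context.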
diff --git a/plugins/droid-control/README.md b/plugins/droid-control/README.md index c33cbb9..a2d97fa 100644 --- a/plugins/droid-control/README.md +++ b/plugins/droid-control/README.md @@ -89,6 +89,7 @@ The `render-showcase.sh` helper owns the full pipeline: `.cast` conversion via ` | true-input | Windows (KVM) | `libvirt`, `qemu`, KVM VM with SSH | | true-input | macOS (QEMU) | `qemu`, `socat`, macOS VM with SSH | | agent-browser | All | `agent-browser` | +| desktop-control | All | `cua-driver` | | compose | All | `ffmpeg`, `ffprobe`, `agg` | | showcase | All | Node.js (>= 18), Chrome/Chromium | @@ -98,7 +99,8 @@ pip install asciinema # terminal recording cargo install --git https://github.com/asciinema/agg # .cast -> .gif converter sudo apt-get install -y ffmpeg # video processing agent-browser install # browser automation (downloads Chromium) +curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh | bash # native desktop GUI automation cd plugins/droid-control/remotion && npm install # Remotion video rendering ``` -Only install what you need for your use case. Terminal demos need tuistory, asciinema, agg, and ffmpeg. Web/Electron automation just needs agent-browser. +Only install what you need for your use case. Terminal demos need tuistory, asciinema, agg, and ffmpeg. Web/Electron automation just needs agent-browser. Native desktop GUI automation just needs cua-driver. diff --git a/plugins/droid-control/skills/capture/SKILL.md b/plugins/droid-control/skills/capture/SKILL.md index 037e099..fd1e523 100644 --- a/plugins/droid-control/skills/capture/SKILL.md +++ b/plugins/droid-control/skills/capture/SKILL.md @@ -8,7 +8,7 @@ user-invocable: false The orchestrator routed you here. This atom owns the full recording lifecycle: launch a target, execute an interaction script, collect raw outputs. 
-You should already have a **driver atom** loaded (tuistory, true-input, or agent-browser) and optionally a **target atom** (droid-cli). This atom layers the recording discipline on top. +You should already have a **driver atom** loaded (tuistory, true-input, agent-browser, or desktop-control) and optionally a **target atom** (droid-cli). This atom layers the recording discipline on top. ## Inputs @@ -28,7 +28,7 @@ Before recording anything: - Terminal size is consistent across all sessions (`--cols 120 --rows 36`) - **Browser viewport size matches the composition layout** (see "Browser viewport sizing" below) — mismatched aspects letterbox in the final video - Branch/worktree paths and env vars are correct -- Recording format matches the driver: `.cast` for tuistory, `.mp4` for true-input, screenshots for agent-browser +- Recording format matches the driver: `.cast` for tuistory, `.mp4` for true-input, screenshots for agent-browser, window PNGs / `recording.mp4` for desktop-control - If comparing branches, both sessions use identical terminal / viewport dimensions and launch parameters - For `droid-dev` captures, `--repo-root` is **mandatory** — `tctl` will refuse to launch without it - **Color env vars are set** (see below) @@ -137,6 +137,7 @@ Before handing off, confirm every expected output file exists and is non-empty: | Visual rendering | Screenshots: `$TCTL -s screenshot -o /tmp/proof-N.png` | | Keyboard encoding | PTY bytes: `${DROID_PLUGIN_ROOT}/scripts/capture-terminal-bytes.py --backend --combo ` | | Web/Electron | Screenshots: `agent-browser screenshot --annotate /tmp/proof-N.png` | +| Native desktop GUI | Window screenshots + AX trees: `cua-driver get_window_state '{...}' --screenshot-out-file ${RUN_DIR}/proof-N.png`; video via `cua-driver recording start/stop` | | Before/after | Run the same sequence on both branches at the same capture points | ## Outputs @@ -148,7 +149,7 @@ Hand these to the **compose** stage: - clips: [/tmp/before.cast, 
/tmp/after.cast] - screenshots: [/tmp/proof-1.png, /tmp/proof-2.png] - keys: /tmp/keys.tsv (if keystroke logging was requested) -- driver: tuistory | true-input | agent-browser +- driver: tuistory | true-input | agent-browser | desktop-control - terminal_size: 120x36 # for tuistory / true-input - viewport: 960x1000 # for agent-browser; report so compose knows the clip aspect ``` diff --git a/plugins/droid-control/skills/desktop-control/SKILL.md b/plugins/droid-control/skills/desktop-control/SKILL.md new file mode 100644 index 0000000..ce971df --- /dev/null +++ b/plugins/droid-control/skills/desktop-control/SKILL.md @@ -0,0 +1,127 @@ +--- +name: desktop-control +description: Background knowledge for droid-control workflows -- not invoked directly. Desktop-control driver mechanics for native GUI app automation via trycua cua-driver. +user-invocable: false +--- + +# Desktop-Control Driver + +The orchestrator routed you here. Use these mechanics to execute your plan. + +Drive native desktop GUI apps through upstream [trycua/cua](https://github.com/trycua/cua) `cua-driver`: enumerate apps and windows, snapshot accessibility trees, click/type/scroll by `element_index` or pixel coordinates, and verify by re-snapshot -- all without bringing the target to the foreground. + +## When to use + +- Automating a native desktop app (Finder, Notepad, System Settings, native editors) +- Driving native dialogs and security/permission sheets that no DOM or PTY can reach +- Visual QA of native UI: per-window screenshots, accessibility-tree assertions + +If the target is a terminal TUI, use **tuistory** or **true-input**. If it is a web page or an Electron app, use **agent-browser** -- CDP beats accessibility trees for anything Chromium-based. 
+ +## Platform support + +| Platform | Upstream tier | Read | +|---|---|---| +| macOS | Production | [platforms/macos.md](platforms/macos.md) | +| Windows | Production | [platforms/windows.md](platforms/windows.md) | +| Linux | Pre-release (real caveats) | [platforms/linux.md](platforms/linux.md) | + +**Read the platform file for your target OS.** Each contains permissions, daemon launch, and platform-specific patterns and failure modes. + +## Prerequisites + +```bash +# one-time install: per-user, no sudo/admin +curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh | bash +# Windows (PowerShell): +# irm https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.ps1 | iex + +cua-driver doctor # platform probes: permissions, daemon, accessibility plumbing +cua-driver skills install # fetch the upstream skill pack to ~/.cua-driver/skills/cua-driver +``` + +The upstream pack (`~/.cua-driver/skills/cua-driver/SKILL.md` + your platform's doc) is the deep reference -- full tool surface, window-state behavior matrix, forbidden-command lists -- and it updates with the binary. **Read it before any nontrivial workflow.** This atom owns the droid-control integration: routing, run isolation, delegation, evidence handoff. + +## Daemon lifecycle + +`element_index` workflows **require the daemon**. Without it each CLI invocation is a fresh process and the per-`(pid, window_id)` element cache dies between calls. 
+ +```bash +cua-driver serve # start the daemon (macOS needs the LaunchServices form -- see platforms/macos.md) +cua-driver status # daemon + socket health +cua-driver stop +``` + +Permissions are checked and granted through the driver, not by hand-editing system settings (macOS-only gate; a no-op surface on Windows/Linux): + +```bash +cua-driver permissions status # read-only; answers via the running daemon +cua-driver permissions grant # attributed prompt flow -- the correct way to grant +``` + +## Core loop + +Tool names are `snake_case` and invoked directly: `cua-driver ''`. (`cua-driver call ` is legacy; do not use it.) `cua-driver list-tools` for the inventory, `cua-driver describe ` for any schema. + +Every workflow is Discover -> Observe -> Act -> Verify against an explicit `(pid, window_id)`: + +```bash +cua-driver launch_app '{"name":"TextEdit"}' +# -> {pid: 844, windows: [{window_id: 10725, ...}]} # list_windows only needed for long-lived pids +cua-driver get_window_state '{"pid":844,"window_id":10725}' --screenshot-out-file "${RUN_DIR}/before.png" +cua-driver click '{"pid":844,"window_id":10725,"element_index":14,"session":"'"${RUN_ID}"'-desktop"}' +cua-driver get_window_state '{"pid":844,"window_id":10725}' --screenshot-out-file "${RUN_DIR}/after.png" +``` + +**Snapshot before AND after every action.** The pre-action `get_window_state` resolves the `element_index` you are about to use -- indices are per-snapshot, per `(pid, window_id)`, and stale ones fail with `No cached AX state`. The post-action snapshot is the evidence the action landed; without it a silent no-op looks like success. + +Addressing-mode preference: + +1. **`element_index`** (default) -- semantic, works on hidden and backgrounded windows, no foreground change. +2. **Pixel** `click '{"pid":N,"window_id":W,"x":X,"y":Y}'` -- for surfaces the tree does not reach (canvases, custom-drawn controls). Coordinates are window-local screenshot pixels, top-left origin. +3. 
**Keyboard** (`press_key`, `hotkey`) and platform fallbacks -- last resort; see the platform files. + +## Run isolation (ground rule 5 -> cua sessions) + +cua sessions are the desktop equivalent of `tctl` session prefixes: a session owns its agent cursor, config overrides, and recording scope. Declare one per run, derived from the workflow's `RUN_ID`, and pass it on every action: + +```bash +cua-driver start_session '{"session":"'"${RUN_ID}"'-desktop"}' +# ... every action carries "session":"${RUN_ID}-desktop" ... +cua-driver end_session '{"session":"'"${RUN_ID}"'-desktop"}' +``` + +Parallel workers each declare their **own session** and pass `creates_new_application_instance: true` to `launch_app` so each gets its own window. The element cache is keyed on `(pid, window_id)` and the cursor on `session`, so isolated workers cannot collide. + +## Delegation + +`cua-driver` is on PATH -- workers need no `${DROID_PLUGIN_ROOT}` resolution. As with the other drivers, give capture workers **exact commands** with the parent's run scope baked in: + +``` +Task prompt for a desktop capture worker: + "Run these commands in order. Report screenshot paths and any errors. + 1. cua-driver start_session '{"session":"1712345678-42-notepad"}' + 2. cua-driver launch_app '{"name":"Notepad","creates_new_application_instance":true}' + -> note the returned pid and window_id + 3. cua-driver get_window_state '{"pid":,"window_id":}' --screenshot-out-file /tmp/droid-run-1712345678-42-xxxx/before.png + 4. cua-driver type_text '{"pid":,"window_id":,"element_index":,"text":"hello","session":"1712345678-42-notepad"}' + 5. cua-driver get_window_state '{"pid":,"window_id":}' --screenshot-out-file /tmp/droid-run-1712345678-42-xxxx/after.png + 6. cua-driver end_session '{"session":"1712345678-42-notepad"}'" +``` + +## Evidence handoff + +| Proof type | How to capture | +|---|---| +| Window state | `get_window_state ... 
--screenshot-out-file ${RUN_DIR}/proof-N.png` (also keeps the PNG out of the tool response) | +| Full display | `cua-driver screenshot '{"out_file":"'"${RUN_DIR}"'/screen.png"}'` | +| Semantic assertions | `tree_markdown` from `get_window_state` (filter with `"query":"..."`) | +| Video | `cua-driver recording start` / `recording stop` -> session-scoped `recording.mp4` | + +Hand PNG/mp4 paths to **compose** / **verify** like any other driver output. Keep raw tool output alongside screenshots whenever GUI behavior is the thing under test. + +## Critical rules + +1. **Never change the user's frontmost app.** If a command says activate, foreground, raise, or make key -- stop; the per-pid event paths exist precisely so you do not need it. Platform forbidden-lists live in the upstream pack. +2. **Re-snapshot after every action and report what you observed**, not what you intended. An unchanged tree after an action is a finding, not a formality. +3. **Destructive actions need explicit user intent.** Do not delete files, send messages, or submit forms unless the workflow asked for exactly that. diff --git a/plugins/droid-control/skills/desktop-control/platforms/linux.md b/plugins/droid-control/skills/desktop-control/platforms/linux.md new file mode 100644 index 0000000..13ac37d --- /dev/null +++ b/plugins/droid-control/skills/desktop-control/platforms/linux.md @@ -0,0 +1,57 @@ +# Desktop-Control: Linux + +cua-driver on Linux enumerates windows via **X11**, walks semantic trees via **AT-SPI**, and injects input via **XSendEvent** (synthetic events targeted at a window XID -- no focus change, nothing leaks to the user's focused app). Upstream calls this tier pre-release, and it shows: the lifecycle (install, daemon, doctor, sessions, one-shot CLI), window discovery, and per-window screenshots are solid; Wayland-native enumeration, AT-SPI tree quality, and input delivery are not. Plan workflows around the reliable half. 
+ +## Install and daemon + +Same installer and lifecycle as everywhere else (no sudo, `~/.cua-driver`): + +```bash +cua-driver doctor # trustworthy probes: catches missing DISPLAY, verifies X11 + AT-SPI before you waste a run +cua-driver serve # required for element_index workflows +cua-driver status +``` + +`cua-driver permissions` is a no-op surface on Linux. + +## The Wayland boundary + +Window enumeration is **X11-only**. On a modern Plasma/GNOME Wayland desktop, native-Wayland windows are invisible to `list_windows` -- which is most windows. + +- Targets running under **Xwayland** (or a plain X11 session) enumerate and screenshot fine. +- To drive an app that defaults to native Wayland, force its X11 backend at launch where the toolkit allows it: `QT_QPA_PLATFORM=xcb` (Qt), `GDK_BACKEND=x11` (GTK), `--ozone-platform=x11` (Chromium/Electron). +- If the target cannot be put on X11, desktop-control cannot see it -- fall back to **agent-browser** (web/Electron) or **true-input** (terminal emulators). + +## Semantic layer (AT-SPI) reliability + +AT-SPI trees can collapse: the registry's `GetChildren` may time out, and Qt apps can render as a single root node even with `QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1`. When `get_window_state` returns a near-empty tree: + +```bash +cua-driver config set capture_mode vision # screenshot-only snapshots +``` + +and work the pixel path (`click '{"pid":N,"window_id":W,"x":X,"y":Y}'`) against the returned PNG. Don't burn turns re-snapshotting hoping the tree fills in -- on this tier, pixel-first is a legitimate default. + +## The toolkit boundary: synthetic input is silently dropped by Qt and GTK4 + +XSendEvent marks events with the `send_event` flag, and major toolkits **ignore flagged input entirely**. Verified on v0.5.1: Qt apps (kcalc) and GTK4 apps (zenity) no-op on *every* action -- pixel clicks, `press_key`, `type_text` -- while the driver reports success. There is no error to catch; only the post-action snapshot reveals it. 
+ +Practical consequence: the Act stage only works against apps that honor synthetic events (verified: winit-based apps like alacritty; generally simpler/older X11 toolkits). **Probe before committing to a workflow**: send one cheap keystroke, re-snapshot, and check it rendered. If the target ignores synthetic input, desktop-control cannot act on it on this tier -- Observe (screenshots, window enumeration) still works, but route the interaction through **agent-browser** (web/Electron) or **true-input** (terminal) instead. + +## Text input is lossy even where it lands + +In apps that do accept synthetic input, typing drops and mangles characters: shifted symbols can inject as their unshifted key (`*` arriving as `8`), trailing characters get dropped (verified: `type_text "echo ok42"` rendered `echo ok4`), and `type_text_chars` with generous per-char delays still loses keystrokes. `hotkey` chords (including paste shortcuts) and middle-click paste do **not** land reliably, so the clipboard is not a workaround here. + +What works: short bursts plus verification. After every `type_text`, re-snapshot, compare the rendered text against what you sent, and repair the diff (`press_key` backspace, retype the missing tail). On Linux the post-action screenshot is not a formality -- it is the only way to know what actually arrived. 
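
The verify-and-repair step above can be sketched as a small string comparison. This is a sketch under stated assumptions: `repair_plan` is a hypothetical helper, the rendered text has already been extracted from the post-action snapshot by some other means, and the cursor sits at the end of the rendered text.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration): compare what was sent
# with what rendered, and print "<backspace count><TAB><text to retype>".
repair_plan() {
  local sent="$1" rendered="$2"
  local i=0
  # Walk the longest common prefix of the two strings.
  while [ "$i" -lt "${#sent}" ] && [ "$i" -lt "${#rendered}" ] \
        && [ "${sent:i:1}" = "${rendered:i:1}" ]; do
    i=$((i + 1))
  done
  # Everything rendered past the common prefix is wrong and must be deleted.
  local backspaces=$(( ${#rendered} - i ))
  # Everything sent past the common prefix still has to be typed.
  local missing="${sent:i}"
  printf '%d\t%s\n' "$backspaces" "$missing"
}
```

With the truncation case from the text, `repair_plan 'echo ok42' 'echo ok4'` reports zero backspaces and a missing tail of `2`. The first field is the number of backspace key presses to send and the second is the text to retype, after which the loop re-snapshots and compares again.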
+ +## Failure modes + +| Symptom | Fix | +|---|---| +| Expected window missing from `list_windows` | Native-Wayland target -- relaunch it on the X11 backend (`QT_QPA_PLATFORM=xcb` / `GDK_BACKEND=x11` / `--ozone-platform=x11`) | +| Tree is a single root node / AT-SPI timeouts | `capture_mode vision` + pixel actions | +| Every action "succeeds" but nothing changes | Toolkit drops `send_event` input (Qt, GTK4) -- target is unreachable on this tier; use agent-browser or true-input for the interaction | +| Typed text arrives mangled or truncated | Verify-and-repair loop: re-snapshot, diff rendered text, backspace + retype the tail | +| `doctor` reports no DISPLAY | Run from the graphical session (or export the session's `DISPLAY`/`XAUTHORITY`), not a bare TTY/SSH context | + +Deep mechanics live in the upstream pack: `~/.cua-driver/skills/cua-driver/LINUX.md`. diff --git a/plugins/droid-control/skills/desktop-control/platforms/macos.md b/plugins/droid-control/skills/desktop-control/platforms/macos.md new file mode 100644 index 0000000..e7c1645 --- /dev/null +++ b/plugins/droid-control/skills/desktop-control/platforms/macos.md @@ -0,0 +1,64 @@ +# Desktop-Control: macOS + +cua-driver on macOS posts events per-pid through Accessibility (AX) and captures via ScreenCaptureKit. Both are gated by TCC, and TCC attributes grants to the **app bundle that asks** -- which is why every flow below routes through `CuaDriver.app` instead of your terminal. + +## Permissions (TCC) + +```bash +cua-driver permissions grant # LaunchServices-routed: the Accessibility + Screen Recording dialogs + # attribute to com.trycua.driver, then it confirms the driver's own status +cua-driver permissions status # read-only via the daemon; reports `unknown` when no daemon is up +``` + +Do not grant by clicking through System Settings for your terminal app -- the daemon runs under the bundle identity, and terminal-attributed grants do nothing for it. 
The first real screen capture may trigger one extra consent sheet; accept it. + +## Daemon launch + +Launch from the logged-in GUI session so the daemon attaches to it with the bundle's TCC identity: + +```bash +open -n -g -a CuaDriver --args serve +cua-driver status +cua-driver stop +``` + +SSH-launched bare binaries often miss the GUI session and their AX/capture probes hang. (`cua-driver mcp` and CLI tool calls auto-proxy to a properly attributed daemon when one is reachable.) + +## Patterns + +**Reliable terminal command entry** -- when `type_text` or raw key posting drops characters in Terminal-class apps, route through the pasteboard: + +```bash +printf '%s' 'your command' | pbcopy +cua-driver hotkey '{"pid":,"window_id":,"keys":["cmd","v"]}' +cua-driver press_key '{"pid":,"window_id":,"key":"return"}' +``` + +**Native security / modal sheets** (SecurityAgent, Keychain prompts, auth dialogs) -- these often report `is_on_screen: false` even while visible. Locate by process, then enumerate everything: + +```bash +pgrep -fl SecurityAgent +cua-driver list_windows '{"pid":,"on_screen_only":false}' +cua-driver get_window_state '{"pid":,"window_id":}' +``` + +Only enter credentials in environments you own and were explicitly authorized to drive. + +**Menu commands / app shortcuts** -- pass `window_id` so AppKit routes the key equivalent to the target app instead of the frontmost one: + +```bash +cua-driver hotkey '{"pid":835,"window_id":79,"keys":["cmd","q"]}' +``` + +**Backgrounded / off-space windows** -- the driver acts on `(pid, window_id)` without raising. Enumerate with `on_screen_only: false` and target directly. + +## Failure modes + +| Symptom | Fix | +|---|---| +| AX write fails (`AXPress` returns `-25204`) on a system sheet | Fall back to `press_key` / `hotkey` / pixel `click` | +| ScreenCaptureKit error (e.g. 
SCK `-3801`) in `som`/`vision` capture | `cua-driver config set capture_mode ax` (tree-only, skips Screen Recording), or retry | +| Known dialog missing from `list_windows` results | Re-query with `"on_screen_only": false` | +| Probes hang / permissions report `unknown` | Daemon was launched without GUI attribution -- `cua-driver stop`, relaunch via `open -n -g -a CuaDriver --args serve` | + +Deep mechanics (no-foreground forbidden-list, AXMenuBar navigation, SkyLight click dispatch, Apple-Events browser bridge) live in the upstream pack: `~/.cua-driver/skills/cua-driver/MACOS.md`. diff --git a/plugins/droid-control/skills/desktop-control/platforms/windows.md b/plugins/droid-control/skills/desktop-control/platforms/windows.md new file mode 100644 index 0000000..b55ae11 --- /dev/null +++ b/plugins/droid-control/skills/desktop-control/platforms/windows.md @@ -0,0 +1,45 @@ +# Desktop-Control: Windows + +cua-driver on Windows walks UI Automation (UIA) trees and dispatches actions through a layered UIA + `PostMessage` chain -- per-window message posting, not HID synthesis, so the user's foreground app is untouched. + +## Install and daemon + +The upstream installer is per-user (no admin elevation): binary under `%LOCALAPPDATA%\Programs\Cua\cua-driver\bin`, data and skill pack under `%USERPROFILE%\.cua-driver`, and an autostart task (`cua-driver autostart status|kick|disable`) registered for the daemon. + +```powershell +cua-driver doctor +cua-driver serve # required for element_index workflows +cua-driver status +cua-driver stop +``` + +`cua-driver permissions` is a no-op surface on Windows (TCC is a macOS concept) -- there is no grant dance. The real constraint is **Session 0 isolation**: anything launched by a service (including some SSH daemons) lives in a session with no interactive desktop, where window enumeration returns nothing. 
Tool calls auto-proxy to an interactive-session daemon when one is reachable; if results come back empty, confirm the daemon was started from the logged-in interactive session, not a service context. + +## JSON quoting (the PowerShell 5.1 footgun) + +Windows PowerShell 5.1 strips quotes around JSON field names in multi-field arguments, so positional JSON fails to parse. Pipe via stdin, or use PowerShell 7+ (`pwsh`): + +```powershell +'{"pid":1234,"window_id":5678}' | cua-driver get_window_state +``` + +From `cmd.exe`, escape inner quotes instead: `cua-driver get_window_state "{\"pid\":1234,\"window_id\":5678}"`. + +## Patterns + +**UWP / packaged apps** -- Store apps (Calculator, Settings) are hosted by `ApplicationFrameHost.exe`, so the visible window's pid is the host's, not the app process's. If `list_windows` against the app's own pid comes up empty, enumerate `ApplicationFrameHost.exe`'s windows and match by title. Classic Win32 apps (Notepad, Explorer) own their windows directly. + +**Minimized windows** -- `get_window_state` and element-index actions work in place, but `press_key` commits silently no-op (no message pump focus). Use `set_value` or element-index-click the commit-equivalent button instead. + +**Browsers / Electron** -- prefer **agent-browser**. If you must stay in desktop-control, launch the browser with `--remote-debugging-port=` and export `CUA_DRIVER_CDP_PORT=` so `execute_javascript` / `query_dom` can attach; UIA covers `get_text` either way. + +## Failure modes + +| Symptom | Fix | +|---|---| +| `UIA invoke failed` on an element | Try `click` with an explicit `action` (`show_menu`, `confirm`, ...) 
or fall through to a pixel click on the element's center | +| Empty window lists, blank screenshots | Session 0 daemon -- restart `cua-driver serve` from the interactive desktop session | +| Positional JSON "did not parse" errors | PowerShell 5.1 quote-stripping -- pipe JSON via stdin or use `pwsh` | +| Target window not under the app's pid | UWP hosting -- enumerate `ApplicationFrameHost.exe` windows | + +Deep mechanics (UIA tree semantics, click-dispatch layering, focus-steal vectors, UAC boundaries) live in the upstream pack: `~/.cua-driver/skills/cua-driver/WINDOWS.md`. diff --git a/plugins/droid-control/skills/droid-control/SKILL.md b/plugins/droid-control/skills/droid-control/SKILL.md index dabc9e1..0c9b006 100644 --- a/plugins/droid-control/skills/droid-control/SKILL.md +++ b/plugins/droid-control/skills/droid-control/SKILL.md @@ -35,9 +35,10 @@ Three independent lookups. Do all three, then load the union of skills they prod | Other terminal TUI | tuistory backend via `${DROID_PLUGIN_ROOT}/bin/tctl` | | Other terminal TUI (real terminal proof) | **true-input** | | Web page or Electron app | **agent-browser** | +| Native desktop GUI app | **desktop-control** | | Raw terminal byte sequences | **true-input** + **pty-capture** | -**tuistory** is the default for terminal work. Use **true-input** only when you need real terminal rendering evidence. +**tuistory** is the default for terminal work. Use **true-input** only when you need real terminal rendering evidence. On Linux, desktop-control rides upstream's pre-release tier -- its platform file documents the Wayland/AT-SPI/input caveats and when to fall back to **agent-browser** or **true-input**. ### 2. Stage route — what does the workflow need? @@ -157,7 +158,7 @@ For before/after comparison demos, launch both capture workers simultaneously: ## Shared tooling -Terminal drivers use the unified `tctl` wrapper. agent-browser has its own CLI and does not use `tctl`. +Terminal drivers use the unified `tctl` wrapper. 
agent-browser and desktop-control have their own CLIs (`agent-browser`, `cua-driver`) and do not use `tctl`. Drivers can be combined in one workflow — e.g., `tctl` for a CLI and `agent-browser` for a web UI it interacts with. @@ -170,6 +171,7 @@ Drivers can be combined in one workflow — e.g., `tctl` for a CLI and `agent-br | true-input | Windows (KVM) | `libvirt`, `qemu`, KVM VM with SPICE + SSH, `DROID_VM_*` env vars | `virt-manager` | | true-input | macOS (QEMU) | `qemu`, `socat`, macOS VM with SSH, `DROID_MAC_*` env vars | — | | agent-browser | All | `agent-browser` (+ `agent-browser install`) | — | +| desktop-control | All | `cua-driver` (+ daemon via `cua-driver serve`; macOS also `cua-driver permissions grant`) | upstream skill pack (`cua-driver skills install`) | | compose | All | `ffmpeg`, `ffprobe`, `agg` | — | | showcase | All | Node.js (>= 18), Chrome/Chromium | — | @@ -188,6 +190,10 @@ sudo apt-get install -y grim wf-recorder # optional: screenshots + v # agent-browser driver agent-browser install # one-time: downloads bundled Chromium +# desktop-control driver (Windows hosts: irm .../scripts/install.ps1 | iex) +curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh | bash +cua-driver skills install # upstream skill pack (deep tool reference) + # compose + showcase (video rendering) sudo apt-get install -y ffmpeg # video processing (includes ffprobe) cd ${DROID_PLUGIN_ROOT}/remotion && npm install # Remotion dependencies
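
Since only the tools for the chosen drivers need to be installed, a preflight check can confirm they are on PATH before a run starts. A minimal sketch, assuming a hypothetical helper named `missing_tools`; the tool names passed to it come from the prerequisites table.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration): print each named tool
# that is not found on PATH, one per line. No output means all are present.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

A desktop-control workflow that also composes video might call `missing_tools cua-driver ffmpeg ffprobe agg` and stop early if anything is printed. This only checks that the binaries exist; daemon health and macOS permissions are still the job of `cua-driver doctor` and `cua-driver status`.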