align docs and contract

Harivansh Rathi 2026-03-26 08:17:07 -04:00 committed by Hari
parent c37589ccf4
commit 14c8956321
10 changed files with 590 additions and 657 deletions


@ -6,73 +6,93 @@ toc: true
# Architecture
## Client-daemon model
## Public model
deskctl uses a client-daemon architecture over Unix sockets. The daemon starts automatically on the first command and keeps the X11 connection alive so repeated calls skip the connection setup overhead.
`deskctl` is a thin, non-interactive X11 control primitive for agent loops.
The public flow is:
Each command opens a new connection to the daemon, sends a single NDJSON request, reads one NDJSON response, and exits.
- diagnose with `deskctl doctor`
- observe with `snapshot`, `list-windows`, and grouped `get` commands
- wait with grouped `wait` commands instead of shell `sleep`
- act with explicit selectors or coordinates
- verify with another read or snapshot
## Wire protocol
The tool stays intentionally narrow. It does not try to be a full desktop shell
or a speculative Wayland abstraction.
## Client-daemon architecture
The CLI talks to an auto-managed daemon over a Unix socket. The daemon keeps
the X11 connection alive so repeated commands stay fast and share the same
session-scoped window identity map.
Each CLI invocation sends one request, reads one response, and exits.
## Runtime contract
Requests and responses are newline-delimited JSON (NDJSON) over a Unix socket.
**Request:**
All commands share the same JSON envelope:
```json
{ "id": "r123456", "action": "snapshot", "annotate": true }
{
"success": true,
"data": {},
"error": null
}
```
**Response:**
For window payloads, the public identity is `window_id`, not an X11 handle.
That keeps the contract backend-neutral even though the current support
boundary is X11-only.
```json
{"success": true, "data": {"screenshot": "/tmp/deskctl-1234567890.png", "windows": [...]}}
```
The complete stable-vs-best-effort policy lives on the
[runtime contract](/runtime-contract) page.
Error responses include an `error` field:
## Sessions and sockets
```json
{ "success": false, "error": "window not found: @w99" }
```
Each session gets its own socket path, PID file, and live window mapping.
## Socket location
Public socket resolution order:
The daemon socket is resolved in this order:
1. `--socket` flag (highest priority)
2. `$DESKCTL_SOCKET_DIR/{session}.sock`
3. `$XDG_RUNTIME_DIR/deskctl/{session}.sock`
1. `--socket`
2. `DESKCTL_SOCKET_DIR/{session}.sock`
3. `XDG_RUNTIME_DIR/deskctl/{session}.sock`
4. `~/.deskctl/{session}.sock`
PID files are stored alongside the socket.
Most users should let `deskctl` manage this automatically. `--session` is the
main public knob when you need isolated daemon instances.
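The resolution order above can be sketched in a few lines of Python. This is an illustrative sketch, not the daemon's actual code: the name `resolve_socket_path` and its parameters are hypothetical, it covers only the four documented steps, and it assumes an empty environment variable is treated as unset.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_socket_path(
    session: str = "default",
    socket_flag: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Hypothetical helper mirroring the documented resolution order."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # 1. An explicit --socket value wins.
    if socket_flag:
        return Path(socket_flag)
    # 2. DESKCTL_SOCKET_DIR/{session}.sock
    socket_dir = env.get("DESKCTL_SOCKET_DIR")
    if socket_dir:
        return Path(socket_dir) / f"{session}.sock"
    # 3. XDG_RUNTIME_DIR/deskctl/{session}.sock
    runtime_dir = env.get("XDG_RUNTIME_DIR")
    if runtime_dir:
        return Path(runtime_dir) / "deskctl" / f"{session}.sock"
    # 4. ~/.deskctl/{session}.sock
    return home / ".deskctl" / f"{session}.sock"
```

Passing `env` and `home` explicitly keeps the sketch easy to exercise without touching the real environment.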
## Sessions
## Diagnostics and failure handling
Multiple isolated daemon instances can run simultaneously using the `--session` flag:
`deskctl doctor` runs before daemon startup and checks:
```sh
deskctl --session workspace1 snapshot
deskctl --session workspace2 snapshot
```
- display/session setup
- X11 connectivity
- basic window enumeration
- screenshot viability
- socket directory and stale-socket health
Each session has its own socket, PID file, and window ref map.
Selector and wait failures are structured in `--json` mode so clients can
recover without scraping text.
## Backend design
## Backend notes
The core is built around a `DesktopBackend` trait. The current implementation uses `x11rb` for X11 protocol operations and `enigo` for input simulation.
The backend is built around a `DesktopBackend` trait and currently ships with
an X11 implementation backed by `x11rb`.
The trait-based design means adding Wayland support is a single trait implementation with no changes to the core, CLI, or daemon code.
The important public guarantee is not "portable desktop automation." The
important guarantee is "a correct and unsurprising Linux X11 runtime contract."
## X11 integration
## X11 support boundary
Window detection uses EWMH properties:
This phase supports Linux X11 only.
| Property | Purpose |
| --------------------------- | ------------------------ |
| `_NET_CLIENT_LIST_STACKING` | Window stacking order |
| `_NET_ACTIVE_WINDOW` | Currently focused window |
| `_NET_WM_NAME` | Window title (UTF-8) |
| `_NET_WM_STATE_HIDDEN` | Minimized state |
| `_NET_CLOSE_WINDOW` | Graceful close |
| `WM_CLASS` | Application class/name |
That means:
Falls back to `XQueryTree` if `_NET_CLIENT_LIST_STACKING` is unavailable.
- EWMH/window-manager properties matter
- monitor naming and some ordering details are best-effort
- Wayland and Hyprland are out of scope for the current contract
The runtime documents those boundaries explicitly instead of pretending the
surface is broader than it is.


@ -6,167 +6,101 @@ toc: true
# Commands
## Snapshot
Capture a screenshot and get the window tree:
## Observe
```sh
deskctl doctor
deskctl snapshot
deskctl snapshot --annotate
```
With `--annotate`, colored bounding boxes and `@wN` labels are drawn on the screenshot. Each window gets a unique color from an 8-color palette. Minimized windows are skipped.
The screenshot is saved to `/tmp/deskctl-{timestamp}.png`.
## Click
Click the center of a window by ref, or click exact coordinates:
```sh
deskctl click @w1
deskctl click 960,540
```
## Double click
```sh
deskctl dblclick @w1
deskctl dblclick 500,300
```
## Type
Type a string into the focused window:
```sh
deskctl type "hello world"
```
## Press
Press a single key:
```sh
deskctl press enter
deskctl press tab
deskctl press escape
```
Supported key names: `enter`, `tab`, `escape`, `backspace`, `delete`, `space`, `up`, `down`, `left`, `right`, `home`, `end`, `pageup`, `pagedown`, `f1`-`f12`, or any single character.
## Hotkey
Send a key combination. List modifier keys first, then the target key:
```sh
deskctl hotkey ctrl c
deskctl hotkey ctrl shift t
deskctl hotkey alt f4
```
Modifier names: `ctrl`, `alt`, `shift`, `super` (also `meta` or `win`).
## Mouse move
Move the cursor to absolute coordinates:
```sh
deskctl mouse move 100 200
```
## Mouse scroll
Scroll the mouse wheel. Positive values scroll down, negative scroll up:
```sh
deskctl mouse scroll 3
deskctl mouse scroll -5
deskctl mouse scroll 3 --axis horizontal
```
## Mouse drag
Drag from one position to another:
```sh
deskctl mouse drag 100 200 500 600
```
## Focus
Focus a window by ref or by name (case-insensitive substring match):
```sh
deskctl focus @w1
deskctl focus "firefox"
```
## Close
Close a window gracefully:
```sh
deskctl close @w2
deskctl close "terminal"
```
## Move window
Move a window to an absolute position:
```sh
deskctl move-window @w1 0 0
deskctl move-window "firefox" 100 100
```
## Resize window
Resize a window:
```sh
deskctl resize-window @w1 1280 720
```
## List windows
List all windows without taking a screenshot:
```sh
deskctl list-windows
```
## Get screen size
```sh
deskctl screenshot
deskctl screenshot /tmp/screen.png
deskctl get active-window
deskctl get monitors
deskctl get version
deskctl get systeminfo
deskctl get-screen-size
```
## Get mouse position
```sh
deskctl get-mouse-position
```
## Screenshot
`doctor` checks the runtime before daemon startup. `snapshot` produces a
screenshot plus window refs. `list-windows` is the same window tree without the
side effect of writing a screenshot.
Take a screenshot without the window tree. Optionally specify a save path:
## Wait
```sh
deskctl screenshot
deskctl screenshot /tmp/my-screenshot.png
deskctl screenshot --annotate
deskctl wait window --selector 'title=Firefox' --timeout 10
deskctl wait focus --selector 'id=win3' --timeout 5
deskctl --json wait window --selector 'class=firefox' --poll-ms 100
```
## Launch
Wait commands return the matched window payload on success. In `--json` mode,
timeouts and selector failures expose structured `kind` values.
Launch an application:
## Act on a window
```sh
deskctl launch firefox
deskctl launch code --args /path/to/project
deskctl focus @w1
deskctl focus 'title=Firefox'
deskctl click @w1
deskctl click 960,540
deskctl dblclick @w2
deskctl close @w3
deskctl move-window @w1 100 120
deskctl resize-window @w1 1280 720
```
Selector-driven actions accept refs, explicit selector modes, or absolute
coordinates where appropriate.
## Input and mouse
```sh
deskctl type "hello world"
deskctl press enter
deskctl hotkey ctrl shift t
deskctl mouse move 100 200
deskctl mouse scroll 3
deskctl mouse scroll 3 --axis horizontal
deskctl mouse drag 100 200 500 600
```
Supported key names include `enter`, `tab`, `escape`, `backspace`, `delete`,
`space`, arrow keys, paging keys, `f1` through `f12`, and any single
character.
## Launch
```sh
deskctl launch firefox
deskctl launch code -- --new-window
```
## Selectors
Prefer explicit selectors when the target matters:
```sh
ref=w1
id=win1
title=Firefox
class=firefox
focused
```
Legacy shorthand is still supported:
```sh
@w1
w1
win1
```
Bare strings like `firefox` are fuzzy matches. They resolve when there is one
match and fail with candidate windows when there are multiple matches.
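As a rough illustration of how a client might tell these selector forms apart before sending them, here is a minimal Python sketch. It is not `deskctl`'s parser: the function name `classify_selector` is hypothetical, and the `w<digits>` and `win<digits>` patterns are assumptions inferred from the examples above.

```python
import re
from typing import Tuple

# Explicit modes shown in the selector examples.
EXPLICIT_MODES = ("ref", "id", "title", "class")


def classify_selector(selector: str) -> Tuple[str, str]:
    """Return (mode, value) for a selector string (hypothetical helper)."""
    if selector == "focused":
        return ("focused", "")
    mode, sep, value = selector.partition("=")
    if sep and mode in EXPLICIT_MODES:
        return (mode, value)
    # Legacy shorthand: @w1 or w1 is treated as a ref (assumed pattern).
    if re.fullmatch(r"@?w\d+", selector):
        return ("ref", selector.lstrip("@"))
    # Legacy shorthand: win1 is treated as a window id (assumed pattern).
    if re.fullmatch(r"win\d+", selector):
        return ("id", selector)
    # Anything else is a bare string, which the docs describe as a fuzzy match.
    return ("fuzzy", selector)
```

A client could use the `fuzzy` result as a hint to prefer an explicit mode when the target matters.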
## Global options
| Flag | Env | Description |
@ -174,3 +108,6 @@ deskctl launch code --args /path/to/project
| `--json` | | Output as JSON |
| `--socket <path>` | `DESKCTL_SOCKET` | Path to daemon Unix socket |
| `--session <name>` | | Session name for multiple daemons (default: `default`) |
`deskctl` manages the daemon automatically. Most users never need to think
about it beyond `--session` and `--socket`.


@ -8,17 +8,49 @@ import DocLayout from "../layouts/DocLayout.astro";
<img src="/favicon.svg" alt="" width="40" height="40" />
</header>
<p>
Desktop control CLI for AI agents on Linux X11. Compact JSON output for
agent loops. Screenshot, click, type, scroll, drag, and manage windows
through a fast client-daemon architecture. 100% native Rust.
<p class="tagline">non-interactive desktop control for AI agents</p>
<div class="badges" aria-label="package and runtime badges">
<a href="https://www.npmjs.com/package/deskctl-cli">
<img
src="https://img.shields.io/npm/v/deskctl-cli?label=npm"
alt="npm version badge"
/>
</a>
<a href="https://github.com/harivansh-afk/deskctl/releases">
<img
src="https://img.shields.io/github/v/release/harivansh-afk/deskctl?label=release"
alt="github release badge"
/>
</a>
<img
src="https://img.shields.io/badge/runtime-linux--x11-111827"
alt="linux x11 runtime badge"
/>
<a href="https://www.npmjs.com/package/deskctl-cli">
<img
src="https://img.shields.io/badge/install-npm%20i%20-g%20deskctl--cli-111827"
alt="npm install command badge"
/>
</a>
</div>
<p class="lede">
<code>deskctl</code> is a thin X11 control primitive for agent loops: diagnose
the runtime, observe the desktop, wait for state transitions, act deterministically,
then verify.
</p>
<h2>Getting started</h2>
<pre><code>npm install -g deskctl-cli
deskctl doctor
deskctl snapshot --annotate</code></pre>
<h2>Start here</h2>
<ul>
<li><a href="/installation">Installation</a></li>
<li><a href="/quick-start">Quick start</a></li>
<li><a href="/runtime-contract">Runtime contract</a></li>
</ul>
<h2>Reference</h2>
@ -28,14 +60,27 @@ import DocLayout from "../layouts/DocLayout.astro";
<li><a href="/architecture">Architecture</a></li>
</ul>
<h2>Agent skill</h2>
<p>
There is also an installable skill for <code>skills.sh</code>-style agent runtimes:
</p>
<pre><code>npx skills add harivansh-afk/deskctl -s deskctl</code></pre>
<h2>Links</h2>
<ul>
<li>
<a href="https://www.npmjs.com/package/deskctl-cli">npm package</a>
</li>
<li>
<a href="https://github.com/harivansh-afk/deskctl">GitHub</a>
</li>
<li>
<a href="https://crates.io/crates/deskctl">crates.io</a>
<a href="https://github.com/harivansh-afk/deskctl/releases">
GitHub releases
</a>
</li>
</ul>
</DocLayout>


@ -6,43 +6,68 @@ toc: true
# Installation
## Cargo
## Default install
```sh
cargo install deskctl
npm install -g deskctl-cli
deskctl --help
```
## From source
`deskctl-cli` is the default install path. It installs the `deskctl` command by
downloading the matching GitHub Release asset for the supported runtime target.
## One-shot usage
```sh
npx deskctl-cli --help
```
## Agent skill
For `skills.sh`-style runtimes:
```sh
npx skills add harivansh-afk/deskctl -s deskctl
```
The repo skill lives under `skills/deskctl` and is designed around the same
observe -> wait -> act -> verify loop as the CLI.
## Other install paths
### Nix
```sh
nix run github:harivansh-afk/deskctl -- --help
nix profile install github:harivansh-afk/deskctl
```
### Build from source
```sh
git clone https://github.com/harivansh-afk/deskctl
cd deskctl
cargo build --release
cargo build
```
## Docker (cross-compile for Linux)
Source builds on Linux require:
Build a static Linux binary from any platform:
- Rust 1.75+
- `pkg-config`
- X11 development libraries such as `libx11-dev` and `libxtst-dev`
```sh
docker compose -f docker/docker-compose.yml run --rm build
```
This writes `dist/deskctl-linux-x86_64`.
## Deploy to a remote machine
Copy the binary over SSH when `scp` is not available:
```sh
ssh -p 443 user@host 'cat > ~/deskctl && chmod +x ~/deskctl' < dist/deskctl-linux-x86_64
```
## Requirements
## Runtime requirements
- Linux with an active X11 session
- `DISPLAY` environment variable set (e.g. `DISPLAY=:1`)
- `XDG_SESSION_TYPE=x11`
- A window manager that exposes EWMH properties (`_NET_CLIENT_LIST_STACKING`, `_NET_ACTIVE_WINDOW`)
- `DISPLAY` set to a usable X11 display, such as `DISPLAY=:1`
- `XDG_SESSION_TYPE=x11` or an equivalent X11 session environment
- a window manager or desktop environment that exposes standard EWMH properties
such as `_NET_CLIENT_LIST_STACKING` and `_NET_ACTIVE_WINDOW`
No extra native libraries are needed beyond the standard glibc runtime (`libc`, `libm`, `libgcc_s`).
The binary itself only depends on the standard Linux glibc runtime.
If setup fails, run:
```sh
deskctl doctor
```


@ -6,50 +6,72 @@ toc: true
# Quick start
## Core workflow
The typical agent loop is: snapshot the desktop, interpret the result, act on it.
## Install and diagnose
```sh
# 1. see the desktop
deskctl --json snapshot --annotate
npm install -g deskctl-cli
deskctl doctor
```
# 2. click a window by its ref
deskctl click @w1
Use `deskctl doctor` first. It checks X11 connectivity, basic enumeration,
screenshot viability, and socket health before you start driving the desktop.
# 3. type into the focused window
deskctl type "hello world"
## Observe
# 4. press a key
```sh
deskctl snapshot --annotate
deskctl list-windows
deskctl get active-window
deskctl get monitors
```
Use `snapshot` when you want a screenshot artifact plus window refs. Use
`list-windows` when you only need the current window tree without writing a
screenshot.
## Target windows cleanly
Prefer explicit selectors when you need deterministic targeting:
```sh
ref=w1
id=win1
title=Firefox
class=firefox
focused
```
Legacy refs such as `@w1` still work after `snapshot` or `list-windows`. Bare
strings like `firefox` are fuzzy matches and now fail on ambiguity.
## Wait, act, verify
The core loop is:
```sh
# observe
deskctl snapshot --annotate
# wait
deskctl wait window --selector 'title=Firefox' --timeout 10
# act
deskctl focus 'title=Firefox'
deskctl hotkey ctrl l
deskctl type "https://example.com"
deskctl press enter
# verify
deskctl wait focus --selector 'title=Firefox' --timeout 5
deskctl snapshot
```
The `--annotate` flag draws colored bounding boxes and `@wN` labels on the screenshot so agents can visually identify windows.
The wait commands return the matched window payload on success, so they compose
cleanly into the next action.
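For agents that drive the loop from a script, the same sequence might look like the Python sketch below. The names `run_deskctl` and `open_url` are hypothetical, and the sketch assumes that `deskctl --json` prints the envelope on stdout for both successes and failures.

```python
import json
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_deskctl(args: Sequence[str]) -> str:
    """Run `deskctl --json <args>` and return stdout (assumed to hold the envelope)."""
    proc = subprocess.run(
        ["deskctl", "--json", *args], capture_output=True, text=True
    )
    return proc.stdout


def open_url(url: str, run: Runner = run_deskctl) -> dict:
    """Wait for Firefox, act on it, then verify focus; stop at the first failure."""
    steps = [
        ["wait", "window", "--selector", "title=Firefox", "--timeout", "10"],
        ["focus", "title=Firefox"],
        ["hotkey", "ctrl", "l"],
        ["type", url],
        ["press", "enter"],
        ["wait", "focus", "--selector", "title=Firefox", "--timeout", "5"],
    ]
    last: dict = {}
    for step in steps:
        last = json.loads(run(step))
        if not last.get("success"):
            raise RuntimeError(f"step {step[0]} failed: {last.get('error')}")
    return last
```

Injecting the runner keeps the loop testable on machines without an X11 session.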
## Window refs
## Use `--json` when parsing matters
Every `snapshot` assigns refs like `@w1`, `@w2`, etc. to each visible window, ordered top-to-bottom by stacking order. Use these refs anywhere a selector is expected:
```sh
deskctl click @w1
deskctl focus @w3
deskctl close @w2
```
You can also select windows by name (case-insensitive substring match):
```sh
deskctl focus "firefox"
deskctl close "terminal"
```
## JSON output
Pass `--json` for machine-readable output. This is the primary mode for agent integrations:
```sh
deskctl --json snapshot
```
Every command supports `--json` and uses the same top-level envelope:
```json
{
@ -59,7 +81,7 @@ deskctl --json snapshot
"windows": [
{
"ref_id": "w1",
"xcb_id": 12345678,
"window_id": "win1",
"title": "Firefox",
"app_name": "firefox",
"x": 0,
@ -74,14 +96,8 @@ deskctl --json snapshot
}
```
## Daemon lifecycle
Use `window_id` for stable targeting inside a live daemon session. The exact
text formatting is intentionally compact, but JSON is the parsing contract.
The daemon starts automatically on the first command. It keeps the X11 connection alive so repeated calls are fast. You do not need to manage it manually.
```sh
# check if the daemon is running
deskctl daemon status
# stop it explicitly
deskctl daemon stop
```
The full stable-vs-best-effort contract lives on the
[runtime contract](/runtime-contract) page.


@ -0,0 +1,177 @@
---
layout: ../layouts/DocLayout.astro
title: Runtime contract
toc: true
---
# Runtime contract
This page defines the current public output contract for `deskctl`.
It is intentionally scoped to the current Linux X11 runtime surface. It does
not promise stability for future Wayland or window-manager-specific features.
## JSON envelope
Every command supports `--json` and uses the same top-level envelope:
```json
{
"success": true,
"data": {},
"error": null
}
```
Stable top-level fields:
- `success`
- `data`
- `error`
If `success` is `false`, the command exits non-zero in both text mode and JSON
mode.
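A client can lean on the envelope alone to separate success from failure. The following Python sketch is one way to do that; `DeskctlError` and `unwrap_envelope` are hypothetical names, and the sketch assumes `error` is either `null` or a human-readable string.

```python
import json
from typing import Any


class DeskctlError(Exception):
    """Raised when the envelope reports success == false (hypothetical name)."""


def unwrap_envelope(raw: str) -> Any:
    """Return `data` on success, otherwise raise with the `error` text."""
    envelope = json.loads(raw)
    if envelope.get("success") is True:
        return envelope.get("data")
    raise DeskctlError(envelope.get("error") or "unknown error")
```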
## Stable window fields
Whenever a response includes a window payload, these fields are stable:
- `ref_id`
- `window_id`
- `title`
- `app_name`
- `x`
- `y`
- `width`
- `height`
- `focused`
- `minimized`
`window_id` is the public session-scoped identifier for programmatic targeting.
`ref_id` is a short-lived convenience handle from the current ref map.
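One way to consume the stable fields is to copy them into a typed record and ignore everything else. The sketch below is illustrative only: the `Window` class is hypothetical, and the field types (strings, integers, booleans) are assumptions, since this page lists the field names but not their types.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class Window:
    """Typed view over the stable window fields (types are assumed)."""

    ref_id: str
    window_id: str
    title: str
    app_name: str
    x: int
    y: int
    width: int
    height: int
    focused: bool
    minimized: bool

    @classmethod
    def from_payload(cls, payload: Mapping[str, Any]) -> "Window":
        # Copy only the stable fields; unknown keys are ignored,
        # and a missing stable field raises KeyError.
        names = [f.name for f in fields(cls)]
        return cls(**{name: payload[name] for name in names})
```

Ignoring unknown keys means the record keeps working if best-effort fields appear alongside the stable ones.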
## Stable grouped reads
`deskctl get active-window`
- stable: `data.window`
`deskctl get monitors`
- stable: `data.count`
- stable: `data.monitors`
Stable per-monitor fields:
- `name`
- `x`
- `y`
- `width`
- `height`
- `width_mm`
- `height_mm`
- `primary`
- `automatic`
`deskctl get version`
- stable: `data.version`
- stable: `data.backend`
`deskctl get systeminfo`
- stable: `data.backend`
- stable: `data.display`
- stable: `data.session_type`
- stable: `data.session`
- stable: `data.socket_path`
- stable: `data.screen`
- stable: `data.monitor_count`
- stable: `data.monitors`
## Stable waits
`deskctl wait window`
`deskctl wait focus`
- stable: `data.wait`
- stable: `data.selector`
- stable: `data.elapsed_ms`
- stable: `data.window`
## Stable selector-driven action fields
When selector-driven actions return resolved window data, these fields are
stable when present:
- `data.ref_id`
- `data.window_id`
- `data.title`
- `data.selector`
This applies to:
- `click`
- `dblclick`
- `focus`
- `close`
- `move-window`
- `resize-window`
## Stable artifact fields
For `snapshot` and `screenshot`:
- stable: `data.screenshot`
When a command also returns windows, `data.windows` uses the stable window
payload documented above.
## Stable structured error kinds
When a command fails with structured JSON data, these error kinds are stable:
- `selector_not_found`
- `selector_ambiguous`
- `selector_invalid`
- `timeout`
- `not_found`
- `window_not_focused` in `data.last_observation.kind` or an equivalent wait
observation payload
Stable structured failure fields include:
- `data.kind`
- `data.selector`
- `data.mode`
- `data.candidates`
- `data.message`
- `data.wait`
- `data.timeout_ms`
- `data.poll_ms`
- `data.last_observation`
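Because the `kind` values are stable, a client can branch on them instead of matching message text. The Python sketch below shows one possible mapping; the function name `classify_failure`, the returned labels, and the choice to treat only `timeout` as retryable are all assumptions, not part of the contract.

```python
import json

# Assumption: the caller decides which kinds are worth retrying.
RETRYABLE_KINDS = {"timeout"}
SELECTOR_KINDS = {"selector_not_found", "selector_ambiguous", "selector_invalid"}


def classify_failure(raw: str) -> str:
    """Return 'ok', 'retry', 'fix_selector', or 'fatal' for a JSON response."""
    envelope = json.loads(raw)
    if envelope.get("success"):
        return "ok"
    data = envelope.get("data") or {}
    kind = data.get("kind")
    if kind in RETRYABLE_KINDS:
        return "retry"
    if kind in SELECTOR_KINDS:
        return "fix_selector"
    # Unknown or missing kinds fall through to a conservative default.
    return "fatal"
```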
## Best-effort fields
These values are useful but environment-dependent and should not be treated as
strict parsing guarantees:
- exact monitor naming conventions
- EWMH/window-manager-dependent ordering details
- cosmetic text formatting in non-JSON mode
- default screenshot file names when no explicit path was provided
- stderr wording outside the structured `kind` classifications above
## Text mode expectations
Text mode is intended to stay compact and useful for follow-up commands.
The exact whitespace and alignment are not stable. The stable behavioral
expectations are:
- important reads print actionable identifiers or geometry
- selector failures print enough detail to recover without `--json`
- artifact-producing commands print the artifact path
- window listings print both `@wN` refs and `window_id` values
If you need strict parsing, use `--json`.


@ -65,6 +65,23 @@ main {
font-style: italic;
}
.lede {
font-size: 1.05rem;
max-width: 42rem;
}
.badges {
display: flex;
flex-wrap: wrap;
gap: 0.6rem;
margin-bottom: 1.25rem;
}
.badges a,
.badges img {
display: block;
}
header {
display: flex;
align-items: center;
@ -117,6 +134,10 @@ a:hover {
text-decoration-thickness: 2px;
}
img {
max-width: 100%;
}
ul,
ol {
padding-left: 1.25em;