mirror of
https://github.com/harivansh-afk/deskctl.git
synced 2026-04-15 03:00:45 +00:00
Phase 6: utility commands, SKILL.md, AGENTS.md, README.md
- Implement screen_size via xcap Monitor, mouse_position via x11rb query_pointer, standalone screenshot with optional annotation, launch for spawning detached processes - Handler dispatchers for get-screen-size, get-mouse-position, screenshot, launch - SKILL.md agent discovery file with allowed-tools frontmatter - AGENTS.md contributor guidelines for AI agents - README.md with installation, quick start, architecture overview
This commit is contained in:
parent
567115a6c2
commit
03dfd6b6ea
5 changed files with 339 additions and 8 deletions
40
AGENTS.md
Normal file
40
AGENTS.md
Normal file
|
|
@ -0,0 +1,40 @@
|
|||
# Agent Guidelines
|
||||
|
||||
## Build
|
||||
|
||||
```bash
|
||||
cargo build
|
||||
cargo clippy
|
||||
```
|
||||
|
||||
## Run
|
||||
|
||||
Requires an X11 session with `DISPLAY` set.
|
||||
|
||||
```bash
|
||||
cargo run -- snapshot
|
||||
cargo run -- --json snapshot --annotate
|
||||
```
|
||||
|
||||
## Code Style
|
||||
|
||||
- No emojis in code or comments
|
||||
- Use `anyhow::Result` for all fallible functions
|
||||
- All daemon handler functions are async
|
||||
- Match field naming between CLI args, NDJSON protocol, and JSON output
|
||||
|
||||
## Architecture
|
||||
|
||||
- `src/cli/` - clap CLI parser and client-side socket connection
|
||||
- `src/daemon/` - tokio async daemon, request handler, state management
|
||||
- `src/backend/` - DesktopBackend trait and X11 implementation
|
||||
- `src/core/` - shared types, protocol, ref map, session detection
|
||||
|
||||
## Adding a New Command
|
||||
|
||||
1. Add the variant to `Command` enum in `src/cli/mod.rs`
|
||||
2. Add request building in `build_request()` in `src/cli/mod.rs`
|
||||
3. Add the action handler in `src/daemon/handler.rs`
|
||||
4. Add the backend method to `DesktopBackend` trait in `src/backend/mod.rs`
|
||||
5. Implement in `src/backend/x11.rs`
|
||||
6. Update `SKILL.md`
|
||||
53
README.md
Normal file
53
README.md
Normal file
|
|
@ -0,0 +1,53 @@
|
|||
# desktop-ctl
|
||||
|
||||
Desktop control CLI for AI agents on Linux X11. A single installable binary that gives agents full desktop access: screenshots with window refs, mouse/keyboard input, and window management.
|
||||
|
||||
Inspired by [agent-browser](https://github.com/vercel-labs/agent-browser) - but for the full desktop.
|
||||
|
||||
## Install
|
||||
|
||||
```bash
|
||||
cargo install desktop-ctl
|
||||
```
|
||||
|
||||
System dependencies (Debian/Ubuntu):
|
||||
```bash
|
||||
sudo apt install libxcb-dev libxrandr-dev libclang-dev
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# See the desktop
|
||||
desktop-ctl snapshot
|
||||
|
||||
# Click a window
|
||||
desktop-ctl click @w1
|
||||
|
||||
# Type text
|
||||
desktop-ctl type "hello world"
|
||||
|
||||
# Focus by name
|
||||
desktop-ctl focus "firefox"
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
Client-daemon architecture over Unix sockets (NDJSON wire protocol). The daemon starts automatically on first command and keeps the X11 connection alive for fast repeated calls.
|
||||
|
||||
```
|
||||
Agent -> desktop-ctl CLI (thin client) -> Unix socket -> desktop-ctl daemon -> X11
|
||||
```
|
||||
|
||||
## Requirements
|
||||
|
||||
- Linux with X11 session
|
||||
- Rust 1.75+ (for building)
|
||||
|
||||
## Wayland Support
|
||||
|
||||
Coming in v0.2. The trait-based backend design means adding Hyprland/Wayland support is a single trait implementation with zero refactoring of the core.
|
||||
|
||||
## License
|
||||
|
||||
MIT OR Apache-2.0
|
||||
116
SKILL.md
Normal file
116
SKILL.md
Normal file
|
|
@ -0,0 +1,116 @@
|
|||
---
|
||||
name: desktop-ctl
|
||||
description: Desktop control CLI for AI agents - screenshot, click, type, window management on Linux X11
|
||||
allowed-tools: Bash(desktop-ctl:*)
|
||||
---
|
||||
|
||||
# desktop-ctl
|
||||
|
||||
Desktop control CLI for AI agents on Linux X11. Provides a unified interface for screenshots, mouse/keyboard input, and window management with compact `@wN` window references.
|
||||
|
||||
## Core Workflow
|
||||
|
||||
1. **Snapshot** to see the desktop and get window refs
|
||||
2. **Act** using refs or coordinates (click, type, focus)
|
||||
3. **Repeat** as needed
|
||||
|
||||
## Quick Reference
|
||||
|
||||
### See the Desktop
|
||||
|
||||
```bash
|
||||
desktop-ctl snapshot # Screenshot + window tree with @wN refs
|
||||
desktop-ctl snapshot --annotate # Screenshot with bounding boxes and labels
|
||||
desktop-ctl snapshot --json # Structured JSON output
|
||||
desktop-ctl list-windows # Window tree without screenshot
|
||||
desktop-ctl screenshot /tmp/s.png # Screenshot only (no window tree)
|
||||
```
|
||||
|
||||
### Click and Type
|
||||
|
||||
```bash
|
||||
desktop-ctl click @w1 # Click center of window @w1
|
||||
desktop-ctl click 500,300 # Click absolute coordinates
|
||||
desktop-ctl dblclick @w2 # Double-click window @w2
|
||||
desktop-ctl type "hello world" # Type text into focused window
|
||||
desktop-ctl press enter # Press a key
|
||||
desktop-ctl hotkey ctrl c # Send Ctrl+C
|
||||
desktop-ctl hotkey ctrl shift t # Send Ctrl+Shift+T
|
||||
```
|
||||
|
||||
### Mouse Control
|
||||
|
||||
```bash
|
||||
desktop-ctl mouse move 500 300 # Move cursor to coordinates
|
||||
desktop-ctl mouse scroll 3 # Scroll down 3 units
|
||||
desktop-ctl mouse scroll -3 # Scroll up 3 units
|
||||
desktop-ctl mouse drag 100 100 500 500 # Drag from (100,100) to (500,500)
|
||||
```
|
||||
|
||||
### Window Management
|
||||
|
||||
```bash
|
||||
desktop-ctl focus @w2 # Focus window by ref
|
||||
desktop-ctl focus "firefox" # Focus window by name (substring match)
|
||||
desktop-ctl close @w3 # Close window gracefully
|
||||
desktop-ctl move-window @w1 100 200 # Move window to position
|
||||
desktop-ctl resize-window @w1 800 600 # Resize window
|
||||
```
|
||||
|
||||
### Utilities
|
||||
|
||||
```bash
|
||||
desktop-ctl get-screen-size # Screen resolution
|
||||
desktop-ctl get-mouse-position # Current cursor position
|
||||
desktop-ctl launch firefox # Launch an application
|
||||
desktop-ctl launch code -- --new-window # Launch with arguments
|
||||
```
|
||||
|
||||
### Daemon
|
||||
|
||||
```bash
|
||||
desktop-ctl daemon start # Start daemon manually
|
||||
desktop-ctl daemon stop # Stop daemon
|
||||
desktop-ctl daemon status # Check daemon status
|
||||
```
|
||||
|
||||
## Global Options
|
||||
|
||||
- `--json` : Output as structured JSON (all commands)
|
||||
- `--session NAME` : Session name for multiple daemon instances (default: "default")
|
||||
- `--socket PATH` : Custom Unix socket path
|
||||
|
||||
## Window Refs
|
||||
|
||||
After `snapshot` or `list-windows`, windows are assigned short refs:
|
||||
- `@w1` is the topmost (usually focused) window
|
||||
- `@w2`, `@w3`, etc. follow z-order (front to back)
|
||||
- Refs reset on each `snapshot` call
|
||||
- Use `--json` to see stable `xcb_id` for programmatic tracking
|
||||
|
||||
## Example Agent Workflow
|
||||
|
||||
```bash
|
||||
# 1. See what's on screen
|
||||
desktop-ctl snapshot --annotate
|
||||
|
||||
# 2. Focus the browser
|
||||
desktop-ctl focus "firefox"
|
||||
|
||||
# 3. Navigate to a URL
|
||||
desktop-ctl hotkey ctrl l
|
||||
desktop-ctl type "https://example.com"
|
||||
desktop-ctl press enter
|
||||
|
||||
# 4. Take a new snapshot to see the result
|
||||
desktop-ctl snapshot
|
||||
```
|
||||
|
||||
## Key Names for press/hotkey
|
||||
|
||||
Modifiers: `ctrl`, `alt`, `shift`, `super`
|
||||
Navigation: `enter`, `tab`, `escape`, `backspace`, `delete`, `space`
|
||||
Arrows: `up`, `down`, `left`, `right`
|
||||
Page: `home`, `end`, `pageup`, `pagedown`
|
||||
Function: `f1` through `f12`
|
||||
Characters: any single character (e.g. `a`, `1`, `/`)
|
||||
|
|
@ -288,22 +288,71 @@ impl super::DesktopBackend for X11Backend {
|
|||
Ok(())
|
||||
}
|
||||
|
||||
// Phase 6: utility stubs
|
||||
|
||||
fn screen_size(&self) -> Result<(u32, u32)> {
|
||||
anyhow::bail!("Utility commands not yet implemented (Phase 6)")
|
||||
let monitors = xcap::Monitor::all().context("Failed to enumerate monitors")?;
|
||||
let monitor = monitors.into_iter().next().context("No monitor found")?;
|
||||
let w = monitor.width().context("Failed to get monitor width")?;
|
||||
let h = monitor.height().context("Failed to get monitor height")?;
|
||||
Ok((w, h))
|
||||
}
|
||||
|
||||
fn mouse_position(&self) -> Result<(i32, i32)> {
|
||||
anyhow::bail!("Utility commands not yet implemented (Phase 6)")
|
||||
let reply = self
|
||||
.conn
|
||||
.query_pointer(self.root)?
|
||||
.reply()
|
||||
.context("Failed to query pointer")?;
|
||||
Ok((reply.root_x as i32, reply.root_y as i32))
|
||||
}
|
||||
|
||||
fn screenshot(&mut self, _path: &str, _annotate: bool) -> Result<String> {
|
||||
anyhow::bail!("Standalone screenshot not yet implemented (Phase 6)")
|
||||
fn screenshot(&mut self, path: &str, annotate: bool) -> Result<String> {
|
||||
let monitors = xcap::Monitor::all().context("Failed to enumerate monitors")?;
|
||||
let monitor = monitors.into_iter().next().context("No monitor found")?;
|
||||
|
||||
let mut image = monitor
|
||||
.capture_image()
|
||||
.context("Failed to capture screenshot")?;
|
||||
|
||||
if annotate {
|
||||
let windows = xcap::Window::all().unwrap_or_default();
|
||||
let mut window_infos = Vec::new();
|
||||
let mut ref_counter = 1usize;
|
||||
for win in &windows {
|
||||
let title = win.title().unwrap_or_default();
|
||||
let app_name = win.app_name().unwrap_or_default();
|
||||
if title.is_empty() && app_name.is_empty() {
|
||||
continue;
|
||||
}
|
||||
window_infos.push(crate::core::types::WindowInfo {
|
||||
ref_id: format!("w{ref_counter}"),
|
||||
xcb_id: win.id().unwrap_or(0),
|
||||
title,
|
||||
app_name,
|
||||
x: win.x().unwrap_or(0),
|
||||
y: win.y().unwrap_or(0),
|
||||
width: win.width().unwrap_or(0),
|
||||
height: win.height().unwrap_or(0),
|
||||
focused: win.is_focused().unwrap_or(false),
|
||||
minimized: win.is_minimized().unwrap_or(false),
|
||||
});
|
||||
ref_counter += 1;
|
||||
}
|
||||
annotate_screenshot(&mut image, &window_infos);
|
||||
}
|
||||
|
||||
fn launch(&self, _command: &str, _args: &[String]) -> Result<u32> {
|
||||
anyhow::bail!("Launch not yet implemented (Phase 6)")
|
||||
image.save(path).context("Failed to save screenshot")?;
|
||||
Ok(path.to_string())
|
||||
}
|
||||
|
||||
fn launch(&self, command: &str, args: &[String]) -> Result<u32> {
|
||||
let child = std::process::Command::new(command)
|
||||
.args(args)
|
||||
.stdin(std::process::Stdio::null())
|
||||
.stdout(std::process::Stdio::null())
|
||||
.stderr(std::process::Stdio::null())
|
||||
.spawn()
|
||||
.with_context(|| format!("Failed to launch: {command}"))?;
|
||||
Ok(child.id())
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -25,6 +25,10 @@ pub async fn handle_request(
|
|||
"move-window" => handle_move_window(request, state).await,
|
||||
"resize-window" => handle_resize_window(request, state).await,
|
||||
"list-windows" => handle_list_windows(state).await,
|
||||
"get-screen-size" => handle_get_screen_size(state).await,
|
||||
"get-mouse-position" => handle_get_mouse_position(state).await,
|
||||
"screenshot" => handle_screenshot(request, state).await,
|
||||
"launch" => handle_launch(request, state).await,
|
||||
action => Response::err(format!("Unknown action: {action}")),
|
||||
}
|
||||
}
|
||||
|
|
@ -372,6 +376,75 @@ async fn handle_list_windows(
|
|||
}
|
||||
}
|
||||
|
||||
async fn handle_get_screen_size(state: &Arc<Mutex<DaemonState>>) -> Response {
|
||||
let state = state.lock().await;
|
||||
match state.backend.screen_size() {
|
||||
Ok((w, h)) => Response::ok(serde_json::json!({"width": w, "height": h})),
|
||||
Err(e) => Response::err(format!("Failed: {e}")),
|
||||
}
|
||||
}
|
||||
|
||||
async fn handle_get_mouse_position(state: &Arc<Mutex<DaemonState>>) -> Response {
|
||||
let state = state.lock().await;
|
||||
match state.backend.mouse_position() {
|
||||
Ok((x, y)) => Response::ok(serde_json::json!({"x": x, "y": y})),
|
||||
Err(e) => Response::err(format!("Failed: {e}")),
|
||||
}
|
||||
}
|
||||
|
||||
async fn handle_screenshot(
|
||||
request: &Request,
|
||||
state: &Arc<Mutex<DaemonState>>,
|
||||
) -> Response {
|
||||
let annotate = request
|
||||
.extra
|
||||
.get("annotate")
|
||||
.and_then(|v| v.as_bool())
|
||||
.unwrap_or(false);
|
||||
let path = request
|
||||
.extra
|
||||
.get("path")
|
||||
.and_then(|v| v.as_str())
|
||||
.map(|s| s.to_string())
|
||||
.unwrap_or_else(|| {
|
||||
let ts = std::time::SystemTime::now()
|
||||
.duration_since(std::time::UNIX_EPOCH)
|
||||
.unwrap_or_default()
|
||||
.as_millis();
|
||||
format!("/tmp/desktop-ctl-{ts}.png")
|
||||
});
|
||||
let mut state = state.lock().await;
|
||||
match state.backend.screenshot(&path, annotate) {
|
||||
Ok(saved) => Response::ok(serde_json::json!({"screenshot": saved})),
|
||||
Err(e) => Response::err(format!("Screenshot failed: {e}")),
|
||||
}
|
||||
}
|
||||
|
||||
async fn handle_launch(
|
||||
request: &Request,
|
||||
state: &Arc<Mutex<DaemonState>>,
|
||||
) -> Response {
|
||||
let command = match request.extra.get("command").and_then(|v| v.as_str()) {
|
||||
Some(c) => c.to_string(),
|
||||
None => return Response::err("Missing 'command' field"),
|
||||
};
|
||||
let args: Vec<String> = request
|
||||
.extra
|
||||
.get("args")
|
||||
.and_then(|v| v.as_array())
|
||||
.map(|arr| {
|
||||
arr.iter()
|
||||
.filter_map(|v| v.as_str().map(String::from))
|
||||
.collect()
|
||||
})
|
||||
.unwrap_or_default();
|
||||
let state = state.lock().await;
|
||||
match state.backend.launch(&command, &args) {
|
||||
Ok(pid) => Response::ok(serde_json::json!({"pid": pid, "command": command})),
|
||||
Err(e) => Response::err(format!("Launch failed: {e}")),
|
||||
}
|
||||
}
|
||||
|
||||
fn parse_coords(s: &str) -> Option<(i32, i32)> {
|
||||
let parts: Vec<&str> = s.split(',').collect();
|
||||
if parts.len() == 2 {
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue