Add desktop runtime management (Xvfb, openbox, dbus), screen capture, mouse/keyboard input, and video streaming via neko binary extracted from the m1k1o/neko container. Includes Docker test rig, TypeScript SDK desktop support, and inspector Desktop tab. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Desktop Computer Use API Enhancements
Context
Competitive analysis of Daytona, Cloudflare Sandbox SDK, and CUA revealed significant gaps in our desktop computer use API. Both Daytona and Cloudflare have or are building screenshot compression, hotkey combos, mouseDown/mouseUp, keyDown/keyUp, per-component process health, and live desktop streaming. CUA additionally has window management and accessibility trees. We have none of these. This plan closes the most impactful gaps across 7 tasks.
Execution Order
Sprint 1 (parallel, no dependencies): Tasks 1, 2, 3, 4
Sprint 2 (foundational refactor): Task 5
Sprint 3 (parallel, both depend on Task 5): Tasks 6, 7
Task 1: Unify keyboard press with object modifiers
What: Change DesktopKeyboardPressRequest to accept a modifiers object instead of requiring DSL strings like "ctrl+c".
Files:
- server/packages/sandbox-agent/src/desktop_types.rs: Add DesktopKeyModifiers { ctrl, shift, alt, cmd } struct (all Option<bool>). Add modifiers: Option<DesktopKeyModifiers> to DesktopKeyboardPressRequest.
- server/packages/sandbox-agent/src/desktop_runtime.rs: Modify press_key_args() (~line 1349) to build the xdotool key string from the modifiers object. If modifiers are present, construct a "ctrl+shift+a" style string. cmd maps to super.
- server/packages/sandbox-agent/src/router.rs: Add DesktopKeyModifiers to the OpenAPI schemas list.
- docs/openapi.json: Regenerate.
Backward compatible: Old {"key": "ctrl+a"} still works. New form: {"key": "a", "modifiers": {"ctrl": true}}.
Test: Unit test that press_key_args("a", Some({ctrl: true, shift: true})) produces ["key", "--", "ctrl+shift+a"]. Integration test with both old and new request shapes.
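A minimal sketch of the modifier handling, assuming the struct shape listed above. The modifier order (ctrl, shift, alt, super) is an assumption chosen to match the expected test output; the real press_key_args() in desktop_runtime.rs may have a different signature.

```rust
/// Hypothetical mirror of the planned DesktopKeyModifiers struct.
#[derive(Debug, Default, Clone, Copy)]
pub struct DesktopKeyModifiers {
    pub ctrl: Option<bool>,
    pub shift: Option<bool>,
    pub alt: Option<bool>,
    pub cmd: Option<bool>,
}

/// Builds the xdotool argument list for a key press.
/// With no modifiers the key string is passed through unchanged,
/// so the old "ctrl+a" DSL form keeps working.
pub fn press_key_args(key: &str, modifiers: Option<DesktopKeyModifiers>) -> Vec<String> {
    let mut parts: Vec<&str> = Vec::new();
    if let Some(m) = modifiers {
        // Fixed order (an assumption) keeps the output deterministic.
        if m.ctrl == Some(true) {
            parts.push("ctrl");
        }
        if m.shift == Some(true) {
            parts.push("shift");
        }
        if m.alt == Some(true) {
            parts.push("alt");
        }
        if m.cmd == Some(true) {
            // cmd maps to the X11 "super" modifier.
            parts.push("super");
        }
    }
    parts.push(key);
    vec!["key".to_string(), "--".to_string(), parts.join("+")]
}
```

Treating None and Some(false) the same way means clients can omit any modifier they do not need.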
Task 2: Add mouseDown/mouseUp and keyDown/keyUp endpoints
What: 4 new endpoints for low-level press/release control.
Endpoints:
- POST /v1/desktop/mouse/down: xdotool mousedown BUTTON (optional x,y moves first)
- POST /v1/desktop/mouse/up: xdotool mouseup BUTTON
- POST /v1/desktop/keyboard/down: xdotool keydown KEY
- POST /v1/desktop/keyboard/up: xdotool keyup KEY
Files:
- server/packages/sandbox-agent/src/desktop_types.rs: Add DesktopMouseDownRequest, DesktopMouseUpRequest (x/y optional, button optional), DesktopKeyboardDownRequest, DesktopKeyboardUpRequest (key: String).
- server/packages/sandbox-agent/src/desktop_runtime.rs: Add 4 public methods following existing click_mouse() / press_key() patterns.
- server/packages/sandbox-agent/src/router.rs: Add 4 routes, 4 handlers with utoipa annotations.
- sdks/typescript/src/client.ts: Add mouseDownDesktop(), mouseUpDesktop(), keyDownDesktop(), keyUpDesktop().
- docs/openapi.json: Regenerate.
Test: Integration test: mouseDown → mousemove → mouseUp sequence. keyDown → keyUp sequence.
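One way the "optional x,y moves first" behavior could be expressed is by chaining mousemove and mousedown in a single xdotool invocation. This is a sketch: the field types, the mouse_down_args() name, and the default of button 1 are assumptions, not part of the plan.

```rust
/// Hypothetical request shape; the plan only states that x/y and button are optional.
pub struct DesktopMouseDownRequest {
    pub x: Option<u32>,
    pub y: Option<u32>,
    pub button: Option<u8>,
}

/// Builds xdotool arguments for a mouse-down, optionally preceded by a move.
pub fn mouse_down_args(req: &DesktopMouseDownRequest) -> Vec<String> {
    let mut args = Vec::new();
    // Move only when both coordinates are supplied; a lone x or y is ignored here.
    if let (Some(x), Some(y)) = (req.x, req.y) {
        args.push("mousemove".to_string());
        args.push(x.to_string());
        args.push(y.to_string());
    }
    args.push("mousedown".to_string());
    // Assumption: default to button 1 (left) when none is given.
    args.push(req.button.unwrap_or(1).to_string());
    args
}
```

A mouse_up_args() counterpart would differ only in emitting mouseup. Whether a request with only one of x or y should be rejected instead of ignored is left open.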
Task 3: Screenshot compression
What: Add format, quality, and scale query params to screenshot endpoints.
Params: format (png|jpeg|webp, default png), quality (1-100, default 85), scale (0.1-1.0, default 1.0).
Files:
- server/packages/sandbox-agent/src/desktop_types.rs: Add DesktopScreenshotFormat enum. Add format, quality, scale fields to DesktopScreenshotQuery and DesktopRegionScreenshotQuery.
- server/packages/sandbox-agent/src/desktop_runtime.rs: After capturing PNG via import, pipe through ImageMagick convert if format != png or scale != 1.0: convert png:- -resize {scale*100}% -quality {quality} {format}:-. Add a run_command_with_stdin() helper (or modify existing run_command_output) to pipe bytes into a command's stdin.
- server/packages/sandbox-agent/src/router.rs: Modify screenshot handlers to pass format/quality/scale, return dynamic Content-Type header.
- sdks/typescript/src/client.ts: Update takeDesktopScreenshot() to accept format/quality/scale.
- docs/openapi.json: Regenerate.
Dependencies: ImageMagick convert already installed in Docker. Verify WebP delegate availability.
Test: Integration tests: request ?format=jpeg&quality=50, verify Content-Type: image/jpeg and JPEG magic bytes. Verify default still returns PNG. Verify ?scale=0.5 returns a smaller image.
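The parameter validation and convert argument construction might look roughly like the following. The function and method names are hypothetical. One deliberate deviation from the command in the plan: -quality is skipped for PNG output, because ImageMagick interprets that flag as compression settings for PNG rather than as a visual quality level.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DesktopScreenshotFormat {
    Png,
    Jpeg,
    Webp,
}

impl DesktopScreenshotFormat {
    /// Value for the dynamic Content-Type response header.
    pub fn content_type(self) -> &'static str {
        match self {
            Self::Png => "image/png",
            Self::Jpeg => "image/jpeg",
            Self::Webp => "image/webp",
        }
    }

    /// Format prefix used in ImageMagick's "fmt:-" stdout syntax.
    fn magick_name(self) -> &'static str {
        match self {
            Self::Png => "png",
            Self::Jpeg => "jpeg",
            Self::Webp => "webp",
        }
    }
}

/// Returns Ok(None) when the captured PNG can be served as-is,
/// otherwise the arguments to pass to `convert`.
pub fn convert_args(
    format: DesktopScreenshotFormat,
    quality: u8,
    scale: f32,
) -> Result<Option<Vec<String>>, String> {
    if !(1..=100).contains(&quality) {
        return Err(format!("quality must be 1-100, got {}", quality));
    }
    if !(0.1..=1.0).contains(&scale) {
        return Err(format!("scale must be 0.1-1.0, got {}", scale));
    }
    if format == DesktopScreenshotFormat::Png && scale == 1.0 {
        return Ok(None);
    }
    let mut args = vec!["png:-".to_string()];
    if scale != 1.0 {
        let percent = (scale * 100.0).round() as u32;
        args.push("-resize".to_string());
        args.push(format!("{}%", percent));
    }
    if format != DesktopScreenshotFormat::Png {
        args.push("-quality".to_string());
        args.push(quality.to_string());
    }
    args.push(format!("{}:-", format.magick_name()));
    Ok(Some(args))
}
```

Returning None for the default parameters keeps the existing PNG path free of an extra process spawn.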
Task 4: Window listing API
What: New endpoint to list open windows.
Endpoint: GET /v1/desktop/windows
Files:
- server/packages/sandbox-agent/src/desktop_types.rs: Add DesktopWindowInfo { id, title, x, y, width, height, is_active } and DesktopWindowListResponse.
- server/packages/sandbox-agent/src/desktop_runtime.rs: Add list_windows() method using xdotool (already installed):
  - xdotool search --onlyvisible --name "" → window IDs
  - xdotool getwindowname {id} + xdotool getwindowgeometry {id} per window
  - xdotool getactivewindow → is_active flag
  - Add parse_window_geometry() helper.
- server/packages/sandbox-agent/src/router.rs: Add route, handler, OpenAPI annotations.
- sdks/typescript/src/client.ts: Add listDesktopWindows().
- docs/openapi.json: Regenerate.
No new Docker dependencies — xdotool already installed.
Test: Integration test: start desktop, verify GET /v1/desktop/windows returns 200 with a list (may be empty if no GUI apps open, which is fine).
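A possible shape for parse_window_geometry(), assuming xdotool's default (non --shell) output with "Position: X,Y (screen: N)" and "Geometry: WxH" lines. The WindowGeometry struct is a hypothetical intermediate type, not one named in the plan.

```rust
/// Hypothetical intermediate result; the plan folds these fields into DesktopWindowInfo.
#[derive(Debug, PartialEq, Eq)]
pub struct WindowGeometry {
    pub x: i32,
    pub y: i32,
    pub width: u32,
    pub height: u32,
}

/// Parses the text printed by `xdotool getwindowgeometry {id}`.
/// Returns None if either the position or the size line is missing or malformed.
pub fn parse_window_geometry(output: &str) -> Option<WindowGeometry> {
    let mut position: Option<(i32, i32)> = None;
    let mut size: Option<(u32, u32)> = None;
    for line in output.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix("Position:") {
            // "10,38 (screen: 0)" -> keep only the "10,38" token.
            let coords = rest.split_whitespace().next()?;
            let (x, y) = coords.split_once(',')?;
            position = Some((x.parse::<i32>().ok()?, y.parse::<i32>().ok()?));
        } else if let Some(rest) = line.strip_prefix("Geometry:") {
            let (w, h) = rest.trim().split_once('x')?;
            size = Some((w.parse::<u32>().ok()?, h.parse::<u32>().ok()?));
        }
    }
    let (x, y) = position?;
    let (width, height) = size?;
    Some(WindowGeometry { x, y, width, height })
}
```

Positions are parsed as signed integers because windows can sit partially off-screen.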
Task 5: Unify desktop processes into process runtime with owner flag
What: Desktop processes (Xvfb, openbox, dbus) get registered in the general process runtime with an owner field, gaining log streaming, SSE, and unified lifecycle for free.
Files:
- server/packages/sandbox-agent/src/process_runtime.rs:
  - Add ProcessOwner enum: User, Desktop, System.
  - Add RestartPolicy enum: Never, Always, OnFailure.
  - Add owner: ProcessOwner and restart_policy: Option<RestartPolicy> to ProcessStartSpec, ManagedProcess, and ProcessSnapshot.
  - Modify list_processes() to accept optional owner filter.
  - Add auto-restart logic in watch_exit(): if restart_policy is Always (or OnFailure and exit code != 0), re-spawn the process using stored spec. Need to store the original ProcessStartSpec on ManagedProcess.
- server/packages/sandbox-agent/src/router/types.rs:
  - Add owner to ProcessInfo response.
  - Add ProcessListQuery { owner: Option<ProcessOwner> }.
- server/packages/sandbox-agent/src/router.rs:
  - Modify get_v1_processes to accept Query<ProcessListQuery> and filter.
  - Pass ProcessRuntime into DesktopRuntime::new().
  - Add ProcessOwner, RestartPolicy to OpenAPI schemas.
- server/packages/sandbox-agent/src/desktop_runtime.rs (major refactor):
  - Remove ManagedDesktopChild struct.
  - DesktopRuntime takes ProcessRuntime as constructor param.
  - start_xvfb_locked() and start_openbox_locked() call process_runtime.start_process(ProcessStartSpec { owner: Desktop, restart_policy: Some(Always), ... }) instead of spawning directly.
  - Store returned process IDs in state instead of Child handles.
  - stop calls process_runtime.stop_process() / kill_process().
  - processes_locked() queries process runtime for desktop-owned processes.
  - dbus-launch remains a direct one-shot spawn (it's not a long-running process, just produces env vars).
- sdks/typescript/src/client.ts: Add owner filter option to listProcesses().
- docs/openapi.json: Regenerate.
Risks:
- Lock ordering: desktop runtime holds Mutex, process runtime uses RwLock. Release desktop Mutex before calling process runtime, or restructure.
- log_path field in DesktopProcessInfo no longer applies (logs are in-memory now). Remove or deprecate.
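The lock-ordering mitigation could follow the pattern below: copy what is needed out of the desktop state, drop the guard, then call into the process runtime. This sketch uses std::sync primitives and invented stand-in types purely for illustration; the real runtimes may use async locks and different fields.

```rust
use std::sync::{Mutex, RwLock};

/// Stand-in for the desktop runtime state (hypothetical fields).
pub struct DesktopState {
    pub xvfb_id: Option<String>,
}

/// Stand-in for the process runtime (hypothetical fields).
pub struct ProcessRuntime {
    pub processes: RwLock<Vec<String>>,
}

impl ProcessRuntime {
    /// Removes the process with the given id; returns whether it existed.
    pub fn stop_process(&self, id: &str) -> bool {
        let mut procs = self.processes.write().unwrap();
        let before = procs.len();
        procs.retain(|p| p != id);
        procs.len() != before
    }
}

/// Stops the desktop's Xvfb process without holding both locks at once.
pub fn stop_desktop(state: &Mutex<DesktopState>, runtime: &ProcessRuntime) -> bool {
    // Take the id inside a short scope so the Mutex guard is dropped
    // before the process runtime's RwLock is acquired.
    let id = {
        let mut guard = state.lock().unwrap();
        guard.xvfb_id.take()
    };
    match id {
        Some(id) => runtime.stop_process(&id),
        None => false,
    }
}
```

Because the two locks are never held together, no acquisition order between them has to be enforced.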
Test: Integration: start desktop, GET /v1/processes?owner=desktop returns Xvfb+openbox. GET /v1/processes?owner=user excludes them. Desktop process logs are streamable via GET /v1/processes/{id}/logs?follow=true. Existing desktop lifecycle tests still pass.
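The restart decision and the owner filter reduce to two small pure functions, sketched here. The function names and the trimmed-down ProcessSnapshot are hypothetical, and treating a missing exit code (for example, death by signal) as a failure is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessOwner {
    User,
    Desktop,
    System,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicy {
    Never,
    Always,
    OnFailure,
}

/// Decides whether watch_exit() should re-spawn a process that just exited.
pub fn should_restart(policy: Option<RestartPolicy>, exit_code: Option<i32>) -> bool {
    match policy {
        Some(RestartPolicy::Always) => true,
        // Assumption: no exit code (killed by signal) counts as a failure.
        Some(RestartPolicy::OnFailure) => exit_code != Some(0),
        Some(RestartPolicy::Never) | None => false,
    }
}

/// Trimmed-down snapshot with only the fields this sketch needs.
pub struct ProcessSnapshot {
    pub id: String,
    pub owner: ProcessOwner,
}

/// Applies the optional owner filter; None returns every process.
pub fn filter_by_owner(
    all: &[ProcessSnapshot],
    owner: Option<ProcessOwner>,
) -> Vec<&ProcessSnapshot> {
    all.iter()
        .filter(|p| owner.map_or(true, |o| p.owner == o))
        .collect()
}
```

A real implementation would likely also need to suppress the restart when the exit was requested through stop_process() or kill_process(), otherwise an Always policy would fight an explicit stop.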
Task 6: Screen recording API (ffmpeg x11grab)
What: 6 endpoints for recording the desktop to MP4.
Endpoints:
- POST /v1/desktop/recording/start: Start ffmpeg recording
- POST /v1/desktop/recording/stop: Stop recording (SIGTERM → wait → SIGKILL)
- GET /v1/desktop/recordings: List recordings
- GET /v1/desktop/recordings/{id}: Get recording metadata
- GET /v1/desktop/recordings/{id}/download: Serve MP4 file
- DELETE /v1/desktop/recordings/{id}: Delete recording
Files:
- server/packages/sandbox-agent/src/desktop_recording.rs (new): Recording state, ffmpeg process management. start_recording() spawns ffmpeg via process runtime (owner=Desktop): ffmpeg -f x11grab -video_size WxH -i :99 -c:v libx264 -preset ultrafast -r 30 {path}. Recordings stored in {state_dir}/recordings/.
- server/packages/sandbox-agent/src/desktop_types.rs: Add recording request/response types.
- server/packages/sandbox-agent/src/desktop_runtime.rs: Wire recording manager, expose through desktop runtime.
- server/packages/sandbox-agent/src/router.rs: Add 6 routes + handlers.
- server/packages/sandbox-agent/src/desktop_install.rs: Add ffmpeg to dependency detection (soft: only error when recording is requested).
- docker/runtime/Dockerfile and docker/test-agent/Dockerfile: Add ffmpeg to apt-get.
- sdks/typescript/src/client.ts: Add 6 recording methods.
- docs/openapi.json: Regenerate.
Depends on: Task 5 (ffmpeg runs as desktop-owned process).
Test: Integration: start desktop → start recording → wait 2s → stop → list → download (verify MP4 magic bytes) → delete.
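The ffmpeg invocation from the plan and the "MP4 magic bytes" check in the test could be sketched as follows. Both function names are hypothetical. The check relies on MP4 files normally starting with a box whose type field, at byte offset 4, is "ftyp".

```rust
/// Builds the ffmpeg argument list from the command given in the plan.
pub fn recording_args(display: &str, width: u32, height: u32, path: &str) -> Vec<String> {
    let size = format!("{}x{}", width, height);
    [
        "-f", "x11grab",
        "-video_size", size.as_str(),
        "-i", display,
        "-c:v", "libx264",
        "-preset", "ultrafast",
        "-r", "30",
        path,
    ]
    .iter()
    .map(|s| s.to_string())
    .collect()
}

/// Cheap sanity check for a downloaded recording:
/// bytes 4..8 of an MP4 file are expected to read "ftyp".
pub fn looks_like_mp4(bytes: &[u8]) -> bool {
    bytes.get(4..8) == Some(&b"ftyp"[..])
}
```

Passing the display and resolution as parameters avoids hard-coding :99, in case the Xvfb display number ever becomes configurable.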
Task 7: Neko WebRTC desktop streaming + React component
What: Integrate neko for WebRTC desktop streaming, mirroring the ProcessTerminal + Ghostty pattern.
Server side
- server/packages/sandbox-agent/src/desktop_streaming.rs (new): Manages neko process via process runtime (owner=Desktop). Neko connects to existing Xvfb display, runs GStreamer pipeline for H.264 encoding.
- server/packages/sandbox-agent/src/router.rs:
  - GET /v1/desktop/stream/ws: WebSocket proxy to neko's internal WebSocket. Upgrade request, bridge bidirectionally.
  - POST /v1/desktop/stream/start / POST /v1/desktop/stream/stop: Lifecycle control.
- docker/runtime/Dockerfile and docker/test-agent/Dockerfile: Add neko binary + GStreamer packages (gstreamer1.0-plugins-base, gstreamer1.0-plugins-good, gstreamer1.0-x, libgstreamer1.0-0). Consider making this an optional Docker stage to avoid bloating the base image.
TypeScript SDK
- sdks/typescript/src/desktop-stream.ts (new): DesktopStreamSession class ported from neko's base.ts (~500 lines):
  - WebSocket for signaling (SDP offer/answer, ICE candidates)
  - RTCPeerConnection for video stream
  - RTCDataChannel for binary input (mouse: 7 bytes, keyboard: 11 bytes)
  - Events: onTrack(stream), onConnect(), onDisconnect(), onError()
- sdks/typescript/src/client.ts: Add connectDesktopStream() returning DesktopStreamSession, buildDesktopStreamWebSocketUrl(), startDesktopStream(), stopDesktopStream().
- sdks/typescript/src/index.ts: Export DesktopStreamSession.
React SDK
- sdks/react/src/DesktopViewer.tsx (new): Following ProcessTerminal.tsx pattern:
  - Props: client (Pick<SandboxAgent, 'connectDesktopStream'>), height, className, style, onConnect, onDisconnect, onError
  - useEffect → client.connectDesktopStream() → wire onTrack to <video>.srcObject
  - Capture mouse events on video element → scale coordinates to desktop resolution → send via DataChannel
  - Capture keyboard events → send via DataChannel
  - Connection state indicator
  - Cleanup: close RTCPeerConnection, close WebSocket
- sdks/react/src/index.ts: Export DesktopViewer.
Depends on: Task 5 (neko runs as desktop-owned process).
Test: Server integration: start stream, connect WebSocket, verify signaling messages flow. React: component mounts/unmounts without errors. Full E2E requires browser (manual initially).
Verification
After all tasks:
- cargo test: All Rust unit tests pass
- cargo test --test v1_api: All integration tests pass (requires Docker)
- Regenerate docs/openapi.json and verify it reflects all new endpoints
- Build TypeScript SDK: cd sdks/typescript && pnpm build
- Build React SDK: cd sdks/react && pnpm build
- Manual: start desktop, take JPEG screenshot, list windows, record 5s video, stream desktop via DesktopViewer component