sandbox-agent/research/desktop-streaming-architecture.md
Nathan Flurry 2d8508d6e2 feat: enhance desktop computer-use streaming with neko integration
Improve desktop streaming architecture, add inspector dev tooling,
React DesktopViewer updates, and computer-use documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 02:36:25 -07:00

Desktop Streaming Architecture

Decision: neko over GStreamer (direct) and VNC

We evaluated three approaches for streaming the virtual desktop to browser clients:

  1. VNC (noVNC/websockify) - traditional remote desktop
  2. GStreamer WebRTC (direct) - custom GStreamer pipeline in the sandbox agent process
  3. neko - standalone WebRTC streaming server with its own GStreamer pipeline

We chose neko.

Approach comparison

VNC (noVNC)

  • Uses the RFB protocol, not WebRTC. Relies on pixel-diff framebuffer updates sent over a WebSocket.
  • Higher latency than WebRTC (no hardware-accelerated codec, no adaptive bitrate).
  • Requires a VNC server (x11vnc or similar) plus websockify for browser access.
  • Input handling is mature but tied to the RFB protocol.
  • No audio support without additional plumbing.

Rejected because: Latency is noticeably worse than WebRTC-based approaches. The pixel-diff approach doesn't scale well at higher resolutions or frame rates. No native audio path.
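To make the scaling concern concrete, here is a back-of-the-envelope sketch of the worst case for framebuffer updates: every pixel changes each frame and is sent uncompressed. This is an illustrative upper bound, not a measurement of any VNC server; real RFB encodings compress and send only changed regions. The function name and the 4-bytes-per-pixel assumption are ours.

```rust
/// Worst-case bytes per second if the whole framebuffer changes every frame
/// and is sent without compression (illustrative upper bound only).
fn raw_update_bytes_per_sec(width: u64, height: u64, bytes_per_pixel: u64, fps: u64) -> u64 {
    width * height * bytes_per_pixel * fps
}

/// Same figure in megabytes per second (1 MB = 1,000,000 bytes).
fn raw_update_mb_per_sec(width: u64, height: u64, bytes_per_pixel: u64, fps: u64) -> f64 {
    raw_update_bytes_per_sec(width, height, bytes_per_pixel, fps) as f64 / 1e6
}
```

For 1920x1080 at 4 bytes per pixel and 30 fps this bound is 248,832,000 bytes/s, which shows why a codec-based path becomes more attractive as resolution and frame rate grow.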

GStreamer WebRTC (direct)

  • Custom pipeline: ximagesrc -> videoconvert -> vp8enc -> rtpvp8pay -> webrtcbin.
  • Runs inside the sandbox agent Rust process using gstreamer-rs bindings.
  • Requires feature-gating (desktop-gstreamer Cargo feature) and linking GStreamer at compile time.
  • ICE candidate handling is complex: Docker-internal IPs (172.17.x.x) must be rewritten to 127.0.0.1 for host browser connectivity.
  • UDP port range must be constrained via libnice NiceAgent properties to stay within Docker-forwarded ports.
  • Input must be implemented separately (xdotool or custom X11 input injection).
  • No built-in session management, authentication, or multi-client support.

Rejected because: Too much complexity for the sandbox agent to own directly. ICE/NAT traversal bugs are hard to debug. The GStreamer Rust bindings add significant compile-time dependencies. Input handling requires a separate implementation. We built and tested this approach (branch desktop-computer-use, PR #226) and found:

  • Black screen issues due to GStreamer pipeline negotiation failures
  • ICE candidate rewriting fragility across Docker networking modes
  • libnice port range configuration requires accessing internal NiceAgent properties that vary across GStreamer versions
  • No data channel for low-latency input (we had to fall back to WebSocket-based input, which adds a round trip)
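The ICE candidate rewriting mentioned above can be sketched roughly as follows. This is not the code from the desktop-computer-use branch; it is a minimal illustration that assumes candidates arrive as standard SDP candidate strings (address in the fifth space-separated field) and that only Docker's default 172.17.x.x bridge range needs rewriting. The function name is hypothetical.

```rust
use std::net::Ipv4Addr;

/// Rewrites the connection address of an ICE candidate line to `replacement`
/// when it is a Docker-internal 172.17.x.x address. Other candidates are
/// returned unchanged.
///
/// Assumed layout: candidate:<foundation> <component> <transport> <priority>
/// <address> <port> typ <type> ...
fn rewrite_candidate(candidate: &str, replacement: Ipv4Addr) -> String {
    let mut fields: Vec<String> = candidate.split(' ').map(str::to_owned).collect();

    // Field index 4 holds the connection address; non-IPv4 values are skipped.
    let addr = fields.get(4).and_then(|f| f.parse::<Ipv4Addr>().ok());

    if let Some(addr) = addr {
        let octets = addr.octets();
        if octets[0] == 172 && octets[1] == 17 {
            fields[4] = replacement.to_string();
        }
    }
    fields.join(" ")
}
```

Even this small version hints at the fragility noted above: it hard-codes one bridge subnet and so would miss containers on user-defined Docker networks.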

neko (chosen)

  • Standalone Go binary extracted from ghcr.io/m1k1o/neko/base.
  • Has its own GStreamer pipeline internally (same ximagesrc -> vp8enc -> webrtcbin approach, but battle-tested).
  • Provides WebSocket signaling, WebRTC media, and a binary data channel for input, all out of the box.
  • Input via data channel is low-latency (sub-frame, no HTTP round trip). Uses X11 XTEST extension.
  • Multi-session support with noauth provider (each browser tab gets its own session).
  • ICE-lite mode with --webrtc.nat1to1 127.0.0.1 eliminates NAT traversal issues for Docker-to-host.
  • EPR (ephemeral port range) flag constrains UDP ports cleanly.
  • The sandbox agent acts as a thin WebSocket proxy: the browser's WebSocket connects to the sandbox agent, which creates a per-connection neko login session and relays signaling messages in both directions.
  • Audio codec support (opus) included for free.
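To give a feel for why data channel input is cheap, here is a sketch of encoding one pointer-move message in the shape the architecture diagram in this document describes (opcode + payload, big-endian). The opcode value, the length prefix, and the field layout are assumptions made for illustration; neko's actual wire format should be taken from its source.

```rust
/// Hypothetical opcode for a pointer-move event (not taken from neko).
const OP_MOVE: u8 = 0x01;

/// Encodes an assumed layout:
/// [opcode: u8][payload length: u16 BE][x: u16 BE][y: u16 BE]
fn encode_mouse_move(x: u16, y: u16) -> Vec<u8> {
    // Payload: the two coordinates in big-endian byte order.
    let mut payload = Vec::with_capacity(4);
    payload.extend_from_slice(&x.to_be_bytes());
    payload.extend_from_slice(&y.to_be_bytes());

    // Header: opcode followed by the payload length.
    let mut msg = Vec::with_capacity(3 + payload.len());
    msg.push(OP_MOVE);
    msg.extend_from_slice(&(payload.len() as u16).to_be_bytes());
    msg.extend_from_slice(&payload);
    msg
}
```

Under these assumptions a pointer move is seven bytes sent directly over the existing WebRTC connection, with no HTTP request or JSON parsing in the path.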

Chosen because: Neko encapsulates all the hard WebRTC/GStreamer/input complexity into a single binary. The sandbox agent only needs to:

  1. Manage the neko process lifecycle (start/stop via the process runtime)
  2. Proxy WebSocket signaling (bidirectional relay, ~60 lines of code)
  3. Handle neko session creation (HTTP login to get a session cookie)

This keeps the sandbox agent's desktop streaming code simple (~300 lines for the manager, ~120 lines for the WS proxy) while delivering production-quality WebRTC streaming with data channel input.
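The session-creation step can be sketched as below: one unique username per proxied connection, plus the body of the login request. The username prefix, the JSON field names, and the helper names are assumptions for illustration, not the manager's actual code. The body builder does no JSON escaping, so it assumes inputs without quotes or backslashes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter used to keep usernames distinct.
static NEXT_SESSION: AtomicU64 = AtomicU64::new(1);

/// Returns a username that is unique within this process
/// (the "sandbox-" prefix is hypothetical).
fn next_session_username() -> String {
    let n = NEXT_SESSION.fetch_add(1, Ordering::Relaxed);
    format!("sandbox-{n}")
}

/// Builds a JSON login body. Field names are assumed; inputs must not
/// contain characters that need JSON escaping.
fn login_body(username: &str, password: &str) -> String {
    format!(r#"{{"username":"{username}","password":"{password}"}}"#)
}
```

An atomic counter is enough here because uniqueness only has to hold among sessions created by one sandbox agent process against its own neko instance.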

Architecture

Browser                    Sandbox Agent                         neko (internal)
  |                            |                                     |
  |-- WS /stream/signaling --> |-- WS ws://127.0.0.1:18100/api/ws -->|
  |                            |   (bidirectional relay)             |
  |<-- neko signaling ---------|<-- neko signaling ------------------|
  |                            |                                     |
  |<========= WebRTC (UDP 59000-59100) =============================>|
  |   VP8 video, Opus audio, binary data channel                     |
  |                                                                  |
  |-- data channel input (mouse/keyboard) -------------------------->|
  |   (binary protocol: opcode + payload, big-endian)                |

Key points:

  • neko listens on internal port 18100 (not exposed externally).
  • UDP ports 59000-59100 are forwarded through Docker for WebRTC media.
  • --webrtc.icelite + --webrtc.nat1to1 127.0.0.1 means neko advertises 127.0.0.1 as its ICE candidate, so the browser connects to localhost UDP ports directly.
  • --desktop.input.enabled=false disables neko's custom xf86-input driver (not available outside neko's official Docker images). Input falls back to XTEST.
  • Each WebSocket proxy connection creates a fresh neko login session with a unique username to avoid session conflicts when multiple clients connect.
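As a sketch of how the flags in the key points might be assembled when the sandbox agent launches neko, the snippet below builds a `std::process::Command` without spawning it. Only flags spelled out in this document are included; any subcommand, the listen address, and the EPR flag are left out because their exact names are not given here. The function name and binary path parameter are hypothetical.

```rust
use std::process::Command;

/// Builds (but does not spawn) a neko command with the flags named above.
fn neko_command(binary: &str) -> Command {
    let mut cmd = Command::new(binary);
    cmd.args([
        // Advertise a single fixed candidate instead of doing full ICE.
        "--webrtc.icelite",
        "--webrtc.nat1to1",
        "127.0.0.1",
        // Skip neko's custom input driver so input falls back to XTEST.
        "--desktop.input.enabled=false",
    ]);
    cmd
}
```

Keeping the flag list in one function makes it easy to log or assert on the exact command line before the process runtime starts it.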

Trade-offs

| Concern | neko | GStreamer direct |
|---|---|---|
| Binary size | ~30MB additional binary | ~0 (uses system GStreamer libs) |
| Compile-time deps | None (external binary) | gstreamer-rs crate + GStreamer dev libs |
| Input latency | Sub-frame (data channel) | WebSocket round trip |
| ICE/NAT complexity | Handled by neko flags | Must implement in Rust |
| Multi-client | Built-in session management | Must implement |
| Maintenance | Upstream neko updates | Own all the code |
| Audio | Built-in (opus) | Must add audio pipeline |

The main trade-off is the additional ~30MB binary size from neko. This is acceptable for the Docker-based deployment model where image size is less critical than reliability and development velocity.

References