SDK: Add ensureServer() for automatic server recovery (#260)

* SDK sandbox provisioning: built-in providers, docs restructure, and quickstart overhaul

- Add built-in sandbox providers (local, docker, e2b, daytona, vercel, cloudflare) to the TypeScript SDK so users import directly instead of passing client instances
- Restructure docs: rename architecture to orchestration-architecture, add new architecture page for server overview, improve getting started flow
- Rewrite quickstart to be TypeScript-first with provider CodeGroup and custom provider accordion
- Update all examples to use new provider APIs
- Update persist drivers and foundry for new SDK surface

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix SDK typecheck errors and update persist drivers for insertEvent signature

- Fix insertEvent call in client.ts to pass sessionId as first argument
- Update Daytona provider create options to use Partial type (image has default)
- Update StrictUniqueSessionPersistDriver in tests to match new insertEvent signature
- Sync persist packages, openapi spec, and docs with upstream changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add Modal and ComputeSDK built-in providers, update examples and docs

- Add `sandbox-agent/modal` provider using Modal SDK with node:22-slim image
- Add `sandbox-agent/computesdk` provider using ComputeSDK's unified sandbox API
- Update Modal and ComputeSDK examples to use new SDK providers
- Update Modal and ComputeSDK deploy docs with provider-based examples
- Add Modal to quickstart CodeGroup and docs.json navigation
- Add provider test entries for Modal and ComputeSDK
- Remove old standalone example files (modal.ts, computesdk.ts)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix Modal provider: pre-install agents in image, fire-and-forget exec for server

- Pre-install agents in Dockerfile commands so they are cached across creates
- Use fire-and-forget exec (no wait) to keep server alive in Modal sandbox
- Add memoryMiB option (default 2GB) to avoid OOM during agent install

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Sync upstream changes: multiplayer docs, logos, openapi spec, foundry config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* SDK: Add ensureServer() for automatic server recovery

Add ensureServer() to SandboxProvider interface to handle cases where the sandbox-agent server stops or goes to sleep. The SDK now calls this method after 3 consecutive health-check failures, allowing providers to restart the server if needed.

Most built-in providers (E2B, Daytona, Vercel, Modal, ComputeSDK) implement this. Docker and Cloudflare manage server lifecycle differently, and Local uses managed child processes.

Also update docs for quickstart, architecture, multiplayer, and session persistence; mark persist-* packages as deprecated; and add ensureServer implementations to all applicable providers.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

* wip

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 09:01:17 +00:00 · 2026-03-15 20:29:28 -07:00 · 2026-03-15 20:29:28 -07:00 · cf7e2a92c6
commit cf7e2a92c6
parent 3426cbc6ec
112 changed files with 3739 additions and 3537 deletions
--- a/docs/architecture.mdx
+++ b/docs/architecture.mdx
@@ -1,64 +1,63 @@
 ---
 title: "Architecture"
-description: "How the client, sandbox, server, and agent fit together."
-icon: "microchip"
+description: "How the Sandbox Agent server, SDK, and agent processes fit together."
 ---

-Sandbox Agent runs as an HTTP server inside your sandbox. Your app talks to it remotely.
+Sandbox Agent is a lightweight HTTP server that runs **inside** a sandbox. It provides:
+
+- **Agent management**: Installs, spawns, and stops coding agent processes
+- **Sessions**: Routes prompts to agents and streams events back in real time
+- **Sandbox APIs**: Filesystem, process, and terminal access for the sandbox environment

 ## Components

- `Your client`: your app code using the `sandbox-agent` SDK.
- `Sandbox`: isolated runtime (E2B, Daytona, Docker, etc.).
- `Sandbox Agent server`: process inside the sandbox exposing HTTP transport.
- `Agent`: Claude/Codex/OpenCode/Amp process managed by Sandbox Agent.
-
-```mermaid placement="top-right"
-  flowchart LR
-    CLIENT["Sandbox Agent SDK"]
-    SERVER["Sandbox Agent server"]
-    AGENT["Agent process"]
+```mermaid
+flowchart LR
+    CLIENT["Your App"]

    subgraph SANDBOX["Sandbox"]
-      direction TB
-      SERVER --> AGENT
+        direction TB
+        SERVER["Sandbox Agent Server"]
+        AGENT["Agent Process<br/>(Claude, Codex, etc.)"]
+        SERVER --> AGENT
    end

-    CLIENT -->|HTTP| SERVER
+    CLIENT -->|"SDK (HTTP)"| SERVER
 ```

-## Suggested Topology
+- **Your app**: Uses the `sandbox-agent` TypeScript SDK to talk to the server over HTTP.
+- **Sandbox**: An isolated runtime (local process, Docker, E2B, Daytona, Vercel, Cloudflare).
+- **Sandbox Agent server**: A single binary inside the sandbox that manages agent lifecycles, routes prompts, streams events, and exposes filesystem/process/terminal APIs.
+- **Agent process**: A coding agent (Claude Code, Codex, etc.) spawned by the server. Each session maps to one agent process.

-Run the SDK on your backend, then call it from your frontend.
+## What `SandboxAgent.start()` does

-This extra hop is recommended because it keeps auth/token logic on the backend and makes persistence simpler.
+1. **Provision**: The provider creates a sandbox (starts a container, creates a VM, etc.)
+2. **Install**: The Sandbox Agent binary is installed inside the sandbox
+3. **Boot**: The server starts listening on an HTTP port
+4. **Health check**: The SDK waits for `/v1/health` to respond
+5. **Ready**: The SDK returns a connected client

-```mermaid placement="top-right"
-  flowchart LR
-    BROWSER["Browser"]
-    subgraph BACKEND["Your backend"]
-      direction TB
-      SDK["Sandbox Agent SDK"]
-    end
-    subgraph SANDBOX_SIMPLE["Sandbox"]
-      SERVER_SIMPLE["Sandbox Agent server"]
-    end
+For the `local` provider, provisioning is a no-op and the server runs as a local subprocess.

-    BROWSER --> BACKEND
-    BACKEND --> SDK --> SERVER_SIMPLE
+### Server recovery
+
+If the server process stops, the SDK automatically calls the provider's `ensureServer()` after 3 consecutive health-check failures. Most built-in providers implement this. Custom providers can add `ensureServer(sandboxId)` to their `SandboxProvider` object.
+
+## Server HTTP API
+
+See the [HTTP API reference](/api-reference) for the full list of server endpoints.
+
+## Agent installation
+
+Agents are installed lazily on first use. To avoid the cold-start delay, pre-install them:
+
+```bash
+sandbox-agent install-agent --all
 ```

-### Backend requirements
+The `rivetdev/sandbox-agent:0.3.2-full` Docker image ships with all agents pre-installed.

-Your backend layer needs to handle:
+## Production-ready agent orchestration

- **Long-running connections**: prompts can take minutes.
- **Session affinity**: follow-up messages must reach the same session.
- **State between requests**: session metadata and event history must persist across requests.
- **Graceful recovery**: sessions should resume after backend restarts.
-
-We recommend [Rivet](https://rivet.dev) over serverless because actors natively support the long-lived connections, session routing, and state persistence that agent workloads require.
-
-## Session persistence
-
-For storage driver options and replay behavior, see [Persisting Sessions](/session-persistence).
+For production deployments, see [Orchestration Architecture](/orchestration-architecture) for recommended topology, backend requirements, and session persistence patterns.
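
The recovery rule described in the commit message and in the new "Server recovery" section (call `ensureServer()` after 3 consecutive health-check failures) can be sketched roughly as below. This is only an illustration under stated assumptions: `RecoverableProvider`, `FailureTracker`, `checkOnce`, and the `probe` callback are hypothetical names introduced here, not the SDK's actual internals, and resetting the counter after a recovery attempt is a design choice of this sketch rather than confirmed behavior.

```typescript
// Hypothetical sketch: all names here are illustrative, not the SDK's real internals.

/** Minimal provider shape assumed for this sketch. */
interface RecoverableProvider {
  // Optional hook: restart the server inside the sandbox if it has stopped.
  ensureServer?(sandboxId: string): Promise<void>;
}

/** Counts consecutive health-check failures and signals when the threshold is hit. */
class FailureTracker {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  /** Records one health-check result; returns true when recovery should be attempted. */
  record(healthy: boolean): boolean {
    if (healthy) {
      this.consecutiveFailures = 0; // any success breaks the failure streak
      return false;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.consecutiveFailures = 0; // assumption: start a fresh count after recovery
      return true;
    }
    return false;
  }
}

/** Runs one health probe and calls ensureServer() if the threshold was just reached. */
async function checkOnce(
  probe: () => Promise<boolean>,
  tracker: FailureTracker,
  provider: RecoverableProvider,
  sandboxId: string,
): Promise<boolean> {
  let healthy = false;
  try {
    healthy = await probe();
  } catch {
    healthy = false; // a thrown error counts as a failed check
  }
  const shouldRecover = tracker.record(healthy);
  if (shouldRecover && provider.ensureServer) {
    await provider.ensureServer(sandboxId);
  }
  return shouldRecover;
}
```

In this sketch a provider that omits `ensureServer` is simply skipped, which mirrors the diff's statement that only most built-in providers implement the hook.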