Mirror of https://github.com/harivansh-afk/sandbox-agent.git (synced 2026-04-15 05:02:11 +00:00)
# Proposal: RivetKit Sandbox Actor Resilience

## Context

The RivetKit sandbox actor (`src/sandbox/actor.ts`) does not handle the case where the underlying cloud sandbox (e.g. an E2B VM) is destroyed while the actor is still alive. This causes cascading 500 errors when the actor tries to call the dead sandbox. Additionally, a UNIQUE constraint bug in event persistence crashes the host process.

The sandbox-agent repo (which defines the E2B provider) will be updated separately to use `autoPause` and expose `pause()` and typed errors. This proposal covers the RivetKit-side changes needed to handle those signals.
## Changes

### 1. Fix `persistObservedEnvelope` UNIQUE constraint crash

**File:** `insertEvent` in the sandbox actor's SQLite persistence layer

The `sandbox_agent_events` table has a UNIQUE constraint on `(session_id, event_index)`. When the same event is observed twice (reconnection, replay, duplicate WebSocket delivery), the insert throws, and the unhandled rejection crashes the host process.

**Fix:** Change the INSERT to `INSERT OR IGNORE` (equivalently, `ON CONFLICT DO NOTHING`). Duplicate events are expected and harmless; they should be silently deduplicated at the persistence layer.
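The intended semantics can be sketched without SQLite: an insert keyed by `(session_id, event_index)` silently ignores duplicates instead of throwing. This in-memory stand-in is illustrative only; the real fix is the one-line SQL change shown in the comment.

```typescript
// Sketch (not the repo's actual code) of idempotent event persistence.
// The real fix is at the SQL level, roughly:
//   INSERT OR IGNORE INTO sandbox_agent_events (session_id, event_index, payload)
//   VALUES (?, ?, ?);
// A duplicate (session_id, event_index) pair is silently dropped, never thrown.

type Envelope = { sessionId: string; eventIndex: number; payload: string };

const seen = new Map<string, Envelope>();

/** Returns true if the event was newly persisted, false if deduplicated. */
function insertEvent(e: Envelope): boolean {
  const key = `${e.sessionId}:${e.eventIndex}`;
  if (seen.has(key)) return false; // duplicate: ignore, do not throw
  seen.set(key, e);
  return true;
}

console.log(insertEvent({ sessionId: "s1", eventIndex: 0, payload: "a" })); // true
console.log(insertEvent({ sessionId: "s1", eventIndex: 0, payload: "a" })); // false
```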
### 2. Handle destroyed sandbox in `ensureAgent()`

**File:** `src/sandbox/actor.ts` — `ensureAgent()` function

When the provider's `start()` is called with an existing `sandboxId` and the sandbox no longer exists, the provider throws a typed `SandboxDestroyedError` (defined in the sandbox-agent provider contract).

`ensureAgent()` should catch this error and consult the `onSandboxExpired` config option:
```typescript
// New config option on sandboxActor()
onSandboxExpired?: "destroy" | "recreate"; // default: "destroy"
```
**`"destroy"` (default):**

- Set `state.sandboxDestroyed = true`
- Emit a `sandboxExpired` event to all connected clients
- All subsequent action calls (`runProcess`, `createSession`, etc.) return a clear error: "Sandbox has expired. Create a new task to continue."
- The sandbox actor stays alive (preserving session history and the audit log) but rejects new work

**`"recreate"`:**

- Call the provider's `create()` to provision a fresh sandbox
- Store the new `sandboxId` in state
- Emit a `sandboxRecreated` event to connected clients with a notice that sessions are lost (new VM, no prior state)
- Resume normal operation with the new sandbox
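The two branches above might be wired up along these lines. This is a hedged sketch: `SandboxDestroyedError`, `onSandboxExpired`, and the event names come from this proposal, while the `Provider` and `Ctx` shapes are hypothetical stand-ins for the real actor types.

```typescript
// Illustrative sketch only; the real ensureAgent() lives in src/sandbox/actor.ts
// and its surrounding types are not shown in this proposal.

class SandboxDestroyedError extends Error {} // from the provider contract

interface Provider {
  start(opts: { sandboxId?: string }): Promise<void>;
  create(): Promise<{ sandboxId: string }>;
}

interface Ctx {
  provider: Provider;
  state: { sandboxId?: string; sandboxDestroyed?: boolean };
  broadcast(event: string, payload: unknown): void;
}

async function ensureAgent(
  c: Ctx,
  onSandboxExpired: "destroy" | "recreate" = "destroy",
): Promise<void> {
  try {
    await c.provider.start({ sandboxId: c.state.sandboxId });
  } catch (err) {
    if (!(err instanceof SandboxDestroyedError)) throw err; // never swallow other errors
    if (onSandboxExpired === "destroy") {
      c.state.sandboxDestroyed = true;
      c.broadcast("sandboxExpired", {}); // surface the expiry to all clients
    } else {
      const { sandboxId } = await c.provider.create(); // fresh VM, no prior state
      c.state.sandboxId = sandboxId;
      c.broadcast("sandboxRecreated", { sessionsLost: true });
    }
  }
}
```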
### 3. Expose `pause` action

**File:** `src/sandbox/actor.ts` — actions

Add a `pause` action that delegates to the provider's `pause()` method. This is user-initiated only (e.g. the user clicks "Pause sandbox" in the UI to save credits). The sandbox actor should never auto-pause.
```typescript
async pause(c) {
  // User-initiated only; never called automatically.
  await c.provider.pause();
  c.state.sandboxPaused = true;
  c.broadcast("sandboxPaused", {});
}
```
### 4. Expose `resume` action

**File:** `src/sandbox/actor.ts` — actions

Add a `resume` action for explicit recovery. It calls `provider.start({ sandboxId: state.sandboxId })`, which auto-resumes the sandbox if it is paused.
```typescript
async resume(c) {
  await ensureAgent(c); // handles reconnect (and auto-resume) internally
  c.state.sandboxPaused = false;
  c.broadcast("sandboxResumed", {});
}
```
### 5. Keep-alive while sessions are active

**File:** `src/sandbox/actor.ts`

While the sandbox actor has connected WebSocket clients, periodically extend the underlying sandbox TTL to prevent it from being garbage collected mid-session.

- On first client connect: start a keep-alive interval (e.g. every 2 minutes)
- Each tick: call `provider.extendTimeout(extensionMs)` (the provider maps this to `sandbox.setTimeout()` for E2B)
- On last client disconnect: clear the interval and let the sandbox idle toward its natural timeout

This prevents the common case where a user is actively working but the sandbox expires because the E2B default timeout (5 min) is too short. The `timeoutMs` in create options is the initial TTL; keep-alive extends it dynamically.
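The connect/disconnect bookkeeping might look like the following. The 2-minute interval comes from this proposal; the `KeepAlive` class, hook names, and `EXTENSION_MS` value are hypothetical.

```typescript
// Sketch of the keep-alive loop; actual actor lifecycle hooks are assumptions.

const KEEP_ALIVE_INTERVAL_MS = 2 * 60 * 1000; // tick every 2 minutes
const EXTENSION_MS = 10 * 60 * 1000;          // illustrative extension window

interface KeepAliveProvider {
  extendTimeout(ms: number): Promise<void>;
}

class KeepAlive {
  private timer?: ReturnType<typeof setInterval>;
  private clients = 0;

  constructor(private provider: KeepAliveProvider) {}

  /** True while the keep-alive interval is running. */
  get active(): boolean {
    return this.timer !== undefined;
  }

  onConnect(): void {
    if (this.clients++ === 0) {
      // First client: start extending the sandbox TTL on every tick.
      this.timer = setInterval(() => {
        void this.provider.extendTimeout(EXTENSION_MS);
      }, KEEP_ALIVE_INTERVAL_MS);
    }
  }

  onDisconnect(): void {
    if (--this.clients === 0 && this.timer !== undefined) {
      // Last client gone: let the sandbox idle toward its natural timeout.
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }
}
```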
## Key invariant

**Never silently fail.** Every destroyed, expired, or error state must be surfaced to connected clients via events. The actor must always tell the UI what happened so the user can act on it. See the "never silently catch errors" rule in CLAUDE.md.
## Dependencies

These changes depend on the sandbox-agent provider contract exposing:

- a `pause()` method
- an `extendTimeout(ms)` method
- a typed `SandboxDestroyedError` thrown from `start()` when the sandbox is gone
- `start()` auto-resuming paused sandboxes via `Sandbox.connect(sandboxId)`
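Collected as a single TypeScript interface for reference. This is a sketch of the contract surface this proposal assumes; the exact names and signatures in sandbox-agent may differ.

```typescript
// Sketch of the assumed sandbox-agent provider contract; signatures hypothetical.

/** Thrown by start() when the underlying sandbox no longer exists. */
class SandboxDestroyedError extends Error {}

interface SandboxProvider {
  /** Provisions a fresh sandbox; timeoutMs is the initial TTL. */
  create(opts?: { timeoutMs?: number }): Promise<{ sandboxId: string }>;
  /** Connects to an existing sandbox, auto-resuming it if paused
   *  (e.g. via E2B's Sandbox.connect(sandboxId)); throws
   *  SandboxDestroyedError if the sandbox is gone. */
  start(opts: { sandboxId: string }): Promise<void>;
  /** Pauses the sandbox (user-initiated only). */
  pause(): Promise<void>;
  /** Extends the sandbox TTL; maps to sandbox.setTimeout() for E2B. */
  extendTimeout(ms: number): Promise<void>;
}
```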