sandbox-agent/foundry/research/friction/sandbox-agent.mdx
2026-03-25 12:23:14 -07:00

# Sandbox Agent Friction Log
## 2026-02-17 - uncommitted
### What I Was Working On
Stabilizing Daytona-backed Codex task initialization (`init_create_session`) and diagnosing repeated sandbox-agent `session/new` failures.
### Friction / Issue
Two issues compounded each other:
1. The backend added a local `45s` Promise timeout around `sandbox-agent` SDK `createSession()`, but the underlying ACP call is not abortable. Timed-out calls kept running in the background while retries started new session creates, causing overlapping ACP requests and noisy failures.
2. Daytona sandboxes were missing `node`/`npm`/`npx`, while the installed Codex ACP launcher is `npx @zed-industries/codex-acp`. Session initialization could hang/time out because the launcher dependency chain was incomplete.
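The first issue can be reproduced in isolation. The sketch below is illustrative only: `fakeCreateSession`, `withTimeout`, and `createWithRetries` are hypothetical names, the durations are scaled down, and nothing here calls the real `sandbox-agent` SDK. It shows that a `Promise.race`-style timeout rejects the caller but leaves the original call running, so each retry adds another in-flight request.

```typescript
// Hypothetical stand-in for a non-abortable session create (NOT the sandbox-agent SDK).
let inFlight = 0;
let maxInFlight = 0;

function fakeCreateSession(durationMs: number): Promise<string> {
  inFlight += 1;
  maxInFlight = Math.max(maxInFlight, inFlight);
  return new Promise((resolve) => {
    setTimeout(() => {
      inFlight -= 1;
      resolve("session-id");
    }, durationMs);
  });
}

// Local timeout wrapper: it rejects early, but it has no way to cancel `work`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("local timeout")), timeoutMs);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Retries on local timeout and reports the peak number of overlapping calls.
async function createWithRetries(attempts: number): Promise<number> {
  for (let i = 0; i < attempts; i += 1) {
    try {
      await withTimeout(fakeCreateSession(200), 10);
      break;
    } catch {
      // The timed-out call is still running; the next attempt starts another one.
    }
  }
  return maxInFlight;
}
```

With a 200 ms call and a 10 ms local timeout, `createWithRetries(3)` resolves to `3`: all three attempts are in flight at once because none of them was actually cancelled.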
### Attempted Fix / Workaround
1. Removed the local `45s` timeout wrapper around `SandboxAgent.createSession()` in the backend integration.
2. Updated sandbox-instance retry classification to avoid immediate retries for timeout/504 failures, while still retrying quick transient transport failures (502/503/connection reset/refused).
3. Kept Daytona on published `sandbox-agent 0.2.0` and set `SANDBOX_AGENT_ACP_REQUEST_TIMEOUT_MS` via backend env override (`HF_SANDBOX_AGENT_ACP_REQUEST_TIMEOUT_MS`, default `120000`).
4. Updated the Daytona bootstrap to install `nodejs` + `npm` (and validate `npx` availability) so the `codex-acp` launcher can run.
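A minimal sketch of the retry classification and the env override described above. The `TransportFailure` shape, the `RetryDecision` labels, and both function names are assumptions made for illustration; the actual backend code may structure this differently. Treating unrecognized failures as non-retryable is also an assumption.

```typescript
// Hypothetical failure shape; the real backend error type is not shown in this log.
interface TransportFailure {
  status?: number;    // HTTP status, if one was received
  code?: string;      // Node-style error code, e.g. "ECONNRESET"
  timedOut?: boolean; // true when the request timed out
}

type RetryDecision = "retry-now" | "no-immediate-retry";

function classifyFailure(failure: TransportFailure): RetryDecision {
  // Timeouts and 504s mean the slow call may still be running: do not pile on.
  if (failure.timedOut || failure.status === 504) return "no-immediate-retry";
  // Quick transient transport failures are safe to retry right away.
  if (failure.status === 502 || failure.status === 503) return "retry-now";
  if (failure.code === "ECONNRESET" || failure.code === "ECONNREFUSED") return "retry-now";
  // Assumption: anything unrecognized is not retried immediately.
  return "no-immediate-retry";
}

// Resolves the ACP request timeout from the backend env override, defaulting to 120000 ms.
function resolveAcpTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env.HF_SANDBOX_AGENT_ACP_REQUEST_TIMEOUT_MS;
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : 120000;
}
```

The resolved value would then be passed into the sandbox as `SANDBOX_AGENT_ACP_REQUEST_TIMEOUT_MS`.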
### Outcome
- `createSession` no longer races itself due to a local timeout.
- Timeout errors are surfaced directly instead of hidden behind repeated local timeout retries.
- Daytona sandboxes keep the published `sandbox-agent` bootstrap and now include the runtime prerequisites needed to launch Codex ACP.
## 2026-02-08 - uncommitted
### What I Was Working On
Wiring task initialization to create/poll sandbox-agent sessions through provider-resolved endpoints.
### Friction / Issue
Local test runs cannot assume a live sandbox-agent backend, so session bootstrap has to be treated as optional in tests and on clean machines.
### Attempted Fix / Workaround
1. Wrapped session creation in guarded error handling during task initialization.
2. Persisted task state as `queued` when session creation fails, while keeping sandbox metadata written.
3. Continued status tracking through runtime messages when a session is available.
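The guarded initialization above can be sketched roughly as follows. `TaskRecord`, `initTask`, and the `running` status are hypothetical names chosen for illustration; only the `queued` fallback and the retained sandbox metadata come from the notes above. The session factory is injected so the sketch does not depend on any particular sandbox-agent API.

```typescript
// Hypothetical task shape; field names are illustrative, not the real schema.
type TaskStatus = "queued" | "running";

interface TaskRecord {
  status: TaskStatus;
  sandboxId: string;  // sandbox metadata is written regardless of session outcome
  sessionId?: string; // only present when session creation succeeded
}

async function initTask(
  sandboxId: string,
  createSession: () => Promise<string>,
): Promise<TaskRecord> {
  // Start from the fallback state so a failure needs no extra bookkeeping.
  const record: TaskRecord = { status: "queued", sandboxId };
  try {
    record.sessionId = await createSession();
    record.status = "running";
  } catch {
    // No live sandbox-agent (tests, clean machines): leave the task queued.
  }
  return record;
}
```

In a test, passing `() => Promise.reject(new Error("no backend"))` as the factory yields a `queued` record with `sandboxId` set and no `sessionId`.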
### Outcome
- Task creation remains deterministic without a hard dependency on a running sandbox-agent process.
- Behavior is testable in CI/local environments that do not run sandbox-agent.
## 2026-02-12 - uncommitted
### What I Was Working On
Upgrading backend integration from legacy sandbox-agent session endpoints to `sandbox-agent@0.2.0` and validating Daytona-backed execution.
### Friction / Issue
`0.2.0` no longer exposes the legacy session REST endpoints used by the backend integration; direct session create/status polling via those paths returns `404`.
### Attempted Fix / Workaround
1. Switched backend integration to `sandbox-agent` SDK (`SandboxAgent.connect`, `createSession`, `getSession`, `getEvents`).
2. Added status inference from SDK state/events for compatibility with the existing task status sync actor.
3. Upgraded Daytona provider to install/start `sandbox-agent 0.2.0` in sandboxes and expose a preview endpoint for SDK calls.
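As a rough illustration of the status inference in step 2, the sketch below folds a list of events into a single status. The event kinds (`turn_started`, `turn_ended`, `error`) and the status names are invented for this example; they are not the `sandbox-agent` SDK's actual event schema, which this log does not describe.

```typescript
// Hypothetical event and status shapes, invented for illustration only.
type EventKind = "turn_started" | "turn_ended" | "error";

interface SessionEventLike {
  kind: EventKind;
}

type InferredStatus = "idle" | "running" | "error";

// The most recent relevant event wins; an empty history is treated as idle.
function inferStatus(events: SessionEventLike[]): InferredStatus {
  let status: InferredStatus = "idle";
  for (const event of events) {
    if (event.kind === "turn_started") status = "running";
    else if (event.kind === "turn_ended") status = "idle";
    else if (event.kind === "error") status = "error";
  }
  return status;
}
```

A status sync actor that previously polled a REST status endpoint could call a function like this on each poll instead, which keeps its own interface unchanged.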
### Outcome
- Backend no longer depends on removed legacy session REST endpoints.
- Daytona flow is aligned with `sandbox-agent 0.2.0` runtime and SDK usage.