sandbox-agent/foundry/research/friction/general.mdx
Nathan Flurry ae191d1ae1
Refactor Foundry GitHub state and sandbox runtime (#247)
* Move Foundry HTTP APIs out of /api/rivet

* Move Foundry HTTP APIs onto /v1

* Fix Foundry Rivet base path and frontend endpoint fallback

* Configure Foundry Rivet runner pool for /v1

* Remove Foundry Rivet runner override

* Serve Foundry Rivet routes directly from Bun

* Log Foundry RivetKit deployment friction

* Add actor display metadata

* Tighten actor schema constraints

* Reset actor persistence baseline

* Remove temporary actor key version prefix

Railway has no persistent volumes so stale actors are wiped on
each deploy. The v2 key rotation is no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Cache app workspace actor handle across requests

Every request was calling getOrCreate on the Rivet engine API
to resolve the workspace actor, even though it's always the same
actor. Cache the handle and invalidate on error so retries
re-resolve. This eliminates redundant cross-region round-trips
to api.rivet.dev on every request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add temporary debug logging to GitHub OAuth exchange

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make squashed baseline migrations idempotent

Use CREATE TABLE IF NOT EXISTS and CREATE UNIQUE INDEX IF NOT
EXISTS so the squashed baseline can run against actors that
already have tables from the pre-squash migration sequence.
This fixes the "table already exists" error when org workspace
actors wake up with stale migration journals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "Make squashed baseline migrations idempotent"

This reverts commit 356c146035.

* Fix GitHub OAuth callback by removing retry wrapper

OAuth authorization codes are single-use. The appWorkspaceAction wrapper
retries failed calls up to 20 times, but if the code exchange succeeds
and a later step fails, every retry sends the already-consumed code,
producing "bad_verification_code" from GitHub.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add runner versioning to RivetKit registry

Uses Date.now() so each process start gets a unique version.
This ensures Rivet Cloud migrates actors to the new runner on
deploy instead of routing requests to stale runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add backend request and workspace logging

* Log callback request headers

* Make GitHub OAuth callback idempotent against duplicate requests

Clear oauthState before exchangeCode so duplicate callback requests
fail the state check instead of hitting GitHub with a consumed code.
Marked as HACK — root cause of duplicate HTTP requests is unknown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add temporary header dump on GitHub OAuth callback

Log all request headers on the callback endpoint to diagnose
the source of duplicate requests (Railway proxy, Cloudflare, browser).
Remove once root cause is identified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Defer slow GitHub org sync to workflow queue for fast OAuth callback

Split syncGithubSessionFromToken into a fast path (initGithubSession:
exchange code, get viewer, store token+identity) and a slow path
(syncGithubOrganizations: list orgs/installations, sync workspaces).

completeAppGithubAuth now returns the 302 redirect in ~2s instead of
~18s by enqueuing the org sync to the workspace workflow queue
(fire-and-forget). This eliminates the proxy timeout window that was
causing duplicate callback requests.

bootstrapAppGithubSession (dev-only) still calls the full synchronous
sync since proxy timeouts are not a concern and it needs the session
fully populated before returning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* foundry: async app repo import on org select

* foundry: parallelize app snapshot org reads

* repo: push all current workspace changes

* foundry: update runner version and snapshot logging

* Refactor Foundry GitHub state and sandbox runtime

Refactors Foundry around organization/repository ownership and adds an organization-scoped GitHub state actor plus a user-scoped GitHub auth actor, removing the old project PR/branch sync actors and repo PR cache.

Updates sandbox provisioning to rely on sandbox-agent for in-sandbox work, hardens Daytona startup and image-build behavior, and surfaces runtime and task-startup errors more clearly in the UI.

Extends workbench and GitHub state handling to track merged PR state, adds runtime-issue tracking, refreshes client/test/config wiring, and documents the main live Foundry test flow plus actor coordination rules.

Also updates the remaining Sandbox Agent install-version references in docs/examples to the current pinned minor channel.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 02:45:07 -07:00

387 lines
19 KiB
Text
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# General Friction Log
## 2026-03-13 - uncommitted
### What I Was Working On
Debugging slow GitHub OAuth sign-in in production after deploying backend request logging (d0ed0a4).
### Friction / Issue
Production logs showed two separate HTTP requests (different request IDs, ~9s apart) hitting `GET /v1/auth/github/callback` with the same `code` and `state` parameters. The first request succeeded (`exchangeCode` returned a token) but took ~18s total due to `syncGithubSessionFromToken` making multiple sequential GitHub API calls. The second request arrived while the first was still syncing, passed the oauth state validation (state was never cleared), and attempted `exchangeCode` with the already-consumed code, which GitHub rejected with `bad_verification_code`.
The root cause of the duplicate HTTP request is unknown. It is not `appWorkspaceAction` (no retry logic in the current version), not Railway proxy retry (no such config), and not a frontend double-navigation (the SPA is not involved during the OAuth redirect chain). Best hypothesis is the user refreshing during the ~18s blank page wait, but unconfirmed.
### Attempted Fix / Workaround
1. Made `completeAppGithubAuth` clear `oauthState`/`oauthStateExpiresAt` immediately after validation and before `exchangeCode`, so any duplicate request fails the state check instead of hitting GitHub with a consumed code.
2. Split `syncGithubSessionFromToken` into a fast path (`initGithubSession` — exchange code, get viewer, store token+identity) and a slow path (`syncGithubOrganizations` — list orgs, list installations, sync each workspace).
3. `completeAppGithubAuth` now uses the fast path and enqueues the slow org sync to the workspace workflow queue (`workspace.command.syncGithubSession`, fire-and-forget). The HTTP callback returns a 302 redirect in ~2s instead of ~18s, eliminating the proxy timeout window.
4. The frontend already polls `getAppSnapshot` every 500ms when any org has `syncStatus === "syncing"`, so the deferred sync is transparent to the user.
5. `bootstrapAppGithubSession` (dev-only) still calls the full synchronous `syncGithubSessionFromToken` since proxy timeouts are not a concern in dev and it needs the session fully populated before returning.
### Outcome
- OAuth callback responds in ~2s (exchangeCode + getViewer) instead of ~18s.
- Proxy retry window is eliminated — no duplicate requests should occur.
- Duplicate requests are still guarded by the state-clearing idempotency check.
- Organization data populates asynchronously via the workflow queue; the frontend shows loading state and polls until complete.
- Root cause of the duplicate HTTP request (likely Railway/Cloudflare proxy retry on slow GET) remains uninvestigated but is no longer a practical problem.
## 2026-03-05 - uncommitted
### What I Was Working On
Verifying the BaseUI frontend against the real `rivet-dev/sandbox-agent-testing` repo, creating live PR-backed tasks, and driving the flow through the browser.
### Friction / Issue
Three separate issues stacked together during live verification:
1. A half-created task actor remained in project indexes after earlier runtime failures. The actor state existed, but its durable task row did not, so repo overview polling spammed `Task not found` and kept trying to load an orphaned task.
2. Rebuilding the backend container outside `just dev` dropped injected GitHub auth, which made repo overview fall back to `Open PRs 0` until `GITHUB_TOKEN`/`GH_TOKEN` were passed back into `docker compose`.
3. In the create-task modal, the BaseUI-controlled form looked populated in the browser, but submit gating/click behavior was unreliable under browser automation, making it hard to distinguish frontend state bugs from backend failures.
### Attempted Fix / Workaround
1. Updated project-actor stale task pruning to treat `Task not found:` the same as actor-not-found and rebuilt the backend image.
2. Recovered the orphaned task by forcing an initialize attempt, which surfaced a missing `body?.providerId` guard in the task init workflow and led to pruning the stale project index row.
3. Recreated the backend with `GITHUB_TOKEN="$(gh auth token)" GH_TOKEN="$(gh auth token)" docker compose ... up -d --build backend` so PR sync could see live GitHub data again.
4. Used `agent-browser` plus screenshots to separate working paths (repo overview + PR visibility) from the remaining broken path (modal submit / task creation UI).
### Outcome
- Live repo overview now shows the real `sandbox-agent-testing` PRs again.
- The stale task actor no longer blocks repo overview polling.
- The remaining blocker is narrowed to the frontend create-task interaction path, plus missing agent API credentials for exercising real agent messaging end to end.
## 2026-03-06 - uncommitted
### What I Was Working On
Exercising the live selected-task UI end to end, including session creation, prompt send, and agent response rendering.
### Friction / Issue
The Docker dev backend container was starting on Bun `1.2.23` and accepting TCP connections on `7741`/`7750`, but every HTTP request stalled indefinitely. The same backend code responded immediately when started directly on the host with Bun `1.3.5`, so the hang was specific to the older Bun runtime in `docker/backend.dev.Dockerfile`.
### Attempted Fix / Workaround
1. Verified the stall both from the host and from inside the backend container with `curl`/`fetch`.
2. Started the backend directly on the host on an alternate port to confirm the code path itself was healthy.
3. Updated the dev backend image base from `oven/bun:1.2` to `oven/bun:1.3` so `docker compose` uses the working Bun line.
### Outcome
- Dev-runtime debugging is narrowed from "backend/UI path is broken" to a concrete Docker Bun version issue.
- After rebuild, the next verification step is the real selected-task transcript flow with agent messaging.
## 2026-02-17 - uncommitted
### What I Was Working On
Implementing Daytona snapshot-based sandbox creation and running required workspace validation.
### Friction / Issue
The workspace `node_modules` tree is partially root-owned in this environment. `pnpm install`/cleanup failed with `EACCES` and left missing local tool entrypoints (for example `turbo`/`typescript`), which blocked `pnpm -w typecheck/build/test` from running end-to-end.
### Attempted Fix / Workaround
1. Attempted workspace reinstall (`pnpm install`, `CI=true pnpm install`) and package-level reinstall.
2. Attempted cleanup/recreate of `node_modules`, but root-owned files could not be removed.
3. Added temporary local shims for missing tool entrypoints to continue targeted validation.
### Outcome
- Daytona-specific changes and backend tests were validated.
- Full workspace validation remains blocked until `node_modules` ownership is repaired (or container is recreated).
## 2026-02-16 - uncommitted
### What I Was Working On
Implementing git-spice-backed stack actions and repo overview in the frontend/actors.
### Friction / Issue
The `gs` binary on this environment resolves to Ghostscript (`/usr/bin/gs`), not git-spice. Relying on `gs` directly would execute the wrong tool and silently break stack actions.
### Attempted Fix / Workaround
1. Added git-spice command resolution that tries:
- `HF_GIT_SPICE_BIN` override
- `git-spice`
- `git spice` (git plugin form)
2. Avoided `gs` as a default executable.
3. Added explicit unavailability messaging when git-spice is not installed.
### Outcome
- Stack actions no longer depend on ambiguous `gs` resolution.
- Backend behavior is predictable across environments with/without git-spice installed.
## 2026-02-12 - c2517f2
### What I Was Working On
Fixing Daytona `hf create` failures where `task.attach` would exhaust retries with `Task not found`.
### Friction / Issue
Foundry was using RivetKit's KV-backed durable SQLite VFS via `rivetkit/db/drizzle`, which opens the SQLite DB keyed by `ctx.actorId`. Since actor instances can be rescheduled (new `actorId`) between requests, DB writes from initialization were not visible to later actions (e.g. `attach`), causing “Task not found” errors and action timeouts.
Separately, importing `bun:sqlite` directly broke:
- `tsup` builds (esbuild can't resolve `bun:sqlite` unless externalized)
- `vitest` runs (Vite resolver can't resolve `bun:` specifiers)
### Attempted Fix / Workaround
- Switched backend actor DB provider to a shared on-disk SQLite database at `config.backend.dbPath` using Bun's `bun:sqlite` + Drizzle, with inline migrations and per-connection PRAGMAs.
- Hid Bun-only module resolution behind dynamic imports so `vitest` can load modules.
- Used the KV-backed DB provider only for Node/Vitest environments (tests), while Bun runtime uses the shared on-disk DB.
### Outcome
- Daytona `hf create` now completes and returns a valid session and `daytona://...` target.
- `pnpm -w typecheck`, `pnpm -w build`, and `pnpm -w test` are green.
## 2026-02-09 - uncommitted
### What I Was Working On
Making `hf`/backend Bun-native and integrating OpenTUI without a Node fallback path.
### Friction / Issue
OpenTUI (`@opentui/core`) could not run under Node due Bun-specific imports/assets (`bun:ffi`, `.scm` module loading), which broke `hf` default interactive mode.
### Attempted Fix / Workaround
1. Removed runtime assumptions that backend/CLI would execute under Node.
2. Switched CLI entrypoint and backend launch commands to Bun.
3. Updated docs and tooling guidance to require Bun for runtime execution.
### Outcome
- OpenTUI remains the single TUI path.
- Runtime expectations are explicit: Bun is required for `hf` interactive execution.
## 2026-02-09 - uncommitted
### What I Was Working On
Implementing `hf` backend auto-ensure/auto-restart-on-outdated behavior and adding CLI tests for backend lifecycle logic.
### Friction / Issue
Vitest ESM module namespace exports are non-configurable, so `vi.spyOn(childProcess, "spawn")` failed when testing backend launch behavior.
### Attempted Fix / Workaround
1. Replaced direct `spyOn` with a hoisted `vi.mock("node:child_process", ...)`.
2. Injected mocked `spawn`/`execFileSync` via the module mock.
3. Updated tests to assert lifecycle behavior through the mocked module functions.
### Outcome
- Backend manager tests are stable under ESM.
- Full workspace tests pass with lifecycle coverage for outdated-backend restart behavior.
## 2026-02-08 - uncommitted
### What I Was Working On
Finalizing migration implementation and validation across code, docs, and tests.
### Friction / Issue
The environment did not provide `rg`, and docs/policy files still described Rust-era workflows after runtime migration.
### Attempted Fix / Workaround
1. Switched repository discovery to `find`/`grep`.
2. Rewrote project guidance files (`CLAUDE.md`, `skills/SKILL.md`, docs, `SPEC.md`) to match the TypeScript architecture.
3. Added missing TUI test coverage so workspace-wide test runs no longer fail on packages without tests.
### Outcome
- Full workflow is now documented around TypeScript + pnpm + Turborepo + RivetKit actors.
- Validation pipeline is runnable with one consistent command set.
## 2026-02-08 - uncommitted
### What I Was Working On
Running full workspace test validation (`pnpm -w test`) for the migrated monorepo.
### Friction / Issue
Backend integration tests depend on native `better-sqlite3` bindings, which were unavailable in this environment.
### Attempted Fix / Workaround
1. Attempted `pnpm --filter @sandbox-agent/foundry-backend rebuild better-sqlite3`.
2. Added runtime capability detection in DB-backed backend tests.
3. Marked DB-backed tests with `it.skipIf(!hasBetterSqliteBinding)` so tests run when native bindings exist and skip cleanly otherwise.
### Outcome
- Full workspace test suite passes consistently.
- Backend unit coverage always runs; DB integration tests run automatically on environments with native bindings.
## 2026-02-09 - aab1012 (working tree)
### What I Was Working On
Cleaning up CLI UX noise while validating `hf` flows repeatedly.
### Friction / Issue
Bun emitted a warning on every `hf` invocation due unsupported wildcard `sideEffects` patterns in vendored RivetKit `package.json`.
### Attempted Fix / Workaround
1. Replaced wildcard `sideEffects` array in `packages/rivetkit-vendor/rivetkit/package.json` with `false`.
### Outcome
- Per-command warning spam is gone.
- `hf` command output is now readable during normal usage and smoke testing.
## 2026-02-09 - aab1012 (working tree)
### What I Was Working On
Fixing `hf` launch behavior after `just install` when OpenTUI assets were loaded under Node.
### Friction / Issue
Global launcher resolution depended on pnpm global bin + shell PATH state. In environments where Bun was not on PATH (or where another `hf` shim was used), CLI could execute under Node and fail with:
- `Unknown file extension ".scm"` from `@opentui/core/assets/...`
### Attempted Fix / Workaround
1. Updated `just install` to install a deterministic launcher at `~/.local/bin/hf`.
2. Launcher explicitly resolves Bun from `$HF_BUN` or `~/.bun/bin/bun` (with `command -v bun` fallback).
3. Launcher exits with a clear Bun-required error if Bun is unavailable.
### Outcome
- `hf` runs through Bun consistently after install, independent of pnpm global-bin PATH quirks.
- OpenTUI `.scm` asset load no longer goes through Node.
## 2026-02-09 - aab1012 (working tree)
### What I Was Working On
Eliminating `.scm` loader failures when `hf` is accidentally launched via Node.
### Friction / Issue
Even with Bun-first install scripts, user shells can still invoke `hf` through stale/hash/alias Node-based launch paths, causing OpenTUI asset load failure:
- `ERR_UNKNOWN_FILE_EXTENSION .scm`
### Attempted Fix / Workaround
1. Added CLI bootstrap guard in `packages/cli/src/index.ts`:
- If runtime is not Bun, re-exec with Bun (`$HF_BUN`, `~/.bun/bin/bun`, then `bun` on PATH).
2. Deferred OpenTUI import to dynamic import (`import("./tui.js")`) so Node can reach the bootstrap guard before loading OpenTUI assets.
### Outcome
- `node packages/cli/dist/index.js --help` now works (auto re-execs to Bun).
- `.scm` extension crash path is eliminated even when launcher is Node-based.
## 2026-02-17 - uncommitted
### What I Was Working On
Validating new git-spice stack integration tests under `HF_ENABLE_ACTOR_INTEGRATION_TESTS=1`.
### Friction / Issue
Running backend tests with the integration flag enabled triggered unrelated actor integration suites and produced long noisy failures (`Failed query ...`, `memory access out of bounds`) unrelated to the stack changes, making targeted validation difficult.
### Attempted Fix / Workaround
1. Switched to package-targeted test runs for deterministic coverage (`@sandbox-agent/foundry-backend` + `@sandbox-agent/foundry-frontend`).
2. Relied on required workspace validation (`pnpm -w typecheck`, `pnpm -w build`, `pnpm -w test`) plus targeted stack test files.
3. Stopped the runaway integration run and recorded this friction for follow-up.
### Outcome
- New stack-focused tests pass in deterministic targeted runs.
- Full required workspace checks pass.
- Integration-gated suite remains noisy and needs separate stabilization.
## 2026-03-05 - uncommitted
### What I Was Working On
Reviewing architecture for simplification opportunities.
### Friction / Issue
Considered merging `projectPrSync` (30s) and `projectBranchSync` (5s) into a single `projectSync` actor that polls at the faster cadence and does PR fetches every Nth tick. This would reduce actor count by one per repo but violates the single-responsibility-per-actor pattern established in the codebase. Mixed cadences within one actor add conditional tick logic, make the polling intervals harder to reason about independently, and couple two unrelated data sources (git branches vs GitHub API) into one failure domain.
### Attempted Fix / Workaround
None — rejected the idea during review.
### Outcome
- Keep `projectPrSync` and `projectBranchSync` as separate actors.
- Single-responsibility-per-sync-actor is the right pattern for this codebase.
## 2026-03-06 - 77341ff
### What I Was Working On
Bringing up the Docker-based local dev stack with `just dev` after the BaseUI frontend migration.
### Friction / Issue
Docker Desktop recovered, but the frontend container failed immediately with `Cannot find module @rollup/rollup-linux-arm64-gnu`. The dev compose setup bind-mounted the host workspace into `/app`, so the Linux container picked up macOS `node_modules` and missed Rollup's Linux optional package.
### Attempted Fix / Workaround
1. Confirmed Docker itself was healthy again by checking the Unix socket, `docker version`, and the backend health endpoint.
2. Reproduced the frontend crash inside `docker compose`.
3. Changed the frontend dev service to use named volumes for workspace `node_modules` and the pnpm store, and to run `pnpm install --frozen-lockfile` inside the container before starting Vite.
### Outcome
- Docker engine startup was restored.
- The compose stack no longer depends on host-architecture frontend dependencies.
- `just dev` can proceed to start the backend and Linux-native frontend services cleanly.
## 2026-03-06 - uncommitted
### What I Was Working On
Verifying the selected-task UI flow end to end in the browser: create repo, create task, select the task, start an agent session, and send a follow-up message.
### Friction / Issue
Local dev hit three stacked runtime issues during live UI verification:
1. The frontends Vite proxy and the backend/manager startup order were brittle enough that `/api/rivet/metadata` or the manager port `7750` could briefly hang or refuse connections during restarts, which made browser verification look flaky even when the backend eventually came up.
2. The new local sandbox provider initially persisted only the sandbox-agent endpoint, not its bearer token, so ACP session creation later failed with `401 Token Invalid`.
3. The exported local `OPENAI_API_KEY` / `CODEX_API_KEY` credentials came from local ChatGPT/Codex auth state but did not include the `api.responses.write` scope required by Codex ACP, so the agent session could start but failed when the model tried to answer.
### Attempted Fix / Workaround
1. Added permissive CORS on the backend wrapper and iterated on live browser verification until the wrapper + manager startup sequence was stable again.
2. Updated the local provider to return both sandbox-agent `endpoint` and `token`.
3. Updated `sandbox-instance` to refresh local-provider agent credentials instead of trusting stale persisted metadata across backend restarts.
4. Stopped injecting `OPENAI_API_KEY` / `CODEX_API_KEY` into the host-local sandbox-agent process so local Codex can fall back to machine-native auth instead of the under-scoped exported token.
### Outcome
- The browser flow now reaches the real selected-task transcript screen.
- Task creation and initial session creation work in the UI against the local provider.
- A remaining upstream auth/runtime blocker still prevents a clean verified assistant text response in the final follow-up-message step, so that part of the end-to-end flow is not yet reliable enough to claim complete.