Refactor Foundry GitHub state and sandbox runtime (#247)

* Move Foundry HTTP APIs out of /api/rivet

* Move Foundry HTTP APIs onto /v1

* Fix Foundry Rivet base path and frontend endpoint fallback

* Configure Foundry Rivet runner pool for /v1

* Remove Foundry Rivet runner override

* Serve Foundry Rivet routes directly from Bun

* Log Foundry RivetKit deployment friction

* Add actor display metadata

* Tighten actor schema constraints

* Reset actor persistence baseline

* Remove temporary actor key version prefix

Railway has no persistent volumes so stale actors are wiped on each deploy. The v2 key rotation is no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Cache app workspace actor handle across requests

Every request was calling getOrCreate on the Rivet engine API to resolve the workspace actor, even though it's always the same actor. Cache the handle and invalidate on error so retries re-resolve. This eliminates redundant cross-region round-trips to api.rivet.dev on every request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add temporary debug logging to GitHub OAuth exchange

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make squashed baseline migrations idempotent

Use CREATE TABLE IF NOT EXISTS and CREATE UNIQUE INDEX IF NOT EXISTS so the squashed baseline can run against actors that already have tables from the pre-squash migration sequence. This fixes the "table already exists" error when org workspace actors wake up with stale migration journals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "Make squashed baseline migrations idempotent"

This reverts commit 356c146035.

* Fix GitHub OAuth callback by removing retry wrapper

OAuth authorization codes are single-use. The appWorkspaceAction wrapper retries failed calls up to 20 times, but if the code exchange succeeds and a later step fails, every retry sends the already-consumed code, producing "bad_verification_code" from GitHub.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add runner versioning to RivetKit registry

Uses Date.now() so each process start gets a unique version. This ensures Rivet Cloud migrates actors to the new runner on deploy instead of routing requests to stale runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add backend request and workspace logging

* Log callback request headers

* Make GitHub OAuth callback idempotent against duplicate requests

Clear oauthState before exchangeCode so duplicate callback requests fail the state check instead of hitting GitHub with a consumed code. Marked as HACK; root cause of duplicate HTTP requests is unknown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add temporary header dump on GitHub OAuth callback

Log all request headers on the callback endpoint to diagnose the source of duplicate requests (Railway proxy, Cloudflare, browser). Remove once root cause is identified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Defer slow GitHub org sync to workflow queue for fast OAuth callback

Split syncGithubSessionFromToken into a fast path (initGithubSession: exchange code, get viewer, store token+identity) and a slow path (syncGithubOrganizations: list orgs/installations, sync workspaces).

completeAppGithubAuth now returns the 302 redirect in ~2s instead of ~18s by enqueuing the org sync to the workspace workflow queue (fire-and-forget). This eliminates the proxy timeout window that was causing duplicate callback requests.

bootstrapAppGithubSession (dev-only) still calls the full synchronous sync since proxy timeouts are not a concern and it needs the session fully populated before returning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* foundry: async app repo import on org select

* foundry: parallelize app snapshot org reads

* repo: push all current workspace changes

* foundry: update runner version and snapshot logging

* Refactor Foundry GitHub state and sandbox runtime

Refactors Foundry around organization/repository ownership and adds an organization-scoped GitHub state actor plus a user-scoped GitHub auth actor, removing the old project PR/branch sync actors and repo PR cache.

Updates sandbox provisioning to rely on sandbox-agent for in-sandbox work, hardens Daytona startup and image-build behavior, and surfaces runtime and task-startup errors more clearly in the UI.

Extends workbench and GitHub state handling to track merged PR state, adds runtime-issue tracking, refreshes client/test/config wiring, and documents the main live Foundry test flow plus actor coordination rules.

Also updates the remaining Sandbox Agent install-version references in docs/examples to the current pinned minor channel.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 09:01:17 +00:00 · 2026-03-13 02:45:07 -07:00 · 2026-03-13 02:45:07 -07:00 · ae191d1ae1
commit ae191d1ae1
parent 436eb4a3a3
102 changed files with 3490 additions and 2003 deletions
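
The duplicate-callback fix described in the commit message (clear `oauthState` before `exchangeCode`) can be sketched roughly as below. This is a minimal in-memory illustration under stated assumptions, not the Foundry implementation: `Session`, `consumeOauthState`, and `handleCallback` are hypothetical names, and `exchangeCode` is injected as a stand-in for the real GitHub code exchange.

```typescript
// Hypothetical session shape; in Foundry the real state lives in an actor.
interface Session {
  oauthState: string | null;
}

// Check and clear the stored state in one synchronous step. Because nothing
// is awaited in between, a duplicate callback sees null and fails the check.
function consumeOauthState(session: Session, presented: string): boolean {
  if (session.oauthState === null || session.oauthState !== presented) {
    return false;
  }
  session.oauthState = null; // single-use: clear before exchanging the code
  return true;
}

// exchangeCode is an injected stand-in for the real GitHub code exchange.
async function handleCallback(
  session: Session,
  state: string,
  code: string,
  exchangeCode: (code: string) => Promise<string>,
): Promise<string> {
  if (!consumeOauthState(session, state)) {
    throw new Error("invalid or already-used OAuth state");
  }
  // Reached at most once per issued state, so the code is never re-sent.
  return exchangeCode(code);
}
```

The point of the ordering is that the second of two identical callbacks is rejected locally rather than reaching GitHub with a consumed authorization code.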
--- a/foundry/CLAUDE.md
+++ b/foundry/CLAUDE.md
@@ -1,9 +1,5 @@
 # Project Instructions

-## Breaking Changes
-
-Do not preserve legacy compatibility. Implement the best current architecture, even if breaking.
-
 ## Language Policy

 Use TypeScript for all source code.
@@ -48,16 +44,18 @@ Use `pnpm` workspaces and Turborepo.
 ## Railway Logs

 - Production Foundry Railway logs can be read from a linked workspace with `railway logs --deployment --lines 200` or `railway logs <deployment-id> --deployment --lines 200`.
+- Production deploys should go through `git push` to the deployment branch/workflow. Do not use `railway up` for Foundry deploys.
 - If Railway logs fail because the workspace is not linked to the correct project/service/environment, run:
  `railway link --project 33e3e2df-32c5-41c5-a4af-dca8654acb1d --environment cf387142-61fd-4668-8cf7-b3559e0983cb --service 91c7e450-d6d2-481a-b2a4-0a916f4160fc`
 - That links this directory to the `sandbox-agent` project, `production` environment, and `foundry-api` service.
+- Production proxy chain: `api.sandboxagent.dev` routes through Cloudflare → Fastly/Varnish → Railway. When debugging request duplication, timeouts, or retry behavior, check headers like `cf-ray`, `x-varnish`, `x-railway-edge`, and `cdn-loop` to identify which layer is involved.

 ## Frontend + Client Boundary

 - Keep a browser-friendly GUI implementation aligned with the TUI interaction model wherever possible.
 - Do not import `rivetkit` directly in CLI or GUI packages. RivetKit client access must stay isolated inside `packages/client`.
 - All backend interaction (actor calls, metadata/health checks, backend HTTP endpoint access) must go through the dedicated client library in `packages/client`.
-- Outside `packages/client`, do not call backend endpoints directly (for example `fetch(.../api/rivet...)`), except in black-box E2E tests that intentionally exercise raw transport behavior.
+- Outside `packages/client`, do not call backend endpoints directly (for example `fetch(.../v1/rivet...)`), except in black-box E2E tests that intentionally exercise raw transport behavior.
 - GUI state should update in realtime (no manual refresh buttons). Prefer RivetKit push reactivity and actor-driven events; do not add polling/refetch for normal product flows.
 - Keep the mock workbench types and mock client in `packages/shared` + `packages/client` up to date with the frontend contract. The mock is the UI testing reference implementation while backend functionality catches up.
 - Keep frontend route/state coverage current in code and tests; there is no separate page-inventory doc to maintain.
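
When reading a header dump while debugging the proxy chain noted in this hunk, a small helper along these lines may make it quicker to see which layers a request passed through. It is only a sketch: the header names come from the note above, but `traceProxyLayers`, `ProxyTrace`, and the one-header-per-layer mapping are assumptions made for illustration.

```typescript
// Header names are taken from the CLAUDE.md note; mapping each header to
// exactly one layer is an assumption for this sketch.
const LAYER_HEADERS: Record<string, string> = {
  "cf-ray": "cloudflare",
  "x-varnish": "fastly-varnish",
  "x-railway-edge": "railway",
};

interface ProxyTrace {
  layers: string[]; // layers whose marker header was present
  cdnLoop: string[]; // raw cdn-loop entries, in the order they appear
}

// Accepts a plain header record (any key casing) and reports which
// proxy layers left a marker on the request.
function traceProxyLayers(headers: Record<string, string>): ProxyTrace {
  const lower: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    lower[name.toLowerCase()] = value;
  }

  const layers: string[] = [];
  for (const [name, layer] of Object.entries(LAYER_HEADERS)) {
    if (name in lower) layers.push(layer);
  }

  const loop = lower["cdn-loop"];
  const cdnLoop = loop
    ? loop.split(",").map((entry) => entry.trim()).filter((entry) => entry.length > 0)
    : [];

  return { layers, cdnLoop };
}
```

Comparing the trace of two duplicate requests (same or different `cf-ray`, for example) is one way to narrow down which layer issued the retry.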
@@ -105,9 +103,9 @@ For all Rivet/RivetKit implementation:

 ## Rivet Routing

-- Mount RivetKit directly on `/api/rivet` via `registry.handler(c.req.raw)`.
+- Mount RivetKit directly on `/v1/rivet` via `registry.handler(c.req.raw)`.
 - Do not add an extra proxy or manager-specific route layer in the backend.
-- Let RivetKit own metadata/public endpoint behavior for `/api/rivet`.
+- Let RivetKit own metadata/public endpoint behavior for `/v1/rivet`.

 ## Workspace + Actor Rules

@@ -121,6 +119,14 @@ For all Rivet/RivetKit implementation:
 - Keep strict single-writer ownership: each table/row has exactly one actor writer.
 - Parent actors (`workspace`, `project`, `task`, `history`, `sandbox-instance`) use command-only loops with no timeout.
 - Periodic syncing lives in dedicated child actors with one timeout cadence each.
+- Do not build blocking flows that wait on external systems to become ready or complete. Prefer push-based progression driven by actor messages, events, webhooks, or queue/workflow state changes.
+- Use workflows/background commands for any repo sync, sandbox provisioning, agent install, branch restack/rebase, or other multi-step external work. Do not keep user-facing actions/requests open while that work runs.
+- `send` policy: always `await` the `send(...)` call itself so enqueue failures surface immediately, but default to `wait: false`.
+- Only use `send(..., { wait: true })` for short, bounded mutations that should finish quickly and do not depend on external readiness, polling actors, provider setup, repo/network I/O, or long-running queue drains.
+- Request/action contract: wait only until the minimum resource needed for the client's next step exists. Example: task creation may wait for task actor creation/identity, but not for sandbox provisioning or session bootstrap.
+- Read paths must not force refresh/sync work inline. Serve the latest cached projection, mark staleness explicitly, and trigger background refresh separately when needed.
+- If a workflow needs to resume after some external work completes, model that as workflow state plus follow-up messages/events instead of holding the original request open.
+- Do not rely on retries for correctness or normal control flow. If a queue/workflow/external dependency is not ready yet, model that explicitly and resume from a push/event, instead of polling or retry loops.
 - Actor handle policy:
 - Prefer explicit `get` or explicit `create` based on workflow intent; do not default to `getOrCreate`.
 - Use `get`/`getForId` when the actor is expected to already exist; if missing, surface an explicit `Actor not found` error with recovery context.
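
The `send` policy and request/action contract added in this hunk can be illustrated with a small sketch. Everything here is hypothetical: `InMemoryQueue` is a stand-in and not the RivetKit API, and `createTask` and the message shape are invented for the example. It only models the `wait: false` path.

```typescript
// Hypothetical message and queue types; this is not the RivetKit API.
type Message = { type: string; taskId: string };

class InMemoryQueue {
  readonly pending: Message[] = [];

  // Resolves once the message is enqueued. A real wait: true would also
  // wait for the handler to finish; this sketch does not model that path.
  async send(msg: Message, opts: { wait: boolean } = { wait: false }): Promise<void> {
    if (opts.wait) {
      throw new Error("wait: true is not modeled in this sketch");
    }
    this.pending.push(msg);
  }
}

let nextId = 0;

// Returns as soon as the task identity exists and the follow-up work is
// enqueued. Sandbox provisioning runs later, off the request path.
async function createTask(
  queue: InMemoryQueue,
): Promise<{ taskId: string; status: "provisioning" }> {
  const taskId = `task-${++nextId}`;
  // Await the send itself so an enqueue failure surfaces to the caller,
  // but pass wait: false so the request does not block on provisioning.
  await queue.send({ type: "provision-sandbox", taskId }, { wait: false });
  return { taskId, status: "provisioning" };
}
```

Under this shape the client can navigate to the task immediately and learn about provisioning progress from later events rather than from the original request.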
@@ -142,7 +148,7 @@ For all Rivet/RivetKit implementation:
 - All external service calls (git CLI, GitHub CLI, sandbox-agent HTTP, tmux) must go through the `BackendDriver` interface on the runtime context.
 - Integration tests use `setupTest()` from `rivetkit/test` and are gated behind `HF_ENABLE_ACTOR_INTEGRATION_TESTS=1`.
 - End-to-end testing must run against the dev backend started via `docker compose -f compose.dev.yaml up` (host -> container). Do not run E2E against an in-process test runtime.
-  - E2E tests should talk to the backend over HTTP (default `http://127.0.0.1:7741/api/rivet`) and use real GitHub repos/PRs.
+  - E2E tests should talk to the backend over HTTP (default `http://127.0.0.1:7741/v1/rivet`) and use real GitHub repos/PRs.
  - For Foundry live verification, use `rivet-dev/sandbox-agent-testing` as the default testing repo unless the task explicitly says otherwise.
  - Secrets (e.g. `OPENAI_API_KEY`, `GITHUB_TOKEN`/`GH_TOKEN`) must be provided via environment variables, never hardcoded in the repo.
  - `~/misc/env.txt` and `~/misc/the-foundry.env` contain the expected local OpenAI + GitHub OAuth/App config for dev.