Refactor Foundry GitHub state and sandbox runtime (#247)

* Move Foundry HTTP APIs out of /api/rivet

* Move Foundry HTTP APIs onto /v1

* Fix Foundry Rivet base path and frontend endpoint fallback

* Configure Foundry Rivet runner pool for /v1

* Remove Foundry Rivet runner override

* Serve Foundry Rivet routes directly from Bun

* Log Foundry RivetKit deployment friction

* Add actor display metadata

* Tighten actor schema constraints

* Reset actor persistence baseline

* Remove temporary actor key version prefix

Railway has no persistent volumes so stale actors are wiped on each deploy. The v2 key rotation is no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Cache app workspace actor handle across requests

Every request was calling getOrCreate on the Rivet engine API to resolve the workspace actor, even though it's always the same actor. Cache the handle and invalidate on error so retries re-resolve. This eliminates redundant cross-region round-trips to api.rivet.dev on every request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add temporary debug logging to GitHub OAuth exchange

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make squashed baseline migrations idempotent

Use CREATE TABLE IF NOT EXISTS and CREATE UNIQUE INDEX IF NOT EXISTS so the squashed baseline can run against actors that already have tables from the pre-squash migration sequence. This fixes the "table already exists" error when org workspace actors wake up with stale migration journals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "Make squashed baseline migrations idempotent"

This reverts commit 356c146035.

* Fix GitHub OAuth callback by removing retry wrapper

OAuth authorization codes are single-use. The appWorkspaceAction wrapper retries failed calls up to 20 times, but if the code exchange succeeds and a later step fails, every retry sends the already-consumed code, producing "bad_verification_code" from GitHub.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add runner versioning to RivetKit registry

Uses Date.now() so each process start gets a unique version. This ensures Rivet Cloud migrates actors to the new runner on deploy instead of routing requests to stale runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add backend request and workspace logging

* Log callback request headers

* Make GitHub OAuth callback idempotent against duplicate requests

Clear oauthState before exchangeCode so duplicate callback requests fail the state check instead of hitting GitHub with a consumed code. Marked as HACK; root cause of duplicate HTTP requests is unknown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add temporary header dump on GitHub OAuth callback

Log all request headers on the callback endpoint to diagnose the source of duplicate requests (Railway proxy, Cloudflare, browser). Remove once root cause is identified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Defer slow GitHub org sync to workflow queue for fast OAuth callback

Split syncGithubSessionFromToken into a fast path (initGithubSession: exchange code, get viewer, store token+identity) and a slow path (syncGithubOrganizations: list orgs/installations, sync workspaces).

completeAppGithubAuth now returns the 302 redirect in ~2s instead of ~18s by enqueuing the org sync to the workspace workflow queue (fire-and-forget). This eliminates the proxy timeout window that was causing duplicate callback requests.

bootstrapAppGithubSession (dev-only) still calls the full synchronous sync since proxy timeouts are not a concern and it needs the session fully populated before returning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* foundry: async app repo import on org select

* foundry: parallelize app snapshot org reads

* repo: push all current workspace changes

* foundry: update runner version and snapshot logging

* Refactor Foundry GitHub state and sandbox runtime

Refactors Foundry around organization/repository ownership and adds an organization-scoped GitHub state actor plus a user-scoped GitHub auth actor, removing the old project PR/branch sync actors and repo PR cache.

Updates sandbox provisioning to rely on sandbox-agent for in-sandbox work, hardens Daytona startup and image-build behavior, and surfaces runtime and task-startup errors more clearly in the UI.

Extends workbench and GitHub state handling to track merged PR state, adds runtime-issue tracking, refreshes client/test/config wiring, and documents the main live Foundry test flow plus actor coordination rules.

Also updates the remaining Sandbox Agent install-version references in docs/examples to the current pinned minor channel.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
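
Reader's note on the OAuth callback items in the message above: the "clear oauthState before exchangeCode" idea can be illustrated with a minimal sketch. This is not the Foundry implementation. The `OauthSession` shape, `consumeOauthState`, `handleCallback`, and the `exchangeCode` parameter are hypothetical names chosen for illustration, and the 400 status for a rejected callback is an assumption.

```typescript
// Hypothetical session shape; the real Foundry session record is not shown in this commit.
interface OauthSession {
  oauthState: string | null;
}

// Returns true at most once per issued state. The stored state is cleared
// before the caller performs the single-use code exchange, so a duplicate
// callback fails this check instead of re-sending an already-consumed code.
function consumeOauthState(session: OauthSession, incomingState: string): boolean {
  if (session.oauthState === null || session.oauthState !== incomingState) {
    return false;
  }
  session.oauthState = null; // clear first, exchange second
  return true;
}

// Hypothetical callback handler. `exchangeCode` stands in for the GitHub
// token exchange and is passed in so the sketch stays self-contained.
async function handleCallback(
  session: OauthSession,
  state: string,
  code: string,
  exchangeCode: (code: string) => Promise<string>,
): Promise<{ status: number; token?: string }> {
  if (!consumeOauthState(session, state)) {
    return { status: 400 }; // duplicate or forged callback (status is an assumption)
  }
  // Deliberately no retry wrapper: the authorization code is single-use.
  const token = await exchangeCode(code);
  return { status: 302, token };
}
```

In this sketch a second callback carrying the same `state` is rejected locally, which matches the commit's stated goal of never sending a consumed code to GitHub.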
2026-04-15 05:02:11 +00:00 · 2026-03-13 02:45:07 -07:00 · 2026-03-13 02:45:07 -07:00 · ae191d1ae1
commit ae191d1ae1
parent 436eb4a3a3
102 changed files with 3490 additions and 2003 deletions
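
Reader's note on "Cache app workspace actor handle across requests" from the commit message: a cache-and-invalidate-on-error pattern might look roughly like the sketch below. It deliberately uses a generic resolver callback instead of any RivetKit call. `HandleCache`, `get`, `invalidate`, and `use` are hypothetical names, not APIs from this repository or from RivetKit.

```typescript
// Generic handle cache: resolve once, reuse across requests, and drop the
// cached handle when an operation fails so the next attempt re-resolves.
class HandleCache<T> {
  private handle: T | undefined = undefined;
  private readonly resolve: () => T;

  constructor(resolve: () => T) {
    // `resolve` stands in for the expensive lookup (an assumption in this sketch).
    this.resolve = resolve;
  }

  // Return the cached handle, resolving it on first use.
  get(): T {
    if (this.handle === undefined) {
      this.handle = this.resolve();
    }
    return this.handle;
  }

  // Forget the cached handle; the next get() resolves again.
  invalidate(): void {
    this.handle = undefined;
  }

  // Run an operation against the handle and invalidate the cache on failure.
  async use<R>(fn: (handle: T) => Promise<R>): Promise<R> {
    try {
      return await fn(this.get());
    } catch (err) {
      this.invalidate();
      throw err;
    }
  }
}
```

The error is rethrown after invalidation, so any retry policy stays with the caller; the cache only guarantees that a retry does not reuse a handle that just failed.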
--- a/research/acp/friction.md
+++ b/research/acp/friction.md
@@ -218,6 +218,26 @@ Update this file continuously during the migration.
 - Status: resolved
 - Links: `server/packages/sandbox-agent/src/router.rs`, `server/packages/sandbox-agent/src/acp_runtime/mod.rs`, `server/packages/sandbox-agent/tests/v1_api/acp_transport.rs`, `docs/advanced/acp-http-client.mdx`

+- Date: 2026-03-13
+- Area: Actor runtime shutdown and draining
+- Issue: Actors can continue receiving or finishing action work after shutdown has started, while actor cleanup clears runtime resources such as the database handle. In RivetKit this can surface as `Database not enabled` from `c.db` even when the actor definition correctly includes `db`.
+- Impact: User requests can fail with misleading internal errors during runner eviction or shutdown, and long-lived request paths can bubble up as HTTP 502/timeout failures instead of a clear retryable stopping/draining signal.
+- Proposed direction: Add a real runner draining state so actors stop receiving traffic before shutdown, and ensure actor cleanup does not clear `#db` until in-flight actions are fully quiesced or aborted. App-side request paths should also avoid waiting inline on long actor workflows when possible.
+- Decision: Open.
+- Owner: Unassigned.
+- Status: open
+- Links: `foundry/packages/backend/src/actors/workspace/app-shell.ts`, `/Users/nathan/rivet/rivetkit-typescript/packages/rivetkit/src/actor/instance/mod.ts`, `/Users/nathan/rivet/rivetkit-typescript/packages/rivetkit/src/drivers/engine/actor-driver.ts`
+
+- Date: 2026-03-12
+- Area: Foundry RivetKit serverless routing on Railway
+- Issue: Moving Foundry from `/api/rivet` to `/v1/rivet` exposed three RivetKit deployment couplings: `serverless.basePath` had to be updated explicitly for metadata/start routes, `configureRunnerPool` could not be used in production because the current Rivet token lacked permission to list datacenters, and wrapping `registry.handler(c.req.raw)` inside Hono route handlers produced unstable serverless runner startup under Railway until `/v1/rivet` was dispatched directly from `Bun.serve`.
+- Impact: `GET /v1/rivet/metadata` initially returned 404, app-shell actor creation failed during OAuth/session bootstrap, and Foundry sign-in blocked on `500` from `/v1/app/snapshot` and `/v1/auth/github/start`.
+- Proposed direction: Treat RivetKit serverless base path as an explicit deployment config when versioning routes, avoid relying on runner-pool auto-configuration unless the production token has the required Rivet control-plane permissions, and prefer direct top-level dispatch for RivetKit serverless routes instead of routing them through higher-level Hono middleware.
+- Decision: Accepted and implemented for Foundry. The backend now sets `serverless.basePath` to `/v1/rivet`, leaves runner-pool config to infrastructure, and serves RivetKit directly from the Bun server for `/v1/rivet`.
+- Owner: Unassigned.
+- Status: resolved
+- Links: `foundry/packages/backend/src/actors/index.ts`, `foundry/packages/backend/src/index.ts`
+
 - Date: 2026-02-10
 - Area: Agent selection contract for ACP bootstrap/session creation
 - Issue: `x-acp-agent` bound agent selection to transport bootstrap, which conflicted with Sandbox Agent meta-session goals where one client can manage sessions across multiple agents.