sandbox-agent/foundry/research/specs/async-action-fixes/01-task-creation-bootstrap-only.md
Nathan Flurry ae191d1ae1
2026-03-13 02:45:07 -07:00


Task Creation Should Return After Actor Bootstrap

Read 00-end-to-end-async-realtime-plan.md first for the governing migration order, runtime constraints, and realtime client model this brief assumes.

Problem

Task creation currently waits for full provisioning: naming, repo checks, sandbox creation/resume, sandbox-agent install/start, sandbox-instance wiring, and session creation.

That makes a user-facing action depend on queue-backed and provider-backed work that can take minutes. The client only needs the task actor to exist so it can navigate to the task and observe progress.

Current Code Context

  • Workspace entry point: foundry/packages/backend/src/actors/workspace/actions.ts
  • Project task creation path: foundry/packages/backend/src/actors/project/actions.ts
  • Task action surface: foundry/packages/backend/src/actors/task/index.ts
  • Task workflow: foundry/packages/backend/src/actors/task/workflow/index.ts
  • Task init/provision steps: foundry/packages/backend/src/actors/task/workflow/init.ts
  • Provider-backed long steps currently happen inside the task provision workflow.

Target Contract

  • createTask returns once the task actor exists and initial task metadata is persisted.
  • The response includes the task identity the client needs for follow-up reads and subscriptions.
  • Provisioning continues in the background through the task workflow.
  • Progress and failure are surfaced through task state, history events, and workbench updates.
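Under this contract, the create response only needs to carry the task identity and bootstrap metadata. A minimal sketch of the shape — all field names here are hypothetical, not the real schema:

```typescript
// Hypothetical response shape -- field names are illustrative only.
interface CreateTaskResponse {
  taskId: string;    // stable identity for follow-up reads and subscriptions
  projectId: string; // the project the task was created under (or resolved to)
  status: string;    // provisioning state, e.g. "init_enqueue_provision"
  createdAt: string; // bootstrap timestamp persisted before returning
}

// What an immediate response could look like, before any provisioning runs:
const response: CreateTaskResponse = {
  taskId: "task_01",
  projectId: "proj_01",
  status: "init_enqueue_provision",
  createdAt: new Date().toISOString(),
};
```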

Proposed Fix

  1. Restore the async split between initialize and provision.
  2. Keep task.command.initialize responsible for:
    • creating the task actor
    • bootstrapping DB rows
    • persisting any immediately-known metadata
    • returning the current task record
  3. After initialize completes, enqueue task.command.provision with wait: false.
  4. Change workspace.createTask to:
    • create or resolve the project
    • create the task actor
    • call task.initialize(...)
    • stop awaiting task.provision(...)
    • broadcast a workbench/task update
    • return the task record immediately
  5. Persist an explicit provisioning status so the frontend can distinguish the queued, in-progress, and failed phases:
    • init_enqueue_provision
    • init_ensure_name
    • init_create_sandbox
    • init_ensure_agent
    • init_create_session
    • running
    • error
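The steps above can be sketched roughly as follows. `resolveProject`, `getTaskActor`, `task.initialize`, `task.provision`, and `broadcastWorkbenchUpdate` are hypothetical stand-ins for the real actor APIs, and the `wait: false` enqueue is modeled as an un-awaited call:

```typescript
type ProvisioningStatus =
  | "init_enqueue_provision"
  | "init_ensure_name"
  | "init_create_sandbox"
  | "init_ensure_agent"
  | "init_create_session"
  | "running"
  | "error";

interface TaskRecord {
  taskId: string;
  projectId: string;
  status: ProvisioningStatus;
}

// Hypothetical actor handle -- the real one comes from the RivetKit registry.
interface TaskActor {
  initialize(input: { projectId: string }): Promise<TaskRecord>;
  provision(): Promise<void>; // long-running; enqueued with wait: false
}

async function createTask(
  resolveProject: () => Promise<string>,
  getTaskActor: (taskId: string) => TaskActor,
  broadcastWorkbenchUpdate: (record: TaskRecord) => void,
): Promise<TaskRecord> {
  // 1. Create or resolve the project.
  const projectId = await resolveProject();

  // 2-3. Create the task actor and run only the fast bootstrap path.
  const taskId = `task_${Date.now()}`;
  const task = getTaskActor(taskId);
  const record = await task.initialize({ projectId });

  // 4. Fire-and-forget the slow provision path (wait: false semantics).
  //    Deliberately NOT awaited, so a provider timeout cannot fail this request;
  //    failures surface through task state and history events instead.
  void task.provision().catch(() => {});

  // 5. Broadcast and return immediately with the bootstrap record.
  broadcastWorkbenchUpdate(record);
  return record;
}
```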

Files Likely To Change

  • foundry/packages/backend/src/actors/workspace/actions.ts
  • foundry/packages/backend/src/actors/project/actions.ts
  • foundry/packages/backend/src/actors/task/index.ts
  • foundry/packages/backend/src/actors/task/workflow/index.ts
  • foundry/packages/backend/src/actors/task/workflow/init.ts
  • foundry/packages/frontend/src/components/workspace-dashboard.tsx
  • foundry/packages/client/src/remote/workbench-client.ts

Client Impact

  • Task creation UI should navigate immediately to the task page.
  • The page should render a provisioning state from task status instead of treating create as an all-or-nothing spinner.
  • Any tab/session creation that depends on provisioning should observe task state and wait for readiness asynchronously.
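On the client side, readiness becomes an observation over task status rather than part of the create call. A minimal sketch — the subscription shape is hypothetical, standing in for the workbench realtime channel:

```typescript
type TaskStatus = string; // e.g. "init_create_sandbox" | "running" | "error"

type Unsubscribe = () => void;
// Hypothetical subscription API; the real client would subscribe through
// the workbench realtime channel rather than a bare callback.
type Subscribe = (onStatus: (status: TaskStatus) => void) => Unsubscribe;

// Resolve once the task reaches "running"; reject on "error".
// Any init_* state keeps the provisioning UI rendered and keeps waiting.
function waitForTaskReady(subscribe: Subscribe): Promise<void> {
  return new Promise((resolve, reject) => {
    const unsubscribe = subscribe((status) => {
      if (status === "running") {
        unsubscribe();
        resolve();
      } else if (status === "error") {
        unsubscribe();
        reject(new Error("task provisioning failed"));
      }
    });
  });
}
```

Tab or session creation that needs a live sandbox can await `waitForTaskReady` while the task page itself renders immediately from the bootstrap record.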

Acceptance Criteria

  • Creating a task never waits on sandbox creation or session creation.
  • A timeout in provider setup does not make the original create request fail after several minutes.
  • After a backend restart, the task workflow can resume provisioning from durable state without requiring the client to retry create.

Implementation Notes

  • Preserve the existing task actor as the single writer for task runtime state.
  • Do not introduce a second creator path for task actors; keep one create/bootstrap path and one background provision path.
  • Fresh-agent check: verify that createWorkbenchTask and any dashboard create flow still have enough data to navigate immediately after this change.