sandbox-agent/foundry/packages/backend/CLAUDE.md
Nathan Flurry f45a467484
chore(foundry): migrate to actions (#262)
2026-03-16 15:23:59 -07:00

Backend Notes

Actor Hierarchy

Keep the backend actor tree aligned with this shape unless we explicitly decide to change it:

OrganizationActor (direct coordinator for tasks)
├─ AuditLogActor (organization-scoped global feed)
├─ GithubDataActor
├─ TaskActor(task)
│  ├─ taskSessions      → session metadata/transcripts
│  └─ taskSandboxes     → sandbox instance index
└─ SandboxInstanceActor(sandboxProviderId, sandboxId) × N

Coordinator Pattern

Actors follow a coordinator pattern where each coordinator is responsible for:

  1. Index tables — keeping a local SQLite index/summary of its child actors' data
  2. Create/destroy — handling lifecycle of child actors
  3. Routing — resolving lookups to the correct child actor

Children push updates up to their direct coordinator only. Coordinators broadcast changes to connected clients. This keeps the read path local (no fan-out to children).
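
The push-up/read-local flow above can be sketched with an in-memory model. This is an illustration only: `Coordinator`, `pushSummary`, and `listSummaries` are hypothetical names, and a Map stands in for the coordinator's SQLite index table.

```typescript
// Hypothetical in-memory model of the coordinator pattern; not RivetKit code.
interface ChildSummary {
  childId: string;
  title: string;
  updatedAt: number;
}

class Coordinator {
  // Stands in for the coordinator's local SQLite index table.
  private readonly index = new Map<string, ChildSummary>();
  // Stands in for connected clients that receive broadcasts.
  private readonly listeners: Array<(s: ChildSummary) => void> = [];

  subscribe(listener: (s: ChildSummary) => void): void {
    this.listeners.push(listener);
  }

  // Write path: a child pushes its latest summary up to its direct coordinator.
  pushSummary(summary: ChildSummary): void {
    this.index.set(summary.childId, summary);
    for (const listener of this.listeners) listener(summary);
  }

  // Read path: served from the local index only, with no fan-out to children.
  listSummaries(): ChildSummary[] {
    return Array.from(this.index.values()).sort((a, b) => b.updatedAt - a.updatedAt);
  }
}

const org = new Coordinator();
const broadcasts: string[] = [];
org.subscribe((s) => broadcasts.push(s.childId));
org.pushSummary({ childId: "task-1", title: "Fix login", updatedAt: 1 });
org.pushSummary({ childId: "task-2", title: "Add tests", updatedAt: 2 });
org.pushSummary({ childId: "task-1", title: "Fix login flow", updatedAt: 3 });
```

The read never touches a child: `listSummaries()` returns whatever the children last pushed, which is why a stale index is fixed on the write path rather than by adding a fan-out read.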

Coordinator hierarchy and index tables

OrganizationActor (coordinator for tasks + auth users)
│
│  Index tables:
│  ├─ taskIndex          → TaskActor index (taskId → repoId + branchName)
│  ├─ taskSummaries      → TaskActor materialized sidebar projection
│  ├─ authSessionIndex   → UserActor index (session token → userId)
│  ├─ authEmailIndex     → UserActor index (email → userId)
│  └─ authAccountIndex   → UserActor index (OAuth account → userId)
│
├─ TaskActor (coordinator for sessions + sandboxes)
│  │
│  │  Index tables:
│  │  ├─ taskWorkspaceSessions → Session index (session metadata + transcript)
│  │  └─ taskSandboxes         → SandboxInstanceActor index (sandbox history)
│  │
│  └─ SandboxInstanceActor (leaf)
│
├─ AuditLogActor (organization-scoped audit log, not a coordinator)
└─ GithubDataActor (GitHub API cache, not a coordinator)

When adding a new index table, annotate it in the schema file with a doc comment identifying it as a coordinator index and which child actor it indexes (see existing examples).

Lazy Task Actor Creation — CRITICAL

Task actors must NEVER be created during GitHub sync or bulk operations. Creating hundreds of task actors simultaneously causes OOM crashes. An org can have 200+ PRs; spawning an actor per PR kills the process.

The two creation points

Task actors are created at two primary points (the complete list of allowed getOrCreateTask call sites is under "The rule" below):

  1. createTaskMutation in task-mutations.ts: the backend entry point for explicit task creation via getOrCreateTask. Triggered by an explicit user action (the "New Task" button). One actor at a time.

  2. backend-client.ts client helper: calls client.task.getOrCreate(...). This is the lazy materialization point. When a user clicks a virtual task in the sidebar, the client creates the actor, which self-initializes in getCurrentRecord() (workflow/common.ts) by reading branch/title from the org's getTaskIndexEntry action.

The rule

Never use getOrCreateTask inside a sync loop, webhook handler, or any bulk operation. That's what caused the OOM — 186 actors spawned simultaneously during PR sync.

getOrCreateTask IS allowed in:

  • createTaskMutation — explicit user "New Task" action
  • requireWorkspaceTask — user-initiated actions (createSession, sendMessage, etc.) that may hit a virtual task
  • getTask action on the org — called by sandbox actor and client, needs to materialize virtual tasks
  • backend-client.ts client helper — lazy materialization when user views a task

Virtual tasks (PR-driven)

During PR sync, refreshTaskSummaryForBranchMutation is called for every changed PR (via github-data's emitPullRequestChangeEvents). It writes virtual task entries to the org actor's local taskIndex + taskSummaries tables only. No task actor is spawned. No cross-actor calls to task actors.

When the user interacts with a virtual task (clicks it, creates a session):

  1. Client or org actor calls getOrCreate on the task actor key → actor is created with empty DB
  2. Any action on the actor calls getCurrentRecord() → sees empty DB → reads branch/title from org's getTaskIndexEntry → calls initBootstrapDbActivity + initCompleteActivity → task is now real
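
The two paths above (bulk sync writes index rows only; the actor appears on first interaction) can be modeled in memory as follows. This is a sketch under stated assumptions: `OrgModel`, `TaskActorModel`, and `syncPullRequests` are hypothetical stand-ins, and only `getOrCreateTask`, `getCurrentRecord`, and the `taskIndex` name mirror identifiers from this document.

```typescript
// Hypothetical in-memory model of virtual tasks and lazy materialization.
interface TaskIndexEntry {
  taskId: string;
  branchName: string;
  title: string;
}

interface TaskRecord {
  branchName: string;
  title: string;
}

class TaskActorModel {
  private record: TaskRecord | null = null; // empty DB until first use
  private readonly taskId: string;
  private readonly readIndexEntry: (taskId: string) => TaskIndexEntry | undefined;

  constructor(taskId: string, readIndexEntry: (taskId: string) => TaskIndexEntry | undefined) {
    this.taskId = taskId;
    this.readIndexEntry = readIndexEntry;
  }

  // Plays the role of getCurrentRecord(): self-initialize from the org index on first access.
  getCurrentRecord(): TaskRecord {
    if (this.record === null) {
      const entry = this.readIndexEntry(this.taskId);
      if (!entry) throw new Error(`No task index entry for ${this.taskId}`);
      this.record = { branchName: entry.branchName, title: entry.title };
    }
    return this.record;
  }
}

class OrgModel {
  readonly taskIndex = new Map<string, TaskIndexEntry>();
  readonly liveActors = new Map<string, TaskActorModel>();

  // Bulk sync path: writes local index rows only and never creates an actor.
  syncPullRequests(entries: TaskIndexEntry[]): void {
    for (const entry of entries) this.taskIndex.set(entry.taskId, entry);
  }

  // User-interaction path: creates the actor on demand, one at a time.
  getOrCreateTask(taskId: string): TaskActorModel {
    let actor = this.liveActors.get(taskId);
    if (!actor) {
      actor = new TaskActorModel(taskId, (id) => this.taskIndex.get(id));
      this.liveActors.set(taskId, actor);
    }
    return actor;
  }
}

const orgModel = new OrgModel();
orgModel.syncPullRequests(
  Array.from({ length: 200 }, (_, i) => ({
    taskId: `task-${i}`,
    branchName: `feature/${i}`,
    title: `PR ${i}`,
  })),
);
const actorsAfterSync = orgModel.liveActors.size; // sync created no actors
const opened = orgModel.getOrCreateTask("task-7").getCurrentRecord();
```

Syncing 200 entries leaves `liveActors` empty; only the one task the user opens is materialized.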

Call sites to watch

  • refreshTaskSummaryForBranchMutation — called in bulk during sync. Must ONLY write to org local tables. Never create task actors or call task actor actions.
  • emitPullRequestChangeEvents in github-data — iterates all changed PRs. Must remain fire-and-forget with no actor fan-out.

Ownership Rules

  • OrganizationActor is the organization coordinator, direct coordinator for tasks, and lookup/index owner. It owns the task index, task summaries, and repo catalog.
  • AuditLogActor is organization-scoped. There is one organization-level audit log feed.
  • A TaskActor represents one branch. Treat 1 task = 1 branch once branch assignment is finalized.
  • TaskActor can have many sessions.
  • TaskActor can reference many sandbox instances historically, but should have only one active sandbox/session at a time.
  • Session unread state and draft prompts are backend-owned workspace state, not frontend-local state.
  • Branch names are immutable after task creation. Do not implement branch-rename flows.
  • SandboxInstanceActor stays separate from TaskActor; tasks/sessions reference it by identity.
  • The backend stores no local git state. No clones, no refs, no working trees, and no git-spice. Repository metadata comes from GitHub API data and webhook events. Any working-tree git operation runs inside a sandbox via executeInSandbox().
  • When a backend request path must aggregate multiple independent actor calls or reads, prefer bounded parallelism over sequential fan-out when correctness permits. Do not serialize independent work by default.
  • Only a coordinator creates/destroys its children. Do not create child actors from outside the coordinator.
  • Children push state changes up to their direct coordinator only. Task actors push summary updates directly to the organization actor.
  • Read paths must use the coordinator's local index tables. Do not fan out to child actors on the hot read path.
  • Never build "enriched" read actions that chain through multiple actors (e.g., coordinator → child actor → sibling actor). If data from multiple actors is needed for a read, it should already be materialized in the coordinator's index tables via push updates. If it's not there, fix the write path to push it — do not add a fan-out read path.
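
For the bounded-parallelism rule above, one possible shape is a small worker-pool helper. `mapBounded` and `fakeActorCall` are hypothetical names introduced for this sketch; they are not existing backend helpers.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight; results keep input order.
async function mapBounded<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the input is exhausted.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

// Usage: a fake actor call that records peak concurrency.
let inFlight = 0;
let peak = 0;
async function fakeActorCall(id: number): Promise<string> {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  inFlight--;
  return `summary-${id}`;
}

const pending = mapBounded([1, 2, 3, 4, 5, 6], 2, fakeActorCall);
```

If any worker rejects, the returned promise rejects with that error while already-started calls run to completion. Callers that need partial results should catch inside the worker.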

Drizzle Migration Maintenance

After changing any actor's db/schema.ts, you must regenerate the corresponding migration so the runtime creates tables that match the schema. Forgetting this step causes "no such table" errors at runtime.

  1. Generate a new drizzle migration. Run from packages/backend:

    npx drizzle-kit generate --config=./src/actors/<actor>/db/drizzle.config.ts
    

    If the interactive prompt is unavailable (e.g. in a non-TTY), manually create a new .sql file under ./src/actors/<actor>/db/drizzle/ and add the corresponding entry to meta/_journal.json.

  2. Regenerate the compiled migrations.ts. Run from the foundry root:

    npx tsx packages/backend/src/actors/_scripts/generate-actor-migrations.ts
    
  3. Verify insert/upsert calls. Every column with .notNull() (and no .default(...)) must be provided a value in all insert() and onConflictDoUpdate() calls. Missing a NOT NULL column causes a runtime constraint violation, not a type error.

  4. Nuke RivetKit state in dev after migration changes to start fresh:

    docker compose -f compose.dev.yaml down
    docker volume rm foundry_foundry_rivetkit_storage
    docker compose -f compose.dev.yaml up -d
    

Actors with drizzle migrations: organization, audit-log, task. Other actors (user, github-data) use inline migrations without drizzle.

Workflow Step Nesting — FORBIDDEN

Never call c.step() / ctx.step() from inside another step's run callback. RivetKit workflow steps cannot be nested. Doing so causes the runtime error: "Cannot start a new workflow entry while another is in progress."

This means:

  • Functions called from within a step run callback must NOT use c.step(), c.loop(), c.sleep(), or c.queue.next().
  • If a mutation function needs to be called both from a step and standalone, it must only do plain DB/API work — no workflow primitives. The workflow step wrapping belongs in the workflow file, not in the mutation.
  • Helper wrappers that conditionally call c.step() (like a runSyncStep pattern) are dangerous — if the caller is already inside a step, the nested c.step() will crash at runtime with no compile-time warning.

Rule of thumb: Workflow primitives (step, loop, sleep, queue.next) may only appear at the top level of a workflow function or inside a loop callback — never inside a step's run.
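
The nesting failure can be reproduced with a mock context. `MockWorkflowCtx` below is a hypothetical stand-in that only imitates the "one entry at a time" rule; it is not the RivetKit API, and the real `step` signature may differ. `upsertProfile` and `upsertProfileInStep` are likewise illustrative names.

```typescript
// Mock context that mimics the "no nested steps" rule; the real RivetKit API may differ.
class MockWorkflowCtx {
  private active: string | null = null;

  async step<T>(stepName: string, run: () => Promise<T>): Promise<T> {
    if (this.active !== null) {
      throw new Error("Cannot start a new workflow entry while another is in progress.");
    }
    this.active = stepName;
    try {
      return await run();
    } finally {
      this.active = null;
    }
  }
}

// Plain mutation: DB/API work only, no workflow primitives, so it is safe to call anywhere.
async function upsertProfile(db: Map<string, string>, login: string): Promise<void> {
  db.set("login", login);
}

// Dangerous helper: wraps itself in a step, so it fails when called from inside one.
async function upsertProfileInStep(
  ctx: MockWorkflowCtx,
  db: Map<string, string>,
  login: string,
): Promise<void> {
  await ctx.step("upsert-profile", () => upsertProfile(db, login));
}

async function demo(): Promise<{ login: string | undefined; nestedError: string }> {
  const ctx = new MockWorkflowCtx();
  const db = new Map<string, string>();

  // Correct: the step wrapper lives at the workflow level and calls the plain mutation.
  await ctx.step("sync", () => upsertProfile(db, "octocat"));

  // Incorrect: a step started from inside another step's run callback.
  let nestedError = "";
  try {
    await ctx.step("outer", () => upsertProfileInStep(ctx, db, "hubot"));
  } catch (err) {
    nestedError = err instanceof Error ? err.message : String(err);
  }
  return { login: db.get("login"), nestedError };
}

const stepDemo = demo();
```

The first call succeeds because the mutation contains no workflow primitives. The second fails at runtime, and nothing in the types warns about it.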

SQLite Constraints

  • Single-row tables must use an integer primary key with CHECK (id = 1) to enforce the singleton invariant at the database level.
  • Follow the task actor pattern for metadata/profile rows and keep the fixed row id in code as 1, not a string sentinel.
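
A sketch of the singleton-row pattern as raw SQL held in TypeScript constants. The table and column names (`example_profile`, `display_name`) are hypothetical; only the integer primary key with `CHECK (id = 1)` and the fixed numeric id come from the rules above.

```typescript
// Hypothetical table and column names; only the CHECK (id = 1) pattern comes from the rules above.
const SINGLETON_ROW_ID = 1; // numeric id kept in code, not a string sentinel

const createProfileTableSql = `
CREATE TABLE IF NOT EXISTS example_profile (
  id INTEGER PRIMARY KEY CHECK (id = 1),
  display_name TEXT NOT NULL
);`;

// The upsert always targets the fixed row id, so a second row cannot be inserted.
const upsertProfileSql = `
INSERT INTO example_profile (id, display_name)
VALUES (${SINGLETON_ROW_ID}, ?)
ON CONFLICT (id) DO UPDATE SET display_name = excluded.display_name;`;
```

Any insert with an id other than 1 violates the CHECK constraint, so the invariant holds even if application code has a bug.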

Multiplayer Correctness

Per-user UI state must live on the user actor, not on shared task/session actors. This is critical for multiplayer — multiple users may view the same task simultaneously with different active sessions, unread states, and in-progress drafts.

Per-user state (user actor): active session tab, unread counts, draft text, draft attachments. Keyed by (userId, taskId, sessionId).

Task-global state (task actor): session transcript, session model, session runtime status, sandbox identity, task status, branch name, PR state. These are shared across all users viewing the task — that is correct behavior.

Do not store per-user preferences, selections, or ephemeral UI state on shared actors. If a field's value should differ between two users looking at the same task, it belongs on the user actor.
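
The split can be illustrated with two small in-memory models. `UserStateModel` and `TaskStateModel` are hypothetical names for this sketch; the real state lives in the user and task actors.

```typescript
// Hypothetical in-memory model; the real state lives in the user and task actors.
interface PerUserSessionState {
  unreadCount: number;
  draftText: string;
}

// Per-user state, keyed by (userId, taskId, sessionId).
class UserStateModel {
  private readonly rows = new Map<string, PerUserSessionState>();

  private key(userId: string, taskId: string, sessionId: string): string {
    return JSON.stringify([userId, taskId, sessionId]);
  }

  get(userId: string, taskId: string, sessionId: string): PerUserSessionState {
    return this.rows.get(this.key(userId, taskId, sessionId)) ?? { unreadCount: 0, draftText: "" };
  }

  setDraft(userId: string, taskId: string, sessionId: string, draftText: string): void {
    const current = this.get(userId, taskId, sessionId);
    this.rows.set(this.key(userId, taskId, sessionId), { ...current, draftText });
  }
}

// Task-global state, shared by every user viewing the task.
class TaskStateModel {
  readonly transcript: string[] = [];

  appendMessage(text: string): void {
    this.transcript.push(text);
  }
}

const userState = new UserStateModel();
const taskState = new TaskStateModel();

userState.setDraft("alice", "task-1", "session-1", "try the other branch");
userState.setDraft("bob", "task-1", "session-1", "looks good to me");
taskState.appendMessage("agent: build passed");

const aliceDraft = userState.get("alice", "task-1", "session-1").draftText;
const bobDraft = userState.get("bob", "task-1", "session-1").draftText;
```

Both users see the same transcript, but each keeps a separate draft for the same session. Storing the draft on the task model would let one user overwrite the other's text.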

Audit Log Maintenance

Every new action or command handler that represents a user-visible or workflow-significant event must append to the audit log actor. The audit log must remain a comprehensive record of significant operations.

Debugging Actors

RivetKit Inspector UI

The RivetKit inspector UI at http://localhost:6420/ui/ is the most reliable way to debug actor state in local development. The inspector HTTP API endpoint /inspector/workflow-history has a known bug: it can return an empty history ({"history":{}}) even when the workflow has entries, so always cross-check with the UI.

Useful inspector URL pattern:

http://localhost:6420/ui/?u=http%3A%2F%2F127.0.0.1%3A6420&ns=default&r=default&n=[%22<actor-name>%22]&actorId=<actor-id>&tab=<tab>

Tabs: workflow, database, state, queue, connections, metadata.
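
A small helper for building these URLs, as a sketch. `inspectorUrl` is a hypothetical function, and the default host and namespace values are copied from the pattern above. `URLSearchParams` percent-encodes the brackets in `n`, which decodes to the same value as the pattern; it is assumed, not verified, that the UI accepts both forms.

```typescript
type InspectorTab = "workflow" | "database" | "state" | "queue" | "connections" | "metadata";

// Builds an inspector UI link for one actor; host and namespace defaults follow the pattern above.
function inspectorUrl(
  actorName: string,
  actorId: string,
  tab: InspectorTab,
  base: string = "http://localhost:6420",
): string {
  const params = new URLSearchParams({
    u: "http://127.0.0.1:6420",
    ns: "default",
    r: "default",
    n: JSON.stringify([actorName]), // the UI expects a JSON array of name segments
    actorId,
    tab,
  });
  return `${base}/ui/?${params.toString()}`;
}

// "abc123" is a placeholder; look up real actor IDs via the /actors endpoint.
const exampleUrl = inspectorUrl("organization", "abc123", "database");
```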

To find actor IDs:

curl -s 'http://127.0.0.1:6420/actors?name=organization'

To query actor DB via bun (inside container):

docker compose -f compose.dev.yaml exec -T backend bun -e '
  const { Database } = require("bun:sqlite"); // Database is a named export of bun:sqlite
  var db = new Database("/root/.local/share/foundry/rivetkit/databases/<actor-id>.db", { readonly: true });
  console.log(JSON.stringify(db.query("SELECT name FROM sqlite_master WHERE type=?").all("table")));
'

To call actor actions via inspector:

curl -s -X POST 'http://127.0.0.1:6420/gateway/<actor-id>/inspector/action/<actionName>' \
  -H 'Content-Type: application/json' -d '{"args":[{}]}'

Known inspector API bugs

  • GET /inspector/workflow-history may return {"history":{}} even when workflow has run. Use the UI's Workflow tab instead.
  • GET /inspector/queue is reliable for checking pending messages.
  • GET /inspector/state is reliable for checking actor state.

Maintenance

  • Keep this file up to date whenever actor ownership, hierarchy, or lifecycle responsibilities change.
  • If the real actor tree diverges from this document, update this document in the same change.
  • When adding, removing, or renaming coordinator index tables, update the hierarchy diagram above in the same change.
  • When adding a new coordinator index table in a schema file, add a doc comment identifying which child actor it indexes (pattern: /** Coordinator index of {ChildActor} instances. ... */).