sandbox-agent/foundry/packages/backend/CLAUDE.md
Nathan Flurry f45a467484
chore(foundry): migrate to actions (#262)
* feat(foundry): checkpoint actor and workspace refactor

* docs(foundry): add agent handoff context

* wip(foundry): continue actor refactor

* wip(foundry): capture remaining local changes

* Complete Foundry refactor checklist

* Fix Foundry validation fallout

* wip

* wip: convert all actors from workflow to plain run handlers

Workaround for RivetKit bug where c.queue.iter() never yields messages
for actors created via getOrCreate from another actor's context. The
queue accepts messages (visible in inspector) but the iterator hangs.
Sleep/wake fixes it, but actors with active connections never sleep.

Converted organization, github-data, task, and user actors from
run: workflow(...) to plain run: async (c) => { for await ... }.

Also fixes:
- Missing auth tables in org migration (auth_verification etc)
- default_model NOT NULL constraint on org profile upsert
- Nested workflow step in github-data (HistoryDivergedError)
- Removed --force from frontend Dockerfile pnpm install

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Convert all actors from queues/workflows to direct actions, lazy task creation

Major refactor replacing all queue-based workflow communication with direct
RivetKit action calls across all actors. This works around a RivetKit bug
where c.queue.iter() deadlocks for actors created from another actor's context.

Key changes:
- All actors (organization, task, user, audit-log, github-data) converted
  from run: workflow(...) to actions-only (no run handler, no queues)
- PR sync creates virtual task entries in org local DB instead of spawning
  task actors — prevents OOM from 200+ actors created simultaneously
- Task actors created lazily on first user interaction via getOrCreate,
  self-initialize from org's getTaskIndexEntry data
- Removed requireRepoExists cross-actor call (caused 500s), replaced with
  local resolveTaskRepoId from org's taskIndex table
- Fixed getOrganizationContext to thread overrides through all sync phases
- Fixed sandbox repo path (/home/user/repo for E2B compatibility)
- Fixed buildSessionDetail to skip transcript fetch for pending sessions
- Added process crash protection (uncaughtException/unhandledRejection)
- Fixed React infinite render loop in mock-layout useEffect dependencies
- Added sandbox listProcesses error handling for expired E2B sandboxes
- Set E2B sandbox timeout to 1 hour (was 5 min default)
- Updated CLAUDE.md with lazy task creation rules, no-silent-catch policy,
  React hook dependency safety rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix E2B sandbox timeout comment, frontend stability, and create-flow improvements

- Add TEMPORARY comment on E2B timeoutMs with pointer to rivetkit sandbox
  resilience proposal for when autoPause lands
- Fix React useEffect dependency stability in mock-layout and
  organization-dashboard to prevent infinite re-render loops
- Fix terminal-pane ref handling
- Improve create-flow service and tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 15:23:59 -07:00


# Backend Notes
## Actor Hierarchy
Keep the backend actor tree aligned with this shape unless we explicitly decide to change it:
```text
OrganizationActor (direct coordinator for tasks)
├─ AuditLogActor (organization-scoped global feed)
├─ GithubDataActor
├─ TaskActor(task)
│ ├─ taskSessions → session metadata/transcripts
│ └─ taskSandboxes → sandbox instance index
└─ SandboxInstanceActor(sandboxProviderId, sandboxId) × N
```
## Coordinator Pattern
Actors follow a coordinator pattern where each coordinator is responsible for:
1. **Index tables** — keeping a local SQLite index/summary of its child actors' data
2. **Create/destroy** — handling lifecycle of child actors
3. **Routing** — resolving lookups to the correct child actor
Children push updates **up** to their direct coordinator only. Coordinators broadcast changes to connected clients. This keeps the read path local (no fan-out to children).
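The push-up/read-local flow described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `Coordinator`, `Child`, `applyChildUpdate`, and `listSummaries` are hypothetical names, the `Map` stands in for a SQLite index table, and nothing here uses the RivetKit API.

```typescript
// Hypothetical in-memory model of the coordinator pattern; not RivetKit code.
interface TaskSummary {
  taskId: string;
  title: string;
  status: "idle" | "running";
}

class Coordinator {
  // Local index of child state, standing in for a SQLite index table.
  private readonly index = new Map<string, TaskSummary>();
  // Connected clients, modeled as plain callbacks.
  private readonly listeners = new Set<(summary: TaskSummary) => void>();

  subscribe(listener: (summary: TaskSummary) => void): void {
    this.listeners.add(listener);
  }

  // Called by a child whenever its state changes (the "push up" path).
  applyChildUpdate(summary: TaskSummary): void {
    this.index.set(summary.taskId, summary);
    this.listeners.forEach((listener) => listener(summary));
  }

  // Read path: served entirely from the local index, with no calls to children.
  listSummaries(): TaskSummary[] {
    return Array.from(this.index.values());
  }
}

class Child {
  private readonly coordinator: Coordinator;
  private summary: TaskSummary;

  constructor(coordinator: Coordinator, summary: TaskSummary) {
    this.coordinator = coordinator;
    this.summary = summary;
  }

  // Every state change is pushed to the direct coordinator only.
  setStatus(status: TaskSummary["status"]): void {
    this.summary = { ...this.summary, status };
    this.coordinator.applyChildUpdate(this.summary);
  }
}
```

In this sketch, a sidebar read never touches a `Child`: it only calls `listSummaries()`, which is the property the real index tables are meant to preserve.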
### Coordinator hierarchy and index tables
```text
OrganizationActor (coordinator for tasks + auth users)
│ Index tables:
│ ├─ taskIndex → TaskActor index (taskId → repoId + branchName)
│ ├─ taskSummaries → TaskActor materialized sidebar projection
│ ├─ authSessionIndex → UserActor index (session token → userId)
│ ├─ authEmailIndex → UserActor index (email → userId)
│ └─ authAccountIndex → UserActor index (OAuth account → userId)
├─ TaskActor (coordinator for sessions + sandboxes)
│ │
│ │ Index tables:
│ │ ├─ taskWorkspaceSessions → Session index (session metadata + transcript)
│ │ └─ taskSandboxes → SandboxInstanceActor index (sandbox history)
│ │
│ └─ SandboxInstanceActor (leaf)
├─ AuditLogActor (organization-scoped audit log, not a coordinator)
└─ GithubDataActor (GitHub API cache, not a coordinator)
```
When adding a new index table, annotate it in the schema file with a doc comment identifying it as a coordinator index and which child actor it indexes (see existing examples).
## Lazy Task Actor Creation — CRITICAL
**Task actors must NEVER be created during GitHub sync or bulk operations.** Creating hundreds of task actors simultaneously causes OOM crashes. An org can have 200+ PRs; spawning an actor per PR kills the process.
### The two creation points
There are two entry points that create a task actor on demand (the complete list of permitted `getOrCreateTask` call sites is under "The rule" below):
1. **`createTaskMutation`** in `task-mutations.ts`: the backend path for explicit task creation through `getOrCreateTask`. Triggered by an explicit user action (the "New Task" button). One actor at a time.
2. **`backend-client.ts` client helper** — calls `client.task.getOrCreate(...)`. This is the lazy materialization point: when a user clicks a virtual task in the sidebar, the client creates the actor, and it self-initializes in `getCurrentRecord()` (`workflow/common.ts`) by reading branch/title from the org's `getTaskIndexEntry` action.
### The rule
**Never use `getOrCreateTask` inside a sync loop, webhook handler, or any bulk operation.** That's what caused the OOM — 186 actors spawned simultaneously during PR sync.
`getOrCreateTask` IS allowed in:
- `createTaskMutation` — explicit user "New Task" action
- `requireWorkspaceTask` — user-initiated actions (createSession, sendMessage, etc.) that may hit a virtual task
- `getTask` action on the org — called by sandbox actor and client, needs to materialize virtual tasks
- `backend-client.ts` client helper — lazy materialization when user views a task
### Virtual tasks (PR-driven)
During PR sync, `refreshTaskSummaryForBranchMutation` is called for every changed PR (via github-data's `emitPullRequestChangeEvents`). It writes **virtual task entries** to the org actor's local `taskIndex` + `taskSummaries` tables only. No task actor is spawned. No cross-actor calls to task actors.
When the user interacts with a virtual task (clicks it, creates a session):
1. Client or org actor calls `getOrCreate` on the task actor key → actor is created with empty DB
2. Any action on the actor calls `getCurrentRecord()` → sees empty DB → reads branch/title from org's `getTaskIndexEntry` → calls `initBootstrapDbActivity` + `initCompleteActivity` → task is now real
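The virtual-task flow above can be modeled in a few lines. The sketch below is an in-memory approximation, not the real backend: `OrgModel`, `TaskModel`, and `liveTasks` are hypothetical names, the method names only mirror the functions mentioned in the text, and the `Map`s stand in for the org's `taskIndex` table and the actor registry.

```typescript
// Hypothetical in-memory model; names mirror the text but this is not the real backend code.
interface TaskIndexEntry {
  taskId: string;
  branchName: string;
  title: string;
}

class TaskModel {
  private record: TaskIndexEntry | null = null; // empty "DB" until first use
  private readonly taskId: string;
  private readonly org: OrgModel;

  constructor(taskId: string, org: OrgModel) {
    this.taskId = taskId;
    this.org = org;
  }

  // Self-initializes from the org index the first time any action needs the record.
  getCurrentRecord(): TaskIndexEntry {
    if (this.record === null) {
      const entry = this.org.getTaskIndexEntry(this.taskId);
      if (!entry) throw new Error(`No task index entry for ${this.taskId}`);
      this.record = { ...entry };
    }
    return this.record;
  }
}

class OrgModel {
  readonly taskIndex = new Map<string, TaskIndexEntry>();
  readonly liveTasks = new Map<string, TaskModel>();

  // Bulk sync path: writes virtual entries to the local index only.
  refreshTaskSummaryForBranch(entry: TaskIndexEntry): void {
    this.taskIndex.set(entry.taskId, entry);
  }

  getTaskIndexEntry(taskId: string): TaskIndexEntry | undefined {
    return this.taskIndex.get(taskId);
  }

  // User-initiated path: create the actor on demand, one at a time.
  getOrCreateTask(taskId: string): TaskModel {
    let task = this.liveTasks.get(taskId);
    if (!task) {
      task = new TaskModel(taskId, this);
      this.liveTasks.set(taskId, task);
    }
    return task;
  }
}
```

Syncing hundreds of PRs through `refreshTaskSummaryForBranch` leaves `liveTasks` empty in this model; only a later `getOrCreateTask(...)` followed by `getCurrentRecord()` turns one virtual entry into a live task.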
### Call sites to watch
- `refreshTaskSummaryForBranchMutation` — called in bulk during sync. Must ONLY write to org local tables. Never create task actors or call task actor actions.
- `emitPullRequestChangeEvents` in github-data — iterates all changed PRs. Must remain fire-and-forget with no actor fan-out.
## Ownership Rules
- `OrganizationActor` is the organization coordinator, direct coordinator for tasks, and lookup/index owner. It owns the task index, task summaries, and repo catalog.
- `AuditLogActor` is organization-scoped. There is one organization-level audit log feed.
- Each `TaskActor` corresponds to exactly one branch. Treat `1 task = 1 branch` once branch assignment is finalized.
- `TaskActor` can have many sessions.
- `TaskActor` can reference many sandbox instances historically, but should have only one active sandbox/session at a time.
- Session unread state and draft prompts are backend-owned workspace state, not frontend-local state.
- Branch names are immutable after task creation. Do not implement branch-rename flows.
- `SandboxInstanceActor` stays separate from `TaskActor`; tasks/sessions reference it by identity.
- The backend stores no local git state. No clones, no refs, no working trees, and no git-spice. Repository metadata comes from GitHub API data and webhook events. Any working-tree git operation runs inside a sandbox via `executeInSandbox()`.
- When a backend request path must aggregate multiple independent actor calls or reads, prefer bounded parallelism over sequential fan-out when correctness permits. Do not serialize independent work by default.
- Only a coordinator creates/destroys its children. Do not create child actors from outside the coordinator.
- Children push state changes up to their direct coordinator only. Task actors push summary updates directly to the organization actor.
- Read paths must use the coordinator's local index tables. Do not fan out to child actors on the hot read path.
- Never build "enriched" read actions that chain through multiple actors (e.g., coordinator → child actor → sibling actor). If data from multiple actors is needed for a read, it should already be materialized in the coordinator's index tables via push updates. If it's not there, fix the write path to push it — do not add a fan-out read path.
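The bounded-parallelism rule in the list above can be illustrated with a small helper. This is a minimal sketch, assuming a hypothetical `mapWithConcurrency` utility that is not part of the codebase or of RivetKit; the worker would be whatever independent actor call or read the request path needs.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // shared cursor; safe because JavaScript runs lanes on one thread

  // Each lane pulls the next unclaimed index until the input is exhausted.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

A caller might use it as `await mapWithConcurrency(taskIds, 4, (id) => loadSomething(id))`, where `loadSomething` is a placeholder for an independent read. The limit keeps a large input from turning into an unbounded fan-out while still avoiding fully sequential execution.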
## Drizzle Migration Maintenance
After changing any actor's `db/schema.ts`, you **must** regenerate the corresponding migration so the runtime creates the tables that match the schema. Forgetting this step causes `no such table` errors at runtime.
1. **Generate a new drizzle migration.** Run from `packages/backend`:
```bash
npx drizzle-kit generate --config=./src/actors/<actor>/db/drizzle.config.ts
```
If the interactive prompt is unavailable (e.g. in a non-TTY), manually create a new `.sql` file under `./src/actors/<actor>/db/drizzle/` and add the corresponding entry to `meta/_journal.json`.
2. **Regenerate the compiled `migrations.ts`.** Run from the foundry root:
```bash
npx tsx packages/backend/src/actors/_scripts/generate-actor-migrations.ts
```
3. **Verify insert/upsert calls.** Every column with `.notNull()` (and no `.default(...)`) must be provided a value in all `insert()` and `onConflictDoUpdate()` calls. Missing a NOT NULL column causes a runtime constraint violation, not a type error.
4. **Nuke RivetKit state in dev** after migration changes to start fresh:
```bash
docker compose -f compose.dev.yaml down
docker volume rm foundry_foundry_rivetkit_storage
docker compose -f compose.dev.yaml up -d
```
Actors with drizzle migrations: `organization`, `audit-log`, `task`. Other actors (`user`, `github-data`) use inline migrations without drizzle.
## Workflow Step Nesting — FORBIDDEN
**Never call `c.step()` / `ctx.step()` from inside another step's `run` callback.** RivetKit workflow steps cannot be nested. Doing so causes the runtime error: *"Cannot start a new workflow entry while another is in progress."*
This means:
- Functions called from within a step `run` callback must NOT use `c.step()`, `c.loop()`, `c.sleep()`, or `c.queue.next()`.
- If a mutation function needs to be called both from a step and standalone, it must only do plain DB/API work — no workflow primitives. The workflow step wrapping belongs in the workflow file, not in the mutation.
- Helper wrappers that conditionally call `c.step()` (like a `runSyncStep` pattern) are dangerous — if the caller is already inside a step, the nested `c.step()` will crash at runtime with no compile-time warning.
**Rule of thumb:** Workflow primitives (`step`, `loop`, `sleep`, `queue.next`) may only appear at the top level of a workflow function or inside a `loop` callback — never inside a step's `run`.
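The difference between the safe and unsafe shapes can be shown with a mock. The `MockWorkflowContext` below is a stand-in written for this example and is an assumption throughout: RivetKit's real `c.step()` signature and internals may differ, and `upsertProfile`, `goodWorkflow`, and `badWorkflow` are hypothetical names. The mock only reproduces the nesting failure described above.

```typescript
// Mock workflow context for illustration only; RivetKit's real `c.step()` may differ.
class MockWorkflowContext {
  readonly history: string[] = [];
  private inStep = false;

  async step<T>(name: string, run: () => Promise<T>): Promise<T> {
    if (this.inStep) {
      throw new Error("Cannot start a new workflow entry while another is in progress.");
    }
    this.history.push(name);
    this.inStep = true;
    try {
      return await run();
    } finally {
      this.inStep = false;
    }
  }
}

// Plain mutation: DB/API work only, no workflow primitives, safe to call from anywhere.
async function upsertProfile(db: Map<string, string>, value: string): Promise<void> {
  db.set("profile", value);
}

// Correct: the step wrapper lives at the top level of the workflow function.
async function goodWorkflow(c: MockWorkflowContext, db: Map<string, string>): Promise<void> {
  await c.step("upsert-profile", () => upsertProfile(db, "acme"));
}

// Incorrect: a step is opened from inside another step's run callback.
async function badWorkflow(c: MockWorkflowContext, db: Map<string, string>): Promise<void> {
  await c.step("outer", async () => {
    await c.step("inner", () => upsertProfile(db, "acme")); // rejects at runtime
  });
}
```

Nothing in the types distinguishes `badWorkflow` from `goodWorkflow`, which is why the rule has to be enforced by convention: mutations stay plain, and step wrapping stays in the workflow file.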
## SQLite Constraints
- Single-row tables must use an integer primary key with `CHECK (id = 1)` to enforce the singleton invariant at the database level.
- Follow the task actor pattern for metadata/profile rows and keep the fixed row id in code as `1`, not a string sentinel.
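As a rough illustration of the singleton constraint, the sketch below builds the SQL such a table and its upsert might use. The helper names and the `org_profile` example table are hypothetical; the real schemas are defined through drizzle, so treat this as a picture of the resulting SQL rather than of the project's code.

```typescript
// Hypothetical helpers: build SQL for a single-row table guarded by CHECK (id = 1).
const SINGLETON_ROW_ID = 1; // fixed row id kept in code as a number, not a string sentinel

function singletonTableDdl(tableName: string, columns: readonly string[]): string {
  // The CHECK constraint makes the database reject any second row.
  const lines = ["id INTEGER PRIMARY KEY CHECK (id = 1)"].concat(columns);
  return `CREATE TABLE IF NOT EXISTS ${tableName} (\n  ${lines.join(",\n  ")}\n);`;
}

function singletonUpsertSql(tableName: string, column: string): string {
  // Always targets the fixed row id, so repeated writes update the same row.
  return (
    `INSERT INTO ${tableName} (id, ${column}) VALUES (${SINGLETON_ROW_ID}, ?) ` +
    `ON CONFLICT(id) DO UPDATE SET ${column} = excluded.${column};`
  );
}
```

For example, `singletonTableDdl("org_profile", ["default_model TEXT NOT NULL"])` yields a table where an insert with `id = 2` fails the `CHECK`, so the invariant holds even if application code has a bug.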
## Multiplayer Correctness
Per-user UI state must live on the user actor, not on shared task/session actors. This is critical for multiplayer — multiple users may view the same task simultaneously with different active sessions, unread states, and in-progress drafts.
**Per-user state (user actor):** active session tab, unread counts, draft text, draft attachments. Keyed by `(userId, taskId, sessionId)`.
**Task-global state (task actor):** session transcript, session model, session runtime status, sandbox identity, task status, branch name, PR state. These are shared across all users viewing the task — that is correct behavior.
Do not store per-user preferences, selections, or ephemeral UI state on shared actors. If a field's value should differ between two users looking at the same task, it belongs on the user actor.
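The split can be pictured with a small store keyed by `(userId, taskId, sessionId)`. This is a sketch under assumptions: `UserStateStore` and its methods are hypothetical, and an in-memory `Map` stands in for whatever storage the user actor actually uses.

```typescript
// Hypothetical model of per-user session state; not the real user actor.
interface SessionKey {
  userId: string;
  taskId: string;
  sessionId: string;
}

interface PerUserSessionState {
  unreadCount: number;
  draftText: string;
}

class UserStateStore {
  private readonly rows = new Map<string, PerUserSessionState>();

  private keyOf(key: SessionKey): string {
    // JSON array encoding avoids collisions that naive string joins can produce.
    return JSON.stringify([key.userId, key.taskId, key.sessionId]);
  }

  // Returns the stored state, or an empty default for a key never written.
  get(key: SessionKey): PerUserSessionState {
    return this.rows.get(this.keyOf(key)) ?? { unreadCount: 0, draftText: "" };
  }

  setDraft(key: SessionKey, draftText: string): void {
    this.rows.set(this.keyOf(key), { ...this.get(key), draftText });
  }

  incrementUnread(key: SessionKey): void {
    const current = this.get(key);
    this.rows.set(this.keyOf(key), { ...current, unreadCount: current.unreadCount + 1 });
  }
}
```

Two users looking at the same task and session get independent drafts and unread counts in this model, while anything they should both see (transcript, status) would stay on the shared task state.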
## Audit Log Maintenance
Every new action or command handler that represents a user-visible or workflow-significant event must append to the audit log actor. The audit log must remain a comprehensive record of significant operations.
## Debugging Actors
### RivetKit Inspector UI
The RivetKit inspector UI at `http://localhost:6420/ui/` is the most reliable way to debug actor state in local development. The inspector HTTP API endpoint `/inspector/workflow-history` has a known bug: it can return an empty history (`{"history":{}}`) even when the workflow has entries, so always cross-check with the UI.
**Useful inspector URL pattern:**
```text
http://localhost:6420/ui/?u=http%3A%2F%2F127.0.0.1%3A6420&ns=default&r=default&n=[%22<actor-name>%22]&actorId=<actor-id>&tab=<tab>
```
Tabs: `workflow`, `database`, `state`, `queue`, `connections`, `metadata`.
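When jumping between actors often, the URL pattern above can be wrapped in a small helper. This is a convenience sketch only: `buildInspectorUrl` is a hypothetical function, and the host, `ns=default`, and `r=default` values are copied from the pattern above rather than from any documented inspector API.

```typescript
// Hypothetical helper that fills in the inspector URL pattern shown above.
type InspectorTab = "workflow" | "database" | "state" | "queue" | "connections" | "metadata";

function buildInspectorUrl(actorName: string, actorId: string, tab: InspectorTab): string {
  const uiBase = "http://localhost:6420/ui/";
  // The `u` parameter carries the percent-encoded manager endpoint.
  const endpoint = encodeURIComponent("http://127.0.0.1:6420");
  // The `n` parameter is a one-element JSON-style array with encoded quotes.
  const nameParam = `[%22${encodeURIComponent(actorName)}%22]`;
  return (
    `${uiBase}?u=${endpoint}&ns=default&r=default` +
    `&n=${nameParam}&actorId=${encodeURIComponent(actorId)}&tab=${tab}`
  );
}
```

Calling `buildInspectorUrl("organization", "<actor-id>", "database")` with an ID returned by the `/actors?name=...` lookup produces a link that opens directly on the Database tab.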
**To find actor IDs:**
```bash
curl -s 'http://127.0.0.1:6420/actors?name=organization'
```
**To query actor DB via bun (inside container):**
```bash
docker compose -f compose.dev.yaml exec -T backend bun -e '
var { Database } = require("bun:sqlite");
var db = new Database("/root/.local/share/foundry/rivetkit/databases/<actor-id>.db", { readonly: true });
console.log(JSON.stringify(db.query("SELECT name FROM sqlite_master WHERE type=?").all("table")));
'
```
**To call actor actions via inspector:**
```bash
curl -s -X POST 'http://127.0.0.1:6420/gateway/<actor-id>/inspector/action/<actionName>' \
-H 'Content-Type: application/json' -d '{"args":[{}]}'
```
### Known inspector API bugs
- `GET /inspector/workflow-history` may return `{"history":{}}` even when the workflow has run. Use the UI's Workflow tab instead.
- `GET /inspector/queue` is reliable for checking pending messages.
- `GET /inspector/state` is reliable for checking actor state.
## Maintenance
- Keep this file up to date whenever actor ownership, hierarchy, or lifecycle responsibilities change.
- If the real actor tree diverges from this document, update this document in the same change.
- When adding, removing, or renaming coordinator index tables, update the hierarchy diagram above in the same change.
- When adding a new coordinator index table in a schema file, add a doc comment identifying which child actor it indexes (pattern: `/** Coordinator index of {ChildActor} instances. ... */`).