sandbox-agent/foundry/research/specs/async-action-fixes/00-end-to-end-async-realtime-plan.md
2026-03-14 20:42:18 -07:00

# End-To-End Async + Realtime Plan
## Purpose
This is the umbrella plan for the Foundry issues we traced across app shell, workbench, and actor runtime behavior:
- long-running work still sits inline in request/action paths
- monolithic snapshot reads fan out across too many actors
- the client uses polling and full refreshes where it should use realtime subscriptions
- websocket subscriptions reconnect too aggressively
- actor shutdown can race in-flight actions and clear `c.db` underneath them
The goal is not just to make individual endpoints faster. The goal is to move Foundry to a model where:
- request paths only validate, create minimal state, and enqueue background work
- list views read actor-owned projections instead of recomputing deep state
- detail views connect directly to the actor that owns the visible state
- polling is replaced by actor events and bounded bootstrap fetches
- actor shutdown drains active work before cleaning up resources
## Problem Summary
### App shell
- `getAppSnapshot` still rebuilds app shell state by reading the app session row and fanning out to every eligible organization actor.
- `RemoteFoundryAppStore` still polls every `500ms` while any org is `syncing`.
- Org sync/import is now off the select path, but the steady-state read path is still snapshot-based instead of subscription-based.
### Workbench
- `getWorkbench` still represents a monolithic organization read that aggregates repo, repository, and task state.
- The remote workbench store still responds to every event by pulling a full fresh snapshot.
- Some task/workbench detail is still too expensive to compute inline and too broad to refresh after every mutation.
### Realtime transport
- `subscribeWorkbench` and related connection helpers keep one connection per shared key, but the client contract still treats the socket as an invalidation channel for a later snapshot pull.
- Reconnect/error handling is weak, so connection churn amplifies backend load instead of settling into long-lived subscriptions.
### Runtime
- RivetKit currently lets shutdown proceed far enough to clean up actor resources while actions are still in flight or still being routed to the actor.
- That creates the `Database not enabled` / missing `c.db` failure mode under stop/replay pressure.
## Target Architecture
### Request-path rule
Every request/action should do only one of these:
1. return actor-owned cached state
2. persist a cheap mutation
3. enqueue or signal background work
Requests should not block on provider calls, repo sync, sandbox provisioning, transcript enumeration, or deep cross-actor fan-out unless the UI cannot render at all without the result.
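The request-path rule can be sketched as a handler that only validates, writes a minimal row, and enqueues. This is a hypothetical shape, not Foundry's real API: `Deps`, `insertTask`, and the `bootstrapTask` job kind are illustrative names.

```typescript
// Hypothetical handler following the request-path rule; Deps, insertTask,
// and the "bootstrapTask" job kind are illustrative, not Foundry's real APIs.
type TaskRow = { id: string; title: string; status: "queued" };

interface Deps {
  nextId(): string;                                     // id allocation
  insertTask(row: TaskRow): void;                       // cheap durable mutation
  enqueue(job: { kind: string; taskId: string }): void; // background work
}

function createTask(deps: Deps, title: string): TaskRow {
  if (!title.trim()) throw new Error("title required");    // 1. validate only
  const row: TaskRow = { id: deps.nextId(), title, status: "queued" };
  deps.insertTask(row);                                    // 2. minimal state
  deps.enqueue({ kind: "bootstrapTask", taskId: row.id }); // 3. no inline provisioning
  return row; // caller can render "queued" immediately
}
```

The key property is that nothing in `createTask` awaits a provider call or sandbox: the returned row is enough for the UI to render, and everything else arrives later via events.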
### View-model rule
- App shell view connects to app/session state and only the org actors visible on screen.
- Organization/task-list view connects to an organization-owned summary projection.
- Task detail view connects directly to the selected task actor.
- Sandbox/session detail connects only when the user opens that detail.
Do not replace one monolith with one connection per row. List screens should still come from actor-owned projections.
### Runtime rule
Stopping actors must stop accepting new work and must not clear actor resources until active actions and requests have drained or been cancelled.
## Workstreams
### 1. Runtime hardening first
This is the only workstream that is not Foundry-only. It should start immediately because it is the only direct fix for the `c.db` shutdown race.
#### Changes
1. Add active action/request accounting in RivetKit actor instances.
2. Mark actors as draining before cleanup starts.
3. Reject or reroute new requests/actions once draining begins.
4. Wait for active actions to finish or abort before `#cleanupDatabase()` runs.
5. Delay clearing `#db` until no active actions remain.
6. Add actor stop logs with:
- actor id
- active action count
- active request count
- drain start/end timestamps
- cleanup start/end timestamps
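The accounting in steps 1-5 can be sketched as a small drain-aware wrapper. This is an illustrative model, not RivetKit's real internals: `ActorRuntime` and its method names are assumptions standing in for the actual instance code.

```typescript
// Hypothetical drain-aware accounting; ActorRuntime and its method names
// are illustrative, not RivetKit's real internals.
class ActorRuntime {
  private activeActions = 0;
  private draining = false;
  private drainWaiters: Array<() => void> = [];

  // Wrap every action entry point; reject once draining has begun.
  async runAction<T>(fn: () => Promise<T>): Promise<T> {
    if (this.draining) throw new Error("actor is draining");
    this.activeActions++;
    try {
      return await fn();
    } finally {
      this.activeActions--;
      if (this.activeActions === 0) {
        this.drainWaiters.forEach((resolve) => resolve());
        this.drainWaiters = [];
      }
    }
  }

  // Stop accepting work, wait for in-flight actions, then run cleanup
  // (the equivalent of #cleanupDatabase) only after the drain completes.
  async stop(cleanup: () => Promise<void>): Promise<void> {
    this.draining = true;
    if (this.activeActions > 0) {
      await new Promise<void>((resolve) => this.drainWaiters.push(resolve));
    }
    await cleanup();
  }
}
```

Because cleanup is sequenced strictly after the drain, no in-flight action can observe a cleared `#db`, and no new action can enter user code once `stop` has begun.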
#### Acceptance criteria
- No action can successfully enter user code after actor draining begins.
- `Database not enabled` cannot be produced by an in-flight action after stop has begun.
- Stop logs make it obvious whether shutdown delay is run-handler time, active-action drain time, background promise time, or routing delay.
### 2. App shell moves from snapshot polling to subscriptions
The app shell should stop using `/app/snapshot` as the steady-state read model.
#### Changes
1. Introduce a small app-shell projection owned by the app organization actor:
- auth status
- current user summary
- active org id
- visible org ids
- per-org lightweight status summary
2. Add app actor events, for example:
- `appSessionUpdated`
- `activeOrganizationChanged`
- `organizationSyncStatusChanged`
3. Expose connection helpers from the backend client for:
- app actor subscription
- organization actor subscription by id
4. Update `RemoteFoundryAppStore` so it:
- does one bootstrap fetch on first subscribe
- connects to the app actor for ongoing updates
- connects only to the org actors needed for the current view
- disposes org subscriptions when they are no longer visible
5. Remove `scheduleSyncPollingIfNeeded()` and the `500ms` refresh loop.
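The bootstrap-then-subscribe shape in steps 1-5 could look roughly like this. The projection fields and event names follow this plan, but `AppActorConnection` is a hypothetical transport interface, not the real backend client.

```typescript
// Sketch of the bootstrap-then-subscribe pattern; the projection fields and
// event names follow the plan, but AppActorConnection is a hypothetical
// transport interface, not the real backend client.
type AppShellProjection = {
  authStatus: "authed" | "anonymous";
  activeOrgId: string | null;
  visibleOrgIds: string[];
  orgSyncState: Record<string, "idle" | "syncing" | "error">;
};

type AppEvent = "appSessionUpdated" | "organizationSyncStatusChanged";

interface AppActorConnection {
  on(event: AppEvent, handler: (patch: Partial<AppShellProjection>) => void): void;
}

class AppShellStore {
  private state: AppShellProjection | null = null;

  // One bootstrap fetch, then actor events keep the state current;
  // there is no polling loop to schedule or tear down.
  async start(
    bootstrap: () => Promise<AppShellProjection>,
    conn: AppActorConnection,
  ): Promise<void> {
    this.state = await bootstrap();
    conn.on("appSessionUpdated", (patch) => this.apply(patch));
    conn.on("organizationSyncStatusChanged", (patch) => this.apply(patch));
  }

  private apply(patch: Partial<AppShellProjection>): void {
    if (this.state) this.state = { ...this.state, ...patch };
  }

  snapshot(): AppShellProjection | null { return this.state; }
}
```

Per-org actor subscriptions would layer on top of this, acquired only for `visibleOrgIds` and disposed when an org leaves the viewport.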
#### Likely files
- `foundry/packages/backend/src/actors/organization/app-shell.ts`
- `foundry/packages/client/src/backend-client.ts`
- `foundry/packages/client/src/remote/app-client.ts`
- `foundry/packages/shared/src/app-shell.ts`
- app shell frontend consumers
#### Acceptance criteria
- No app shell polling loop remains.
- Selecting an org returns quickly and the UI updates from actor events.
- App shell refresh cost is bounded by visible state, not every eligible organization on every poll.
### 3. Organization summary becomes a projection, not a full snapshot
The task list should read an organization-owned summary projection instead of calling into every task actor on each refresh.
#### Changes
1. Define a durable organization summary model with only list-screen fields:
- repo summary
- repository summary
- task summary
- selected/open task ids
- unread/session status summary
- coarse git/PR state summary
2. Update organization actor workflows so task/repository changes incrementally update this projection.
3. Change `getWorkbench` to return the projection only.
4. Change `workbenchUpdated` from "invalidate and refetch everything" to "here is the updated projection version or changed entity ids".
5. Remove task-actor fan-out from the default list read path.
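The payload-bearing `workbenchUpdated` contract from step 4 could be modeled as a discriminated union. The field names and `OrgSummary` shape are assumptions for illustration, not the current wire format.

```typescript
// Sketch of a payload-bearing workbenchUpdated contract; the field names
// and OrgSummary shape are assumptions, not the current wire format.
type OrgSummary = { version: number; taskCount: number; openTaskIds: string[] };

type WorkbenchUpdated =
  | { kind: "projection"; version: number; summary: OrgSummary }
  | { kind: "entities"; version: number; changedTaskIds: string[] };

// The client applies an update in place instead of refetching the snapshot;
// version checks drop stale or out-of-order events.
function applyUpdate(current: OrgSummary, ev: WorkbenchUpdated): OrgSummary {
  if (ev.version <= current.version) return current; // stale event
  if (ev.kind === "projection") return ev.summary;   // full projection payload
  // Entity-level update: only ev.changedTaskIds need a targeted re-read.
  return { ...current, version: ev.version };
}
```

Carrying a monotonic version in every event also gives the client a cheap way to detect gaps after reconnect and fall back to a single bootstrap read.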
#### Likely files
- `foundry/packages/backend/src/actors/organization/actions.ts`
- `foundry/packages/backend/src/actors/repository/actions.ts`
- `foundry/packages/backend/src/actors/task/index.ts`
- `foundry/packages/backend/src/actors/task/workbench.ts`
- task/organization DB schema and migrations
- `foundry/packages/client/src/remote/workbench-client.ts`
#### Acceptance criteria
- Workbench list refresh does not call every task actor.
- A websocket event does not force a full cross-actor rebuild.
- Initial task-list load time scales roughly with organization summary size, not repo count times task count times detail reads.
### 4. Task detail moves to direct actor reads and events
Heavy task detail should move out of the organization summary and into the selected task actor.
#### Changes
1. Split task detail into focused reads/subscriptions:
- task header/meta
- tabs/session summary
- transcript stream
- diff/file tree
- sandbox process state
2. Open a task actor connection only for the selected task.
3. Open sandbox/session subscriptions only for the active tab/pane.
4. Dispose those subscriptions when the user changes selection.
5. Keep expensive derived state cached in actor-owned tables and update it from background jobs or event ingestion.
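Steps 2-4 reduce to one invariant: exactly the selected task holds an open connection. A minimal sketch, where `connectTask` is a hypothetical helper standing in for the real task-actor connection:

```typescript
// Minimal sketch of selection-scoped subscriptions; connectTask is a
// hypothetical helper standing in for the real task-actor connection.
type Disposable = { dispose(): void };

class TaskDetailController {
  private current: { taskId: string; sub: Disposable } | null = null;

  constructor(private connectTask: (taskId: string) => Disposable) {}

  // Only the selected task actor holds an open connection.
  select(taskId: string | null): void {
    if (this.current?.taskId === taskId) return; // same selection: no churn
    this.current?.sub.dispose();                 // drop the previous task's socket
    this.current = taskId
      ? { taskId, sub: this.connectTask(taskId) }
      : null;
  }
}
```

The same shape applies one level down for sandbox/session subscriptions keyed on the active tab or pane.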
#### Acceptance criteria
- Opening the task list does not open connections to every task actor.
- Opening a task shows staged loading for heavy panes instead of blocking the whole workbench snapshot.
- Transcript, diff, and file-tree reads are not recomputed for unrelated tasks.
### 5. Finish moving long-running mutations to background workflows
This extends and completes the existing async-action briefs in this folder.
#### Existing briefs to implement under this workstream
1. `01-task-creation-bootstrap-only.md`
2. `02-repo-overview-from-cached-projection.md`
3. `03-repo-actions-via-background-workflow.md`
4. `04-workbench-session-creation-without-inline-provisioning.md`
5. `05-workbench-snapshot-from-derived-state.md`
6. `06-daytona-provisioning-staged-background-flow.md`
#### Additional rule
Every workflow-backed mutation should leave behind durable status rows or events that realtime clients can observe without polling.
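The durable-status rule above can be sketched as a status row each workflow step writes and realtime clients observe. The table shape and `record` API are assumptions for illustration, not Foundry's actual schema.

```typescript
// Sketch of a durable workflow status row; the row shape and API are
// assumptions for illustration, not Foundry's actual schema.
type WorkflowStatus = {
  workflowId: string;
  step: string;
  state: "running" | "done" | "failed";
  updatedAt: number;
};

class StatusStore {
  private rows = new Map<string, WorkflowStatus>();
  private listeners: Array<(row: WorkflowStatus) => void> = [];

  // Each workflow step writes a row; subscribed realtime clients see the
  // transition without polling, and the row survives for late readers.
  record(workflowId: string, step: string, state: WorkflowStatus["state"]): void {
    const row = { workflowId, step, state, updatedAt: Date.now() };
    this.rows.set(workflowId, row);
    this.listeners.forEach((l) => l(row));
  }

  subscribe(l: (row: WorkflowStatus) => void): void { this.listeners.push(l); }
  get(workflowId: string): WorkflowStatus | undefined { return this.rows.get(workflowId); }
}
```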
### 6. Subscription lifecycle and reconnect behavior need one shared model
The current client-side connection pattern is too ad hoc. It needs a single lifecycle policy so sockets are long-lived and bounded.
#### Changes
1. Create one shared subscription manager in the client for:
- reference counting
- connection reuse
- reconnect backoff
- connection state events
- clean disposal
2. Make invalidation optional. Prefer payload-bearing events or projection version updates.
3. Add structured logs/metrics in the client for:
- connection created/disposed
- reconnect attempts
- subscription count per actor key
- refresh triggered by event vs bootstrap vs mutation
4. Stop calling full `refresh()` after every mutation when the mutation result or follow-up event already contains enough state to update locally.
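The reference counting, reuse, and backoff pieces of step 1 fit in one small manager. This is a sketch of the shared lifecycle policy, not the real client code; the `Conn` shape and `connect` callback are hypothetical stand-ins for the websocket helpers.

```typescript
// Sketch of the shared subscription manager; the Conn shape and connect
// callback are hypothetical stand-ins for the real websocket helpers.
type Conn = { key: string; close(): void };

class SubscriptionManager {
  private conns = new Map<string, { conn: Conn; refs: number }>();

  constructor(private connect: (key: string) => Conn) {}

  // Reuse one connection per actor key; close it when the last ref drops.
  acquire(key: string): () => void {
    let entry = this.conns.get(key);
    if (!entry) {
      entry = { conn: this.connect(key), refs: 0 };
      this.conns.set(key, entry);
    }
    const e = entry;
    e.refs++;
    let released = false;
    return () => {
      if (released) return; // disposal is idempotent
      released = true;
      e.refs--;
      if (e.refs === 0) {
        e.conn.close();
        this.conns.delete(key);
      }
    };
  }

  // Bounded exponential backoff keeps reconnect churn from amplifying load.
  static backoffMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
    return Math.min(maxMs, baseMs * 2 ** attempt);
  }

  openCount(): number { return this.conns.size; }
}
```

With one manager owning all connections, the client-side metrics in step 3 fall out naturally: `openCount()` per key is the subscription gauge, and every reconnect attempt passes through `backoffMs`.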
#### Acceptance criteria
- Idle screens maintain stable websocket counts.
- Transient socket failures do not create refresh storms.
- The client can explain why any given refresh happened.
### 7. Clean up HTTP surface after realtime migration
Do not delete bootstrap endpoints first. Shrink them after the subscription model is working.
#### Changes
1. Keep one-shot bootstrap/read endpoints only where they still add value:
- initial app load
- initial workbench load
- deep-link fallback
2. Remove or de-emphasize monolithic snapshot endpoints for steady-state use.
3. Keep HTTP for control-plane and external integrations.
#### Acceptance criteria
- Main interactive screens do not depend on polling.
- Snapshot endpoints are bootstrap/fallback paths, not the primary UI contract.
## Suggested Implementation Order
1. Runtime hardening in RivetKit
2. `01-task-creation-bootstrap-only.md`
3. `03-repo-actions-via-background-workflow.md`
4. `06-daytona-provisioning-staged-background-flow.md`
5. App shell realtime subscription model
6. `02-repo-overview-from-cached-projection.md`
7. Organization summary projection
8. `04-workbench-session-creation-without-inline-provisioning.md`
9. `05-workbench-snapshot-from-derived-state.md`
10. Task-detail direct actor reads/subscriptions
11. Client subscription lifecycle cleanup
12. `07-auth-identity-simplification.md`
## Why This Order
- Runtime hardening removes the most dangerous correctness bug before more UI load shifts onto actor connections.
- The first async workflow items reduce the biggest user-visible stalls quickly.
- App shell realtime is smaller and lower-risk than the workbench migration, and it removes the current polling loop.
- Organization summary and task-detail split should happen after the async workflow moves so the projection model does not encode old synchronous assumptions.
- Auth simplification is valuable but not required to remove the current refresh/polling/runtime problems.
## Observability Requirements
Before or alongside implementation, add metrics/logs for:
- app snapshot bootstrap duration
- workbench bootstrap duration
- actor connection count by actor type and view
- reconnect count by actor key
- projection rebuild/update duration
- workflow queue latency
- actor drain duration and active-action counts during stop
Each log line should include a request id or actor/event correlation id where possible.
## Rollout Strategy
1. Ship runtime hardening and observability first.
2. Ship app-shell realtime behind a client flag while keeping snapshot bootstrap.
3. Ship organization summary projection behind a separate flag.
4. Migrate one heavy detail pane at a time off the monolithic workbench payload.
5. Remove polling once the matching event path is proven stable.
6. Only then remove or demote the old snapshot-heavy steady-state flows.
## Done Means
This initiative is done when all of the following are true:
- no user-visible screen depends on `500ms` polling
- no list view recomputes deep task/session/diff state inline on every refresh
- long-running repo/provider/sandbox work always runs in durable background workflows
- the client connects only to actors relevant to the current view and disposes them when the view changes
- websocket counts stay stable on idle screens
- actor shutdown cannot invalidate `c.db` underneath active actions