sandbox-agent/foundry/research/specs/async-action-fixes/00-end-to-end-async-realtime-plan.md
2026-03-14 20:42:18 -07:00

# End-To-End Async + Realtime Plan
## Purpose
This is the umbrella plan for the Foundry issues we traced across app shell, workbench, and actor runtime behavior:
- long-running work still sits inline in request/action paths
- monolithic snapshot reads fan out across too many actors
- the client uses polling and full refreshes where it should use realtime subscriptions
- websocket subscriptions reconnect too aggressively
- actor shutdown can race in-flight actions and clear `c.db` underneath them
The goal is not just to make individual endpoints faster. The goal is to move Foundry to a model where:
- request paths only validate, create minimal state, and enqueue background work
- list views read actor-owned projections instead of recomputing deep state
- detail views connect directly to the actor that owns the visible state
- polling is replaced by actor events and bounded bootstrap fetches
- actor shutdown drains active work before cleaning up resources
## Problem Summary
### App shell
- `getAppSnapshot` still rebuilds app shell state by reading the app session row and fanning out to every eligible organization actor.
- `RemoteFoundryAppStore` still polls every `500ms` while any org is `syncing`.
- Org sync/import is now off the select path, but the steady-state read path is still snapshot-based instead of subscription-based.
### Workbench
- `getWorkbench` still represents a monolithic organization read that aggregates repo, repository, and task state.
- The remote workbench store still responds to every event by pulling a full fresh snapshot.
- Some task/workbench detail is still too expensive to compute inline and too broad to refresh after every mutation.
### Realtime transport
- `subscribeWorkbench` and related connection helpers keep one connection per shared key, but the client contract still treats the socket as an invalidation channel for a later snapshot pull.
- Reconnect/error handling is weak, so connection churn amplifies backend load instead of settling into long-lived subscriptions.
### Runtime
- RivetKit currently lets shutdown proceed far enough to clean up actor resources while actions are still in flight or still being routed to the actor.
- That creates the `Database not enabled` / missing `c.db` failure mode under stop/replay pressure.
## Target Architecture
### Request-path rule
Every request/action should do only one of these:
1. return actor-owned cached state
2. persist a cheap mutation
3. enqueue or signal background work
Requests should not block on provider calls, repo sync, sandbox provisioning, transcript enumeration, or deep cross-actor fan-out unless the UI cannot render at all without the result.
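The request-path rule can be sketched as a handler that only validates, writes a minimal row, and enqueues. This is a hypothetical shape, not Foundry's real API: `Deps`, `insertTask`, and the `bootstrapTask` job kind are illustrative names.

```typescript
// Hypothetical handler following the request-path rule; Deps, insertTask,
// and the "bootstrapTask" job kind are illustrative, not Foundry's real APIs.
type TaskRow = { id: string; title: string; status: "queued" };

interface Deps {
  nextId(): string;                                     // id allocation
  insertTask(row: TaskRow): void;                       // cheap durable mutation
  enqueue(job: { kind: string; taskId: string }): void; // background work
}

function createTask(deps: Deps, title: string): TaskRow {
  if (!title.trim()) throw new Error("title required");    // 1. validate only
  const row: TaskRow = { id: deps.nextId(), title, status: "queued" };
  deps.insertTask(row);                                    // 2. minimal state
  deps.enqueue({ kind: "bootstrapTask", taskId: row.id }); // 3. no inline provisioning
  return row; // caller can render "queued" immediately
}
```

The key property is that nothing in `createTask` awaits a provider call or sandbox: the returned row is enough for the UI to render, and everything else arrives later via events.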
### View-model rule
- App shell view connects to app/session state and only the org actors visible on screen.
- Organization/task-list view connects to an organization-owned summary projection.
- Task detail view connects directly to the selected task actor.
- Sandbox/session detail connects only when the user opens that detail.
Do not replace one monolith with one connection per row. List screens should still come from actor-owned projections.
### Runtime rule
Stopping actors must stop accepting new work and must not clear actor resources until active actions and requests have drained or been cancelled.
## Workstreams
### 1. Runtime hardening first
This is the only workstream that is not Foundry-only. It should start immediately because it is the only direct fix for the `c.db` shutdown race.
#### Changes
1. Add active action/request accounting in RivetKit actor instances.
2. Mark actors as draining before cleanup starts.
3. Reject or reroute new requests/actions once draining begins.
4. Wait for active actions to finish or abort before `#cleanupDatabase()` runs.
5. Delay clearing `#db` until no active actions remain.
6. Add actor stop logs with:
- actor id
- active action count
- active request count
- drain start/end timestamps
- cleanup start/end timestamps
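The accounting in steps 1-5 can be sketched as a small drain-aware wrapper. This is an illustrative model, not RivetKit's real internals: `ActorRuntime` and its method names are assumptions standing in for the actual instance code.

```typescript
// Hypothetical drain-aware accounting; ActorRuntime and its method names
// are illustrative, not RivetKit's real internals.
class ActorRuntime {
  private activeActions = 0;
  private draining = false;
  private drainWaiters: Array<() => void> = [];

  // Wrap every action entry point; reject once draining has begun.
  async runAction<T>(fn: () => Promise<T>): Promise<T> {
    if (this.draining) throw new Error("actor is draining");
    this.activeActions++;
    try {
      return await fn();
    } finally {
      this.activeActions--;
      if (this.activeActions === 0) {
        this.drainWaiters.forEach((resolve) => resolve());
        this.drainWaiters = [];
      }
    }
  }

  // Stop accepting work, wait for in-flight actions, then run cleanup
  // (the equivalent of #cleanupDatabase) only after the drain completes.
  async stop(cleanup: () => Promise<void>): Promise<void> {
    this.draining = true;
    if (this.activeActions > 0) {
      await new Promise<void>((resolve) => this.drainWaiters.push(resolve));
    }
    await cleanup();
  }
}
```

Because cleanup is sequenced strictly after the drain, no in-flight action can observe a cleared `#db`, and no new action can enter user code once `stop` has begun.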
#### Acceptance criteria
- No action can successfully enter user code after actor draining begins.
- `Database not enabled` cannot be produced by an in-flight action after stop has begun.
- Stop logs make it obvious whether shutdown delay is run-handler time, active-action drain time, background promise time, or routing delay.
### 2. App shell moves from snapshot polling to subscriptions
The app shell should stop using `/app/snapshot` as the steady-state read model.
#### Changes
1. Introduce a small app-shell projection owned by the app organization actor:
- auth status
- current user summary
- active org id
- visible org ids
- per-org lightweight status summary
2. Add app actor events, for example:
- `appSessionUpdated`
- `activeOrganizationChanged`
- `organizationSyncStatusChanged`
3. Expose connection helpers from the backend client for:
- app actor subscription
- organization actor subscription by id
4. Update `RemoteFoundryAppStore` so it:
- does one bootstrap fetch on first subscribe
- connects to the app actor for ongoing updates
- connects only to the org actors needed for the current view
- disposes org subscriptions when they are no longer visible
5. Remove `scheduleSyncPollingIfNeeded()` and the `500ms` refresh loop.
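The bootstrap-then-subscribe shape in steps 1-5 could look roughly like this. The projection fields and event names follow this plan, but `AppActorConnection` is a hypothetical transport interface, not the real backend client.

```typescript
// Sketch of the bootstrap-then-subscribe pattern; the projection fields and
// event names follow the plan, but AppActorConnection is a hypothetical
// transport interface, not the real backend client.
type AppShellProjection = {
  authStatus: "authed" | "anonymous";
  activeOrgId: string | null;
  visibleOrgIds: string[];
  orgSyncState: Record<string, "idle" | "syncing" | "error">;
};

type AppEvent = "appSessionUpdated" | "organizationSyncStatusChanged";

interface AppActorConnection {
  on(event: AppEvent, handler: (patch: Partial<AppShellProjection>) => void): void;
}

class AppShellStore {
  private state: AppShellProjection | null = null;

  // One bootstrap fetch, then actor events keep the state current;
  // there is no polling loop to schedule or tear down.
  async start(
    bootstrap: () => Promise<AppShellProjection>,
    conn: AppActorConnection,
  ): Promise<void> {
    this.state = await bootstrap();
    conn.on("appSessionUpdated", (patch) => this.apply(patch));
    conn.on("organizationSyncStatusChanged", (patch) => this.apply(patch));
  }

  private apply(patch: Partial<AppShellProjection>): void {
    if (this.state) this.state = { ...this.state, ...patch };
  }

  snapshot(): AppShellProjection | null { return this.state; }
}
```

Per-org actor subscriptions would layer on top of this, acquired only for `visibleOrgIds` and disposed when an org leaves the viewport.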
#### Likely files
- `foundry/packages/backend/src/actors/organization/app-shell.ts`
- `foundry/packages/client/src/backend-client.ts`
- `foundry/packages/client/src/remote/app-client.ts`
- `foundry/packages/shared/src/app-shell.ts`
- app shell frontend consumers
#### Acceptance criteria
- No app shell polling loop remains.
- Selecting an org returns quickly and the UI updates from actor events.
- App shell refresh cost is bounded by visible state, not every eligible organization on every poll.
### 3. Organization summary becomes a projection, not a full snapshot
The task list should read an organization-owned summary projection instead of calling into every task actor on each refresh.
#### Changes
1. Define a durable organization summary model with only list-screen fields:
- repo summary
- repository summary
- task summary
- selected/open task ids
- unread/session status summary
- coarse git/PR state summary
2. Update organization actor workflows so task/repository changes incrementally update this projection.
3. Change `getWorkbench` to return the projection only.
4. Change `workbenchUpdated` from "invalidate and refetch everything" to "here is the updated projection version or changed entity ids".
5. Remove task-actor fan-out from the default list read path.
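The payload-bearing `workbenchUpdated` contract from step 4 could be modeled as a discriminated union. The field names and `OrgSummary` shape are assumptions for illustration, not the current wire format.

```typescript
// Sketch of a payload-bearing workbenchUpdated contract; the field names
// and OrgSummary shape are assumptions, not the current wire format.
type OrgSummary = { version: number; taskCount: number; openTaskIds: string[] };

type WorkbenchUpdated =
  | { kind: "projection"; version: number; summary: OrgSummary }
  | { kind: "entities"; version: number; changedTaskIds: string[] };

// The client applies an update in place instead of refetching the snapshot;
// version checks drop stale or out-of-order events.
function applyUpdate(current: OrgSummary, ev: WorkbenchUpdated): OrgSummary {
  if (ev.version <= current.version) return current; // stale event
  if (ev.kind === "projection") return ev.summary;   // full projection payload
  // Entity-level update: only ev.changedTaskIds need a targeted re-read.
  return { ...current, version: ev.version };
}
```

Carrying a monotonic version in every event also gives the client a cheap way to detect gaps after reconnect and fall back to a single bootstrap read.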
#### Likely files
- `foundry/packages/backend/src/actors/organization/actions.ts`
- `foundry/packages/backend/src/actors/repository/actions.ts`
- `foundry/packages/backend/src/actors/task/index.ts`
- `foundry/packages/backend/src/actors/task/workbench.ts`
- task/organization DB schema and migrations
- `foundry/packages/client/src/remote/workbench-client.ts`
#### Acceptance criteria
- Workbench list refresh does not call every task actor.
- A websocket event does not force a full cross-actor rebuild.
- Initial task-list load time scales roughly with organization summary size, not repo count times task count times detail reads.
### 4. Task detail moves to direct actor reads and events
Heavy task detail should move out of the organization summary and into the selected task actor.
#### Changes
1. Split task detail into focused reads/subscriptions:
- task header/meta
- tabs/session summary
- transcript stream
- diff/file tree
- sandbox process state
2. Open a task actor connection only for the selected task.
3. Open sandbox/session subscriptions only for the active tab/pane.
4. Dispose those subscriptions when the user changes selection.
5. Keep expensive derived state cached in actor-owned tables and update it from background jobs or event ingestion.
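Steps 2-4 reduce to one invariant: exactly the selected task holds an open connection. A minimal sketch, where `connectTask` is a hypothetical helper standing in for the real task-actor connection:

```typescript
// Minimal sketch of selection-scoped subscriptions; connectTask is a
// hypothetical helper standing in for the real task-actor connection.
type Disposable = { dispose(): void };

class TaskDetailController {
  private current: { taskId: string; sub: Disposable } | null = null;

  constructor(private connectTask: (taskId: string) => Disposable) {}

  // Only the selected task actor holds an open connection.
  select(taskId: string | null): void {
    if (this.current?.taskId === taskId) return; // same selection: no churn
    this.current?.sub.dispose();                 // drop the previous task's socket
    this.current = taskId
      ? { taskId, sub: this.connectTask(taskId) }
      : null;
  }
}
```

The same shape applies one level down for sandbox/session subscriptions keyed on the active tab or pane.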
#### Acceptance criteria
- Opening the task list does not open connections to every task actor.
- Opening a task shows staged loading for heavy panes instead of blocking the whole workbench snapshot.
- Transcript, diff, and file-tree reads are not recomputed for unrelated tasks.
### 5. Finish moving long-running mutations to background workflows
This extends and completes the existing async-action briefs in this folder.
#### Existing briefs to implement under this workstream
1. `01-task-creation-bootstrap-only.md`
2. `02-repo-overview-from-cached-projection.md`
3. `03-repo-actions-via-background-workflow.md`
4. `04-workbench-session-creation-without-inline-provisioning.md`
5. `05-workbench-snapshot-from-derived-state.md`
6. `06-daytona-provisioning-staged-background-flow.md`
#### Additional rule
Every workflow-backed mutation should leave behind durable status rows or events that realtime clients can observe without polling.
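The durable-status rule above can be sketched as a status row each workflow step writes and realtime clients observe. The table shape and `record` API are assumptions for illustration, not Foundry's actual schema.

```typescript
// Sketch of a durable workflow status row; the row shape and API are
// assumptions for illustration, not Foundry's actual schema.
type WorkflowStatus = {
  workflowId: string;
  step: string;
  state: "running" | "done" | "failed";
  updatedAt: number;
};

class StatusStore {
  private rows = new Map<string, WorkflowStatus>();
  private listeners: Array<(row: WorkflowStatus) => void> = [];

  // Each workflow step writes a row; subscribed realtime clients see the
  // transition without polling, and the row survives for late readers.
  record(workflowId: string, step: string, state: WorkflowStatus["state"]): void {
    const row = { workflowId, step, state, updatedAt: Date.now() };
    this.rows.set(workflowId, row);
    this.listeners.forEach((l) => l(row));
  }

  subscribe(l: (row: WorkflowStatus) => void): void { this.listeners.push(l); }
  get(workflowId: string): WorkflowStatus | undefined { return this.rows.get(workflowId); }
}
```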
### 6. Subscription lifecycle and reconnect behavior need one shared model
The current client-side connection pattern is too ad hoc. It needs a single lifecycle policy so sockets are long-lived and bounded.
#### Changes
1. Create one shared subscription manager in the client for:
- reference counting
- connection reuse
- reconnect backoff
- connection state events
- clean disposal
2. Make invalidation optional. Prefer payload-bearing events or projection version updates.
3. Add structured logs/metrics in the client for:
- connection created/disposed
- reconnect attempts
- subscription count per actor key
- refresh triggered by event vs bootstrap vs mutation
4. Stop calling full `refresh()` after every mutation when the mutation result or follow-up event already contains enough state to update locally.
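The reference counting, reuse, and backoff pieces of step 1 fit in one small manager. This is a sketch of the shared lifecycle policy, not the real client code; the `Conn` shape and `connect` callback are hypothetical stand-ins for the websocket helpers.

```typescript
// Sketch of the shared subscription manager; the Conn shape and connect
// callback are hypothetical stand-ins for the real websocket helpers.
type Conn = { key: string; close(): void };

class SubscriptionManager {
  private conns = new Map<string, { conn: Conn; refs: number }>();

  constructor(private connect: (key: string) => Conn) {}

  // Reuse one connection per actor key; close it when the last ref drops.
  acquire(key: string): () => void {
    let entry = this.conns.get(key);
    if (!entry) {
      entry = { conn: this.connect(key), refs: 0 };
      this.conns.set(key, entry);
    }
    const e = entry;
    e.refs++;
    let released = false;
    return () => {
      if (released) return; // disposal is idempotent
      released = true;
      e.refs--;
      if (e.refs === 0) {
        e.conn.close();
        this.conns.delete(key);
      }
    };
  }

  // Bounded exponential backoff keeps reconnect churn from amplifying load.
  static backoffMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
    return Math.min(maxMs, baseMs * 2 ** attempt);
  }

  openCount(): number { return this.conns.size; }
}
```

With one manager owning all connections, the client-side metrics in step 3 fall out naturally: `openCount()` per key is the subscription gauge, and every reconnect attempt passes through `backoffMs`.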
#### Acceptance criteria
- Idle screens maintain stable websocket counts.
- Transient socket failures do not create refresh storms.
- The client can explain why any given refresh happened.
### 7. Clean up HTTP surface after realtime migration
Do not delete bootstrap endpoints first. Shrink them after the subscription model is working.
#### Changes
1. Keep one-shot bootstrap/read endpoints only where they still add value:
- initial app load
- initial workbench load
- deep-link fallback
2. Remove or de-emphasize monolithic snapshot endpoints for steady-state use.
3. Keep HTTP for control-plane and external integrations.
#### Acceptance criteria
- Main interactive screens do not depend on polling.
- Snapshot endpoints are bootstrap/fallback paths, not the primary UI contract.
## Suggested Implementation Order
1. Runtime hardening in RivetKit
2. `01-task-creation-bootstrap-only.md`
3. `03-repo-actions-via-background-workflow.md`
4. `06-daytona-provisioning-staged-background-flow.md`
5. App shell realtime subscription model
6. `02-repo-overview-from-cached-projection.md`
7. Organization summary projection
8. `04-workbench-session-creation-without-inline-provisioning.md`
9. `05-workbench-snapshot-from-derived-state.md`
10. Task-detail direct actor reads/subscriptions
11. Client subscription lifecycle cleanup
12. `07-auth-identity-simplification.md`
## Why This Order
- Runtime hardening removes the most dangerous correctness bug before more UI load shifts onto actor connections.
- The first async workflow items reduce the biggest user-visible stalls quickly.
- App shell realtime is smaller and lower-risk than the workbench migration, and it removes the current polling loop.
- Organization summary and task-detail split should happen after the async workflow moves so the projection model does not encode old synchronous assumptions.
- Auth simplification is valuable but not required to remove the current refresh/polling/runtime problems.
## Observability Requirements
Before or alongside implementation, add metrics/logs for:
- app snapshot bootstrap duration
- workbench bootstrap duration
- actor connection count by actor type and view
- reconnect count by actor key
- projection rebuild/update duration
- workflow queue latency
- actor drain duration and active-action counts during stop
Each log line should include a request id or actor/event correlation id where possible.
## Rollout Strategy
1. Ship runtime hardening and observability first.
2. Ship app-shell realtime behind a client flag while keeping snapshot bootstrap.
3. Ship organization summary projection behind a separate flag.
4. Migrate one heavy detail pane at a time off the monolithic workbench payload.
5. Remove polling once the matching event path is proven stable.
6. Only then remove or demote the old snapshot-heavy steady-state flows.
## Done Means
This initiative is done when all of the following are true:
- no user-visible screen depends on `500ms` polling
- no list view recomputes deep task/session/diff state inline on every refresh
- long-running repo/provider/sandbox work always runs in durable background workflows
- the client connects only to actors relevant to the current view and disposes them when the view changes
- websocket counts stay stable on idle screens
- actor shutdown cannot invalidate `c.db` underneath active actions