End-To-End Async + Realtime Plan
Purpose
This is the umbrella plan for the Foundry issues we traced across app shell, workbench, and actor runtime behavior:
- long-running work still sits inline in request/action paths
- monolithic snapshot reads fan out across too many actors
- the client uses polling and full refreshes where it should use realtime subscriptions
- websocket subscriptions reconnect too aggressively
- actor shutdown can race in-flight actions and clear `c.db` underneath them
The goal is not just to make individual endpoints faster. The goal is to move Foundry to a model where:
- request paths only validate, create minimal state, and enqueue background work
- list views read actor-owned projections instead of recomputing deep state
- detail views connect directly to the actor that owns the visible state
- polling is replaced by actor events and bounded bootstrap fetches
- actor shutdown drains active work before cleaning up resources
Problem Summary
App shell
- `getAppSnapshot` still rebuilds app shell state by reading the app session row and fanning out to every eligible organization actor.
- `RemoteFoundryAppStore` still polls every `500ms` while any org is `syncing`.
- Org sync/import is now off the select path, but the steady-state read path is still snapshot-based instead of subscription-based.
Workbench
- `getWorkbench` still represents a monolithic workspace read that aggregates repo, project, and task state.
- The remote workbench store still responds to every event by pulling a full fresh snapshot.
- Some task/workbench detail is still too expensive to compute inline and too broad to refresh after every mutation.
Realtime transport
- `subscribeWorkbench` and related connection helpers keep one connection per shared key, but the client contract still treats the socket as an invalidation channel for a later snapshot pull.
- Reconnect/error handling is weak, so connection churn amplifies backend load instead of settling into long-lived subscriptions.
Runtime
- RivetKit currently lets shutdown proceed far enough to clean up actor resources while actions can still be in flight or still be routed to the actor.
- That creates the `Database not enabled` / missing `c.db` failure mode under stop/replay pressure.
Target Architecture
Request-path rule
Every request/action should do only one of these:
- return actor-owned cached state
- persist a cheap mutation
- enqueue or signal background work
Requests should not block on provider calls, repo sync, sandbox provisioning, transcript enumeration, or deep cross-actor fan-out unless the UI cannot render at all without the result.
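The request-path rule can be sketched as follows. All names here (`createTask`, `backgroundQueue`, the `TaskRow` shape) are illustrative assumptions, not Foundry's real API; the point is the shape of the handler: validate, write a cheap row, enqueue, return.

```typescript
// Hypothetical sketch of the request-path rule: validate, persist a cheap
// row, enqueue background work, and return immediately.

type TaskRow = { id: string; status: "queued" | "running" | "done" };

const tasks = new Map<string, TaskRow>();
const backgroundQueue: Array<() => void> = [];

// Request path: no provider calls, no repo sync, no sandbox provisioning inline.
function createTask(id: string): TaskRow {
  if (id.length === 0) throw new Error("task id required"); // validate only
  if (tasks.has(id)) throw new Error(`task ${id} already exists`);
  const row: TaskRow = { id, status: "queued" };            // cheap mutation
  tasks.set(id, row);
  backgroundQueue.push(() => {                              // enqueue, don't block
    row.status = "running";
    // ...expensive repo sync / provisioning would run here...
    row.status = "done";
  });
  return row;                                               // returns while work is queued
}

// A worker loop (simulated) drains the queue outside the request path.
function drainQueue(): void {
  while (backgroundQueue.length > 0) backgroundQueue.shift()!();
}
```

The caller gets a `queued` row back without waiting for the expensive work; a durable queue and status events would replace the in-memory array in a real implementation.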
View-model rule
- App shell view connects to app/session state and only the org actors visible on screen.
- Workspace/task-list view connects to a workspace-owned summary projection.
- Task detail view connects directly to the selected task actor.
- Sandbox/session detail connects only when the user opens that detail.
Do not replace one monolith with one connection per row. List screens should still come from actor-owned projections.
Runtime rule
Stopping actors must stop accepting new work and must not clear actor resources until active actions and requests have drained or been cancelled.
Workstreams
1. Runtime hardening first
This is the only workstream that is not Foundry-only. It should start immediately because it is the only direct fix for the `c.db` shutdown race.
Changes
- Add active action/request accounting in RivetKit actor instances.
- Mark actors as draining before cleanup starts.
- Reject or reroute new requests/actions once draining begins.
- Wait for active actions to finish or abort before `#cleanupDatabase()` runs.
- Delay clearing `#db` until no active actions remain.
- Add actor stop logs with:
- actor id
- active action count
- active request count
- drain start/end timestamps
- cleanup start/end timestamps
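The drain-accounting changes above can be sketched like this. This is a minimal model with assumed names (`DrainingActor`, `runAction`, `hasDb`), not RivetKit's real internals: in-flight actions are counted, a draining flag is flipped before cleanup, new actions are rejected once draining, and resources are cleared only at zero.

```typescript
// Sketch of drain-aware actor shutdown (names are assumptions, not RivetKit API).
class DrainingActor {
  private active = 0;
  private draining = false;
  private db: object | null = {};
  private onIdle: Array<() => void> = [];

  async runAction<T>(fn: () => Promise<T>): Promise<T> {
    if (this.draining) throw new Error("actor is draining; action rejected");
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      if (this.active === 0) this.onIdle.splice(0).forEach((resolve) => resolve());
    }
  }

  async stop(): Promise<void> {
    this.draining = true;                  // stop accepting new work first
    if (this.active > 0) {
      // drain: wait for in-flight actions before touching resources
      await new Promise<void>((resolve) => this.onIdle.push(resolve));
    }
    this.db = null;                        // clear resources only after drain
  }

  get hasDb(): boolean {
    return this.db !== null;
  }
}
```

An in-flight action started before `stop()` still sees a live `db` for its whole lifetime, which is exactly the invariant the acceptance criteria below demand.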
Acceptance criteria
- No action can successfully enter user code after actor draining begins.
- `Database not enabled` cannot be produced by an in-flight action after stop has begun.
- Stop logs make it obvious whether shutdown delay is run-handler time, active-action drain time, background promise time, or routing delay.
2. App shell moves from snapshot polling to subscriptions
The app shell should stop using `/app/snapshot` as the steady-state read model.
Changes
- Introduce a small app-shell projection owned by the app workspace actor:
- auth status
- current user summary
- active org id
- visible org ids
- per-org lightweight status summary
- Add app actor events, for example:
- `appSessionUpdated`
- `activeOrganizationChanged`
- `organizationSyncStatusChanged`
- Expose connection helpers from the backend client for:
- app actor subscription
- organization actor subscription by id
- Update `RemoteFoundryAppStore` so it:
- does one bootstrap fetch on first subscribe
- connects to the app actor for ongoing updates
- connects only to the org actors needed for the current view
- disposes org subscriptions when they are no longer visible
- Remove `scheduleSyncPollingIfNeeded()` and the `500ms` refresh loop.
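The intended client contract can be sketched as a store that bootstraps once and then applies payload-bearing events locally. The event names come from the list above; the store shape and handler signatures are assumptions for illustration.

```typescript
// Sketch of an event-driven app-shell store: one bootstrap, no polling loop.
type OrgStatus = { id: string; syncing: boolean };
type AppShellState = { activeOrgId: string | null; orgs: Map<string, OrgStatus> };

class AppShellStore {
  state: AppShellState = { activeOrgId: null, orgs: new Map() };
  bootstrapCount = 0;

  // Called once on first subscribe, never on a timer.
  bootstrap(initial: { activeOrgId: string | null; orgs: OrgStatus[] }): void {
    this.bootstrapCount++;
    this.state.activeOrgId = initial.activeOrgId;
    this.state.orgs = new Map(initial.orgs.map((o) => [o.id, o]));
  }

  // Event handlers patch state in place; no snapshot refetch.
  onActiveOrganizationChanged(orgId: string): void {
    this.state.activeOrgId = orgId;
  }

  onOrganizationSyncStatusChanged(orgId: string, syncing: boolean): void {
    const org = this.state.orgs.get(orgId);
    if (org) org.syncing = syncing;
  }
}
```

Because each event carries its payload, the store never needs the `500ms` refresh loop: state converges from the bootstrap plus applied events.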
Likely files
- `foundry/packages/backend/src/actors/workspace/app-shell.ts`
- `foundry/packages/client/src/backend-client.ts`
- `foundry/packages/client/src/remote/app-client.ts`
- `foundry/packages/shared/src/app-shell.ts`
- app shell frontend consumers
Acceptance criteria
- No app shell polling loop remains.
- Selecting an org returns quickly and the UI updates from actor events.
- App shell refresh cost is bounded by visible state, not every eligible organization on every poll.
3. Workspace summary becomes a projection, not a full snapshot
The task list should read a workspace-owned summary projection instead of calling into every task actor on each refresh.
Changes
- Define a durable workspace summary model with only list-screen fields:
- repo summary
- project summary
- task summary
- selected/open task ids
- unread/session status summary
- coarse git/PR state summary
- Update workspace actor workflows so task/project changes incrementally update this projection.
- Change `getWorkbench` to return the projection only.
- Change `workbenchUpdated` from "invalidate and refetch everything" to "here is the updated projection version or changed entity ids".
- Remove task-actor fan-out from the default list read path.
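A minimal sketch of the projection model, assuming a much-reduced summary shape (the real model would carry the repo/project/git fields listed above): mutations patch the summary in place and bump a version, and `getWorkbench` just returns it without touching any task actor.

```typescript
// Illustrative workspace summary projection (field names are assumptions).
type TaskSummary = { id: string; title: string; open: boolean };
type WorkspaceSummary = { version: number; tasks: Map<string, TaskSummary> };

const summary: WorkspaceSummary = { version: 0, tasks: new Map() };

// Each task change patches the projection and bumps its version, so clients
// can diff cheaply instead of refetching everything.
function applyTaskChange(task: TaskSummary): number {
  summary.tasks.set(task.id, task);
  summary.version++;
  return summary.version;
}

// getWorkbench under this model: return the projection, call no task actors.
function getWorkbench(): WorkspaceSummary {
  return summary;
}
```

A `workbenchUpdated` event would then carry the new version (or changed ids), letting clients skip refreshes they have already applied.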
Likely files
- `foundry/packages/backend/src/actors/workspace/actions.ts`
- `foundry/packages/backend/src/actors/project/actions.ts`
- `foundry/packages/backend/src/actors/task/index.ts`
- `foundry/packages/backend/src/actors/task/workbench.ts`
- task/workspace DB schema and migrations
- `foundry/packages/client/src/remote/workbench-client.ts`
Acceptance criteria
- Workbench list refresh does not call every task actor.
- A websocket event does not force a full cross-actor rebuild.
- Initial task-list load time scales roughly with workspace summary size, not repo count times task count times detail reads.
4. Task detail moves to direct actor reads and events
Heavy task detail should move out of the workspace summary and into the selected task actor.
Changes
- Split task detail into focused reads/subscriptions:
- task header/meta
- tabs/session summary
- transcript stream
- diff/file tree
- sandbox process state
- Open a task actor connection only for the selected task.
- Open sandbox/session subscriptions only for the active tab/pane.
- Dispose those subscriptions when the user changes selection.
- Keep expensive derived state cached in actor-owned tables and update it from background jobs or event ingestion.
Acceptance criteria
- Opening the task list does not open connections to every task actor.
- Opening a task shows staged loading for heavy panes instead of blocking the whole workbench snapshot.
- Transcript, diff, and file-tree reads are not recomputed for unrelated tasks.
5. Finish moving long-running mutations to background workflows
This extends and completes the existing async-action briefs in this folder.
Existing briefs to implement under this workstream
- `01-task-creation-bootstrap-only.md`
- `02-repo-overview-from-cached-projection.md`
- `03-repo-actions-via-background-workflow.md`
- `04-workbench-session-creation-without-inline-provisioning.md`
- `05-workbench-snapshot-from-derived-state.md`
- `06-daytona-provisioning-staged-background-flow.md`
Additional rule
Every workflow-backed mutation should leave behind durable status rows or events that realtime clients can observe without polling.
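For example, a workflow step recorder might look like this sketch (row shape and names are assumptions): each step writes a durable row for bootstrap reads and emits the same payload as an event for connected clients.

```typescript
// Hypothetical durable status row left behind by a workflow-backed mutation.
type WorkflowStatusRow = {
  workflowId: string;
  step: "queued" | "provisioning" | "ready" | "failed";
  updatedAt: number;
};

const statusRows = new Map<string, WorkflowStatusRow>(); // stand-in for a DB table
const emitted: WorkflowStatusRow[] = [];                  // stand-in for an event bus

function recordStep(workflowId: string, step: WorkflowStatusRow["step"]): void {
  const row: WorkflowStatusRow = { workflowId, step, updatedAt: Date.now() };
  statusRows.set(workflowId, row);  // durable row: late subscribers bootstrap from it
  emitted.push(row);                // event: live subscribers observe it immediately
}
```

A client that connects mid-workflow reads the latest row once, then follows events, so no polling path is needed.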
6. Subscription lifecycle and reconnect behavior need one shared model
The current client-side connection pattern is too ad hoc. It needs a single lifecycle policy so sockets are long-lived and bounded.
Changes
- Create one shared subscription manager in the client for:
- reference counting
- connection reuse
- reconnect backoff
- connection state events
- clean disposal
- Make invalidation optional. Prefer payload-bearing events or projection version updates.
- Add structured logs/metrics in the client for:
- connection created/disposed
- reconnect attempts
- subscription count per actor key
- refresh triggered by event vs bootstrap vs mutation
- Stop calling full `refresh()` after every mutation when the mutation result or follow-up event already contains enough state to update locally.
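The shared manager could look like this sketch (all names assumed): connections are reference-counted per actor key and reused, disposal is idempotent, and reconnects use capped exponential backoff.

```typescript
// Sketch of a shared, reference-counted subscription manager.
class SubscriptionManager {
  private refs = new Map<string, number>();
  connectionsOpened = 0;

  subscribe(actorKey: string): () => void {
    const count = this.refs.get(actorKey) ?? 0;
    if (count === 0) this.connectionsOpened++;   // first subscriber opens the socket
    this.refs.set(actorKey, count + 1);
    let disposed = false;
    return () => {                               // clean, idempotent disposal
      if (disposed) return;
      disposed = true;
      const remaining = (this.refs.get(actorKey) ?? 1) - 1;
      if (remaining === 0) this.refs.delete(actorKey); // last subscriber closes it
      else this.refs.set(actorKey, remaining);
    };
  }

  refCount(actorKey: string): number {
    return this.refs.get(actorKey) ?? 0;
  }

  // Capped exponential backoff for reconnect attempt n (0-based).
  backoffMs(attempt: number, baseMs = 250, capMs = 10_000): number {
    return Math.min(capMs, baseMs * 2 ** attempt);
  }
}
```

With one manager owning all sockets, the structured logs listed above (created/disposed, reconnect attempts, per-key subscription counts) have a single natural home.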
Acceptance criteria
- Idle screens maintain stable websocket counts.
- Transient socket failures do not create refresh storms.
- The client can explain why any given refresh happened.
7. Clean up HTTP surface after realtime migration
Do not delete bootstrap endpoints first. Shrink them after the subscription model is working.
Changes
- Keep one-shot bootstrap/read endpoints only where they still add value:
- initial app load
- initial workbench load
- deep-link fallback
- Remove or de-emphasize monolithic snapshot endpoints for steady-state use.
- Keep HTTP for control-plane and external integrations.
Acceptance criteria
- Main interactive screens do not depend on polling.
- Snapshot endpoints are bootstrap/fallback paths, not the primary UI contract.
Suggested Implementation Order
- Runtime hardening in RivetKit
- `01-task-creation-bootstrap-only.md`
- `03-repo-actions-via-background-workflow.md`
- `06-daytona-provisioning-staged-background-flow.md`
- App shell realtime subscription model
- `02-repo-overview-from-cached-projection.md`
- Workspace summary projection
- `04-workbench-session-creation-without-inline-provisioning.md`
- `05-workbench-snapshot-from-derived-state.md`
- Task-detail direct actor reads/subscriptions
- Client subscription lifecycle cleanup
- `07-auth-identity-simplification.md`
Why This Order
- Runtime hardening removes the most dangerous correctness bug before more UI load shifts onto actor connections.
- The first async workflow items reduce the biggest user-visible stalls quickly.
- App shell realtime is smaller and lower-risk than the workbench migration, and it removes the current polling loop.
- Workspace summary and task-detail split should happen after the async workflow moves so the projection model does not encode old synchronous assumptions.
- Auth simplification is valuable but not required to remove the current refresh/polling/runtime problems.
Observability Requirements
Before or alongside implementation, add metrics/logs for:
- app snapshot bootstrap duration
- workbench bootstrap duration
- actor connection count by actor type and view
- reconnect count by actor key
- projection rebuild/update duration
- workflow queue latency
- actor drain duration and active-action counts during stop
Each log line should include a request id or actor/event correlation id where possible.
Rollout Strategy
- Ship runtime hardening and observability first.
- Ship app-shell realtime behind a client flag while keeping snapshot bootstrap.
- Ship workspace summary projection behind a separate flag.
- Migrate one heavy detail pane at a time off the monolithic workbench payload.
- Remove polling once the matching event path is proven stable.
- Only then remove or demote the old snapshot-heavy steady-state flows.
Done Means
This initiative is done when all of the following are true:
- no user-visible screen depends on `500ms` polling
- no list view recomputes deep task/session/diff state inline on every refresh
- long-running repo/provider/sandbox work always runs in durable background workflows
- the client connects only to actors relevant to the current view and disposes them when the view changes
- websocket counts stay stable on idle screens
- actor shutdown cannot invalidate `c.db` underneath active actions