mirror of
https://github.com/harivansh-afk/sandbox-agent.git
synced 2026-04-18 21:00:49 +00:00
# End-To-End Async + Realtime Plan

## Purpose

This is the umbrella plan for the Foundry issues we traced across app shell, workbench, and actor runtime behavior:

- long-running work still sits inline in request/action paths
- monolithic snapshot reads fan out across too many actors
- the client uses polling and full refreshes where it should use realtime subscriptions
- websocket subscriptions reconnect too aggressively
- actor shutdown can race in-flight actions and clear `c.db` underneath them

The goal is not just to make individual endpoints faster. The goal is to move Foundry to a model where:

- request paths only validate, create minimal state, and enqueue background work
- list views read actor-owned projections instead of recomputing deep state
- detail views connect directly to the actor that owns the visible state
- polling is replaced by actor events and bounded bootstrap fetches
- actor shutdown drains active work before cleaning up resources

## Problem Summary

### App shell

- `getAppSnapshot` still rebuilds app shell state by reading the app session row and fanning out to every eligible organization actor.
- `RemoteFoundryAppStore` still polls every `500ms` while any org is `syncing`.
- Org sync/import is now off the select path, but the steady-state read path is still snapshot-based instead of subscription-based.

### Workbench

- `getWorkbench` still represents a monolithic organization read that aggregates repo, repository, and task state.
- The remote workbench store still responds to every event by pulling a full fresh snapshot.
- Some task/workbench detail is still too expensive to compute inline and too broad to refresh after every mutation.

### Realtime transport

- `subscribeWorkbench` and related connection helpers keep one connection per shared key, but the client contract still treats the socket as an invalidation channel for a later snapshot pull.
- Reconnect/error handling is weak, so connection churn amplifies backend load instead of settling into long-lived subscriptions.

### Runtime

- RivetKit currently lets shutdown proceed far enough to clean up actor resources while actions can still be in flight or still be routed to the actor.
- That creates the `Database not enabled` / missing `c.db` failure mode under stop/replay pressure.

## Target Architecture

### Request-path rule

Every request/action should do only one of these:

1. return actor-owned cached state
2. persist a cheap mutation
3. enqueue or signal background work

Requests should not block on provider calls, repo sync, sandbox provisioning, transcript enumeration, or deep cross-actor fan-out unless the UI cannot render at all without the result.

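
A minimal sketch of an action that follows the request-path rule, assuming an in-memory store and job queue. `InMemoryStore`, `createTask`, and the `provisionTask` job kind are hypothetical names for illustration, not Foundry's actual API. The action validates input, writes one cheap row, enqueues the slow work, and returns without waiting on it.

```typescript
// Hypothetical names throughout; this is an illustration, not Foundry code.
type TaskStatus = "queued" | "provisioning" | "ready" | "failed";

interface TaskRow {
  id: string;
  repoId: string;
  title: string;
  status: TaskStatus;
}

interface Job {
  kind: "provisionTask";
  taskId: string;
}

class InMemoryStore {
  readonly tasks = new Map<string, TaskRow>();
  readonly queue: Job[] = [];
  private nextId = 1;

  allocateId(): string {
    const id = `task-${this.nextId}`;
    this.nextId += 1;
    return id;
  }
}

// Validates, persists a cheap row, enqueues background work, and returns.
// It never waits on provider calls or sandbox provisioning.
function createTask(
  store: InMemoryStore,
  input: { repoId: string; title: string },
): TaskRow {
  const title = input.title.trim();
  if (input.repoId === "" || title === "") {
    throw new Error("repoId and title are required");
  }
  const row: TaskRow = {
    id: store.allocateId(),
    repoId: input.repoId,
    title,
    status: "queued",
  };
  store.tasks.set(row.id, row); // cheap mutation
  store.queue.push({ kind: "provisionTask", taskId: row.id }); // background work
  return row; // the caller renders "queued" and waits for events
}
```

The returned row carries a `queued` status, so the UI can render immediately and pick up later status changes from actor events.
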
### View-model rule

- App shell view connects to app/session state and only the org actors visible on screen.
- Organization/task-list view connects to an organization-owned summary projection.
- Task detail view connects directly to the selected task actor.
- Sandbox/session detail connects only when the user opens that detail.

Do not replace one monolith with one connection per row. List screens should still come from actor-owned projections.

### Runtime rule

Stopping actors must stop accepting new work and must not clear actor resources until active actions and requests have drained or been cancelled.

## Workstreams

### 1. Runtime hardening first

This is the only workstream that is not Foundry-only. It should start immediately because it is the only direct fix for the `c.db` shutdown race.

#### Changes

1. Add active action/request accounting in RivetKit actor instances.
2. Mark actors as draining before cleanup starts.
3. Reject or reroute new requests/actions once draining begins.
4. Wait for active actions to finish or abort before `#cleanupDatabase()` runs.
5. Delay clearing `#db` until no active actions remain.
6. Add actor stop logs with:
   - actor id
   - active action count
   - active request count
   - drain start/end timestamps
   - cleanup start/end timestamps

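
The accounting and drain steps above could look roughly like the sketch below. `ActionGate` and its methods are hypothetical and are not RivetKit's API; the sketch only shows the ordering guarantee: new work is rejected once draining begins, and the cleanup callback (standing in for `#cleanupDatabase()`) runs only after the active count reaches zero.

```typescript
// Hypothetical sketch; these names are not part of RivetKit.
type DrainState = "running" | "draining" | "stopped";

class ActionGate {
  private state: DrainState = "running";
  private active = 0;
  private pendingCleanup: (() => void) | null = null;

  get activeCount(): number {
    return this.active;
  }

  get currentState(): DrainState {
    return this.state;
  }

  // Call before an action enters user code. Returns false once draining began.
  tryEnter(): boolean {
    if (this.state !== "running") return false;
    this.active += 1;
    return true;
  }

  // Call from a finally block when the action finishes or aborts.
  exit(): void {
    if (this.active === 0) {
      throw new Error("exit() called without a matching tryEnter()");
    }
    this.active -= 1;
    this.finishIfDrained();
  }

  // Marks the actor as draining. Cleanup is deferred until no work is active.
  beginDrain(cleanup: () => void): void {
    if (this.state !== "running") return;
    this.state = "draining";
    this.pendingCleanup = cleanup;
    this.finishIfDrained();
  }

  private finishIfDrained(): void {
    if (this.state !== "draining" || this.active !== 0) return;
    const cleanup = this.pendingCleanup;
    this.pendingCleanup = null;
    this.state = "stopped";
    if (cleanup) cleanup();
  }
}
```

A caller would wrap each action as `if (!gate.tryEnter()) reject; try { run } finally { gate.exit() }`. A real implementation would also need an abort or timeout path so a stuck action cannot block shutdown forever.
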
#### Acceptance criteria

- No action can successfully enter user code after actor draining begins.
- `Database not enabled` cannot be produced by an in-flight action after stop has begun.
- Stop logs make it obvious whether shutdown delay is run-handler time, active-action drain time, background promise time, or routing delay.

### 2. App shell moves from snapshot polling to subscriptions

The app shell should stop using `/app/snapshot` as the steady-state read model.

#### Changes

1. Introduce a small app-shell projection owned by the app organization actor:
   - auth status
   - current user summary
   - active org id
   - visible org ids
   - per-org lightweight status summary
2. Add app actor events, for example:
   - `appSessionUpdated`
   - `activeOrganizationChanged`
   - `organizationSyncStatusChanged`
3. Expose connection helpers from the backend client for:
   - app actor subscription
   - organization actor subscription by id
4. Update `RemoteFoundryAppStore` so it:
   - does one bootstrap fetch on first subscribe
   - connects to the app actor for ongoing updates
   - connects only to the org actors needed for the current view
   - disposes org subscriptions when they are no longer visible
5. Remove `scheduleSyncPollingIfNeeded()` and the `500ms` refresh loop.

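
One possible shape for the projection and its events is sketched below. The event names come from the list above; the field names, payload shapes, and the `applyAppShellEvent` reducer are assumptions for illustration. The idea is that the store bootstraps once, then folds each actor event into local state instead of refetching a snapshot.

```typescript
// Assumed shapes; only the three event names come from the plan itself.
type OrgSyncStatus = "idle" | "syncing" | "error";
type AuthStatus = "signedIn" | "signedOut";

interface UserSummary {
  id: string;
  name: string;
}

interface AppShellProjection {
  authStatus: AuthStatus;
  currentUser: UserSummary | null;
  activeOrgId: string | null;
  visibleOrgIds: string[];
  orgStatus: Record<string, OrgSyncStatus>;
}

type AppShellEvent =
  | { type: "appSessionUpdated"; authStatus: AuthStatus; currentUser: UserSummary | null }
  | { type: "activeOrganizationChanged"; organizationId: string | null }
  | { type: "organizationSyncStatusChanged"; organizationId: string; status: OrgSyncStatus };

// Pure reducer: returns a new projection and leaves the input untouched.
function applyAppShellEvent(
  state: AppShellProjection,
  event: AppShellEvent,
): AppShellProjection {
  switch (event.type) {
    case "appSessionUpdated":
      return { ...state, authStatus: event.authStatus, currentUser: event.currentUser };
    case "activeOrganizationChanged":
      return { ...state, activeOrgId: event.organizationId };
    case "organizationSyncStatusChanged":
      return {
        ...state,
        orgStatus: { ...state.orgStatus, [event.organizationId]: event.status },
      };
  }
}
```

With payload-bearing events like these, a sync status change for one org updates one entry in `orgStatus` and never touches the other organization actors.
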
#### Likely files

- `foundry/packages/backend/src/actors/organization/app-shell.ts`
- `foundry/packages/client/src/backend-client.ts`
- `foundry/packages/client/src/remote/app-client.ts`
- `foundry/packages/shared/src/app-shell.ts`
- app shell frontend consumers

#### Acceptance criteria

- No app shell polling loop remains.
- Selecting an org returns quickly and the UI updates from actor events.
- App shell refresh cost is bounded by visible state, not every eligible organization on every poll.

### 3. Organization summary becomes a projection, not a full snapshot

The task list should read an organization-owned summary projection instead of calling into every task actor on each refresh.

#### Changes

1. Define a durable organization summary model with only list-screen fields:
   - repo summary
   - repository summary
   - task summary
   - selected/open task ids
   - unread/session status summary
   - coarse git/PR state summary
2. Update organization actor workflows so task/repository changes incrementally update this projection.
3. Change `getWorkbench` to return the projection only.
4. Change `workbenchUpdated` from "invalidate and refetch everything" to "here is the updated projection version or changed entity ids".
5. Remove task-actor fan-out from the default list read path.

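
A sketch of how a client could apply a versioned `workbenchUpdated` event without refetching. The payload shape (`version`, `upserted`, `removedIds`) and the `applyWorkbenchUpdated` helper are assumptions; the plan only says the event should carry a projection version or changed entity ids. This variant carries the changed summaries inline and falls back to one bounded bootstrap fetch only when a version gap shows an event was missed.

```typescript
// Assumed payload shape; not the actual Foundry event contract.
interface TaskSummary {
  id: string;
  title: string;
  status: string;
}

interface OrgSummary {
  version: number;
  tasks: Map<string, TaskSummary>;
}

interface WorkbenchUpdated {
  version: number; // projection version after this change
  upserted: TaskSummary[];
  removedIds: string[];
}

type ApplyResult = "applied" | "stale" | "gap";

// Mutates the local summary in place when the event is the next version.
function applyWorkbenchUpdated(
  summary: OrgSummary,
  event: WorkbenchUpdated,
): ApplyResult {
  if (event.version <= summary.version) {
    return "stale"; // duplicate or out-of-order delivery: ignore
  }
  if (event.version !== summary.version + 1) {
    return "gap"; // missed an event: caller does one bootstrap fetch
  }
  for (const task of event.upserted) {
    summary.tasks.set(task.id, task);
  }
  for (const id of event.removedIds) {
    summary.tasks.delete(id);
  }
  summary.version = event.version;
  return "applied";
}
```

Only the `gap` result should trigger a refetch, which keeps the steady state free of cross-actor rebuilds.
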
#### Likely files

- `foundry/packages/backend/src/actors/organization/actions.ts`
- `foundry/packages/backend/src/actors/repository/actions.ts`
- `foundry/packages/backend/src/actors/task/index.ts`
- `foundry/packages/backend/src/actors/task/workbench.ts`
- task/organization DB schema and migrations
- `foundry/packages/client/src/remote/workbench-client.ts`

#### Acceptance criteria

- Workbench list refresh does not call every task actor.
- A websocket event does not force a full cross-actor rebuild.
- Initial task-list load time scales roughly with organization summary size, not repo count times task count times detail reads.

### 4. Task detail moves to direct actor reads and events

Heavy task detail should move out of the organization summary and into the selected task actor.

#### Changes

1. Split task detail into focused reads/subscriptions:
   - task header/meta
   - tabs/session summary
   - transcript stream
   - diff/file tree
   - sandbox process state
2. Open a task actor connection only for the selected task.
3. Open sandbox/session subscriptions only for the active tab/pane.
4. Dispose those subscriptions when the user changes selection.
5. Keep expensive derived state cached in actor-owned tables and update it from background jobs or event ingestion.

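
The selection-scoped connection in steps 2 and 4 can be sketched as a small holder that owns at most one task connection. `TaskSelection` and the `OpenTaskConnection` callback are hypothetical; the callback stands in for whatever helper the backend client ends up exposing.

```typescript
// Hypothetical helper; the open callback is a stand-in for the real client API.
type DisposeConnection = () => void;
type OpenTaskConnection = (taskId: string) => DisposeConnection;

class TaskSelection {
  private readonly open: OpenTaskConnection;
  private current: { taskId: string; dispose: DisposeConnection } | null = null;

  constructor(open: OpenTaskConnection) {
    this.open = open;
  }

  get selectedTaskId(): string | null {
    return this.current ? this.current.taskId : null;
  }

  // Switches the single live task connection to the newly selected task.
  // Passing null (for example when returning to the list) closes it.
  select(taskId: string | null): void {
    if (this.current !== null && this.current.taskId === taskId) {
      return; // same task: keep the existing connection
    }
    if (this.current !== null) {
      this.current.dispose(); // dispose before opening the next one
      this.current = null;
    }
    if (taskId !== null) {
      this.current = { taskId, dispose: this.open(taskId) };
    }
  }
}
```

The same pattern could apply one level down for sandbox/session subscriptions keyed by the active tab or pane.
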
#### Acceptance criteria

- Opening the task list does not open connections to every task actor.
- Opening a task shows staged loading for heavy panes instead of blocking the whole workbench snapshot.
- Transcript, diff, and file-tree reads are not recomputed for unrelated tasks.

### 5. Finish moving long-running mutations to background workflows

This extends and completes the existing async-action briefs in this folder.

#### Existing briefs to implement under this workstream

1. `01-task-creation-bootstrap-only.md`
2. `02-repo-overview-from-cached-projection.md`
3. `03-repo-actions-via-background-workflow.md`
4. `04-workbench-session-creation-without-inline-provisioning.md`
5. `05-workbench-snapshot-from-derived-state.md`
6. `06-daytona-provisioning-staged-background-flow.md`

#### Additional rule

Every workflow-backed mutation should leave behind durable status rows or events that realtime clients can observe without polling.

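
As an illustration of the additional rule, a status row could be advanced through a small state machine that emits every change. The `WorkflowStatusRow` shape, the status names, and `transition` are assumptions, not an existing Foundry schema; the point is that each step writes a row and emits it, so clients observe progress from events.

```typescript
// Assumed row shape and statuses, for illustration only.
type WorkflowStatus = "queued" | "running" | "succeeded" | "failed";

interface WorkflowStatusRow {
  workflowId: string;
  kind: string;
  status: WorkflowStatus;
  updatedAt: number;
  error: string | null;
}

const ALLOWED_TRANSITIONS: Record<WorkflowStatus, readonly WorkflowStatus[]> = {
  queued: ["running", "failed"],
  running: ["succeeded", "failed"],
  succeeded: [],
  failed: [],
};

// Returns the updated row and emits it so subscribers see the change.
// The caller is expected to persist the returned row durably.
function transition(
  row: WorkflowStatusRow,
  next: WorkflowStatus,
  now: number,
  emit: (row: WorkflowStatusRow) => void,
  error: string | null = null,
): WorkflowStatusRow {
  if (!ALLOWED_TRANSITIONS[row.status].includes(next)) {
    throw new Error(`illegal transition ${row.status} -> ${next}`);
  }
  const updated: WorkflowStatusRow = { ...row, status: next, updatedAt: now, error };
  emit(updated);
  return updated;
}
```
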
### 6. Subscription lifecycle and reconnect behavior need one shared model

The current client-side connection pattern is too ad hoc. It needs a single lifecycle policy so sockets are long-lived and bounded.

#### Changes

1. Create one shared subscription manager in the client for:
   - reference counting
   - connection reuse
   - reconnect backoff
   - connection state events
   - clean disposal
2. Make invalidation optional. Prefer payload-bearing events or projection version updates.
3. Add structured logs/metrics in the client for:
   - connection created/disposed
   - reconnect attempts
   - subscription count per actor key
   - refresh triggered by event vs bootstrap vs mutation
4. Stop calling full `refresh()` after every mutation when the mutation result or follow-up event already contains enough state to update locally.

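
A minimal sketch of the reference-counting and backoff pieces of such a manager. `SubscriptionManager`, `acquire`, and `reconnectDelayMs` are hypothetical names, and the 250 ms base and 30 s cap are placeholder values rather than tuned numbers. Connection state events and the actual reconnect loop are left out.

```typescript
// Hypothetical sketch; names and default values are assumptions.
type Release = () => void;

class SubscriptionManager {
  private readonly connect: (key: string) => Release;
  private readonly entries = new Map<string, { refs: number; close: Release }>();

  constructor(connect: (key: string) => Release) {
    this.connect = connect;
  }

  // Reuses the connection for a key if one exists; opens it otherwise.
  // The returned release function is idempotent.
  acquire(key: string): Release {
    const existing = this.entries.get(key);
    const entry = existing ?? { refs: 0, close: this.connect(key) };
    if (!existing) this.entries.set(key, entry);
    entry.refs += 1;

    let released = false;
    return () => {
      if (released) return;
      released = true;
      entry.refs -= 1;
      if (entry.refs === 0) {
        this.entries.delete(key);
        entry.close(); // last subscriber gone: close the socket
      }
    };
  }

  refCount(key: string): number {
    return this.entries.get(key)?.refs ?? 0;
  }
}

// Exponential backoff with a cap. attempt is 0 for the first retry.
function reconnectDelayMs(attempt: number, baseMs = 250, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

A production version would likely add random jitter to the delay so that many clients do not reconnect in lockstep after a backend restart.
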
#### Acceptance criteria

- Idle screens maintain stable websocket counts.
- Transient socket failures do not create refresh storms.
- The client can explain why any given refresh happened.

### 7. Clean up HTTP surface after realtime migration

Do not delete bootstrap endpoints first. Shrink them after the subscription model is working.

#### Changes

1. Keep one-shot bootstrap/read endpoints only where they still add value:
   - initial app load
   - initial workbench load
   - deep-link fallback
2. Remove or de-emphasize monolithic snapshot endpoints for steady-state use.
3. Keep HTTP for control-plane and external integrations.

#### Acceptance criteria

- Main interactive screens do not depend on polling.
- Snapshot endpoints are bootstrap/fallback paths, not the primary UI contract.

## Suggested Implementation Order

1. Runtime hardening in RivetKit
2. `01-task-creation-bootstrap-only.md`
3. `03-repo-actions-via-background-workflow.md`
4. `06-daytona-provisioning-staged-background-flow.md`
5. App shell realtime subscription model
6. `02-repo-overview-from-cached-projection.md`
7. Organization summary projection
8. `04-workbench-session-creation-without-inline-provisioning.md`
9. `05-workbench-snapshot-from-derived-state.md`
10. Task-detail direct actor reads/subscriptions
11. Client subscription lifecycle cleanup
12. `07-auth-identity-simplification.md`

## Why This Order

- Runtime hardening removes the most dangerous correctness bug before more UI load shifts onto actor connections.
- The first async workflow items reduce the biggest user-visible stalls quickly.
- App shell realtime is smaller and lower-risk than the workbench migration, and it removes the current polling loop.
- Organization summary and task-detail split should happen after the async workflow moves so the projection model does not encode old synchronous assumptions.
- Auth simplification is valuable but not required to remove the current refresh/polling/runtime problems.

## Observability Requirements

Before or alongside implementation, add metrics/logs for:

- app snapshot bootstrap duration
- workbench bootstrap duration
- actor connection count by actor type and view
- reconnect count by actor key
- projection rebuild/update duration
- workflow queue latency
- actor drain duration and active-action counts during stop

Each log line should include a request id or actor/event correlation id where possible.

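
Duration metrics with a correlation id could be captured by a small wrapper like the one below. `timed`, `MetricRecord`, and the metric name used in the usage note are hypothetical; the clock and sink are injected so the wrapper stays independent of any particular logging library.

```typescript
// Hypothetical helper; record shape and metric names are assumptions.
interface MetricRecord {
  name: string;
  correlationId: string;
  durationMs: number;
  ok: boolean;
}

// Runs fn, then reports how long it took and whether it threw.
// The record is emitted from finally, so failures are measured too.
function timed<T>(
  name: string,
  correlationId: string,
  now: () => number,
  sink: (record: MetricRecord) => void,
  fn: () => T,
): T {
  const start = now();
  let ok = false;
  try {
    const result = fn();
    ok = true;
    return result;
  } finally {
    sink({ name, correlationId, durationMs: now() - start, ok });
  }
}
```

For example, wrapping the workbench bootstrap as `timed("workbench.bootstrap", requestId, Date.now, sink, loadWorkbench)` would emit one record per load, where `requestId`, `sink`, and `loadWorkbench` are supplied by the caller. Async work would need a variant that awaits the returned promise before reporting.
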
## Rollout Strategy

1. Ship runtime hardening and observability first.
2. Ship app-shell realtime behind a client flag while keeping snapshot bootstrap.
3. Ship organization summary projection behind a separate flag.
4. Migrate one heavy detail pane at a time off the monolithic workbench payload.
5. Remove polling once the matching event path is proven stable.
6. Only then remove or demote the old snapshot-heavy steady-state flows.

## Done Means

This initiative is done when all of the following are true:

- no user-visible screen depends on `500ms` polling
- no list view recomputes deep task/session/diff state inline on every refresh
- long-running repo/provider/sandbox work always runs in durable background workflows
- the client connects only to actors relevant to the current view and disposes them when the view changes
- websocket counts stay stable on idle screens
- actor shutdown cannot invalidate `c.db` underneath active actions