End-To-End Async + Realtime Plan
Purpose
This is the umbrella plan for the Foundry issues we traced across app shell, workbench, and actor runtime behavior:
- long-running work still sits inline in request/action paths
- monolithic snapshot reads fan out across too many actors
- the client uses polling and full refreshes where it should use realtime subscriptions
- websocket subscriptions reconnect too aggressively
- actor shutdown can race in-flight actions and clear `c.db` underneath them
The goal is not just to make individual endpoints faster. The goal is to move Foundry to a model where:
- request paths only validate, create minimal state, and enqueue background work
- list views read actor-owned projections instead of recomputing deep state
- detail views connect directly to the actor that owns the visible state
- polling is replaced by actor events and bounded bootstrap fetches
- actor shutdown drains active work before cleaning up resources
Problem Summary
App shell
- `getAppSnapshot` still rebuilds app shell state by reading the app session row and fanning out to every eligible organization actor.
- `RemoteFoundryAppStore` still polls every `500ms` while any org is `syncing`.
- Org sync/import is now off the select path, but the steady-state read path is still snapshot-based instead of subscription-based.
Workbench
- `getWorkbench` still represents a monolithic workspace read that aggregates repo, project, and task state.
- The remote workbench store still responds to every event by pulling a full fresh snapshot.
- Some task/workbench detail is still too expensive to compute inline and too broad to refresh after every mutation.
Realtime transport
- `subscribeWorkbench` and related connection helpers keep one connection per shared key, but the client contract still treats the socket as an invalidation channel for a later snapshot pull.
- Reconnect/error handling is weak, so connection churn amplifies backend load instead of settling into long-lived subscriptions.
Runtime
- RivetKit currently lets shutdown proceed far enough to clean up actor resources while actions can still be in flight or still be routed to the actor.
- That creates the `Database not enabled` / missing `c.db` failure mode under stop/replay pressure.
Target Architecture
Request-path rule
Every request/action should do only one of these:
- return actor-owned cached state
- persist a cheap mutation
- enqueue or signal background work
Requests should not block on provider calls, repo sync, sandbox provisioning, transcript enumeration, or deep cross-actor fan-out unless the UI cannot render at all without the result.
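The request-path rule can be sketched as follows. All names here (`createTask`, `backgroundQueue`, the `TaskRow` shape) are illustrative assumptions, not Foundry's real API; the point is the shape of the handler: validate, write a cheap row, enqueue, return.

```typescript
// Hypothetical sketch of the request-path rule: validate, persist a cheap
// row, enqueue background work, and return immediately.

type TaskRow = { id: string; status: "queued" | "running" | "done" };

const tasks = new Map<string, TaskRow>();
const backgroundQueue: Array<() => void> = [];

// Request path: no provider calls, no repo sync, no sandbox provisioning inline.
function createTask(id: string): TaskRow {
  if (id.length === 0) throw new Error("task id required"); // validate only
  if (tasks.has(id)) throw new Error(`task ${id} already exists`);
  const row: TaskRow = { id, status: "queued" };            // cheap mutation
  tasks.set(id, row);
  backgroundQueue.push(() => {                              // enqueue, don't block
    row.status = "running";
    // ...expensive repo sync / provisioning would run here...
    row.status = "done";
  });
  return row;                                               // returns while work is queued
}

// A worker loop (simulated) drains the queue outside the request path.
function drainQueue(): void {
  while (backgroundQueue.length > 0) backgroundQueue.shift()!();
}
```

The caller gets a `queued` row back without waiting for the expensive work; a durable queue and status events would replace the in-memory array in a real implementation.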
View-model rule
- App shell view connects to app/session state and only the org actors visible on screen.
- Workspace/task-list view connects to a workspace-owned summary projection.
- Task detail view connects directly to the selected task actor.
- Sandbox/session detail connects only when the user opens that detail.
Do not replace one monolith with one connection per row. List screens should still come from actor-owned projections.
Runtime rule
Stopping actors must stop accepting new work and must not clear actor resources until active actions and requests have drained or been cancelled.
Workstreams
1. Runtime hardening first
This is the only workstream that is not Foundry-only. It should start immediately because it is the only direct fix for the `c.db` shutdown race.
Changes
- Add active action/request accounting in RivetKit actor instances.
- Mark actors as draining before cleanup starts.
- Reject or reroute new requests/actions once draining begins.
- Wait for active actions to finish or abort before `#cleanupDatabase()` runs.
- Delay clearing `#db` until no active actions remain.
- Add actor stop logs with:
- actor id
- active action count
- active request count
- drain start/end timestamps
- cleanup start/end timestamps
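The drain-accounting changes above can be sketched like this. This is a minimal model with assumed names (`DrainingActor`, `runAction`, `hasDb`), not RivetKit's real internals: in-flight actions are counted, a draining flag is flipped before cleanup, new actions are rejected once draining, and resources are cleared only at zero.

```typescript
// Sketch of drain-aware actor shutdown (names are assumptions, not RivetKit API).
class DrainingActor {
  private active = 0;
  private draining = false;
  private db: object | null = {};
  private onIdle: Array<() => void> = [];

  async runAction<T>(fn: () => Promise<T>): Promise<T> {
    if (this.draining) throw new Error("actor is draining; action rejected");
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      if (this.active === 0) this.onIdle.splice(0).forEach((resolve) => resolve());
    }
  }

  async stop(): Promise<void> {
    this.draining = true;                  // stop accepting new work first
    if (this.active > 0) {
      // drain: wait for in-flight actions before touching resources
      await new Promise<void>((resolve) => this.onIdle.push(resolve));
    }
    this.db = null;                        // clear resources only after drain
  }

  get hasDb(): boolean {
    return this.db !== null;
  }
}
```

An in-flight action started before `stop()` still sees a live `db` for its whole lifetime, which is exactly the invariant the acceptance criteria below demand.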
Acceptance criteria
- No action can successfully enter user code after actor draining begins.
- `Database not enabled` cannot be produced by an in-flight action after stop has begun.
- Stop logs make it obvious whether shutdown delay is run-handler time, active-action drain time, background promise time, or routing delay.
2. App shell moves from snapshot polling to subscriptions
The app shell should stop using `/app/snapshot` as the steady-state read model.
Changes
- Introduce a small app-shell projection owned by the app workspace actor:
- auth status
- current user summary
- active org id
- visible org ids
- per-org lightweight status summary
- Add app actor events, for example:
- `appSessionUpdated`
- `activeOrganizationChanged`
- `organizationSyncStatusChanged`
- Expose connection helpers from the backend client for:
- app actor subscription
- organization actor subscription by id
- Update `RemoteFoundryAppStore` so it:
- does one bootstrap fetch on first subscribe
- connects to the app actor for ongoing updates
- connects only to the org actors needed for the current view
- disposes org subscriptions when they are no longer visible
- Remove `scheduleSyncPollingIfNeeded()` and the `500ms` refresh loop.
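The intended client contract can be sketched as a store that bootstraps once and then applies payload-bearing events locally. The event names come from the list above; the store shape and handler signatures are assumptions for illustration.

```typescript
// Sketch of an event-driven app-shell store: one bootstrap, no polling loop.
type OrgStatus = { id: string; syncing: boolean };
type AppShellState = { activeOrgId: string | null; orgs: Map<string, OrgStatus> };

class AppShellStore {
  state: AppShellState = { activeOrgId: null, orgs: new Map() };
  bootstrapCount = 0;

  // Called once on first subscribe, never on a timer.
  bootstrap(initial: { activeOrgId: string | null; orgs: OrgStatus[] }): void {
    this.bootstrapCount++;
    this.state.activeOrgId = initial.activeOrgId;
    this.state.orgs = new Map(initial.orgs.map((o) => [o.id, o]));
  }

  // Event handlers patch state in place; no snapshot refetch.
  onActiveOrganizationChanged(orgId: string): void {
    this.state.activeOrgId = orgId;
  }

  onOrganizationSyncStatusChanged(orgId: string, syncing: boolean): void {
    const org = this.state.orgs.get(orgId);
    if (org) org.syncing = syncing;
  }
}
```

Because each event carries its payload, the store never needs the `500ms` refresh loop: state converges from the bootstrap plus applied events.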
Likely files
- `foundry/packages/backend/src/actors/workspace/app-shell.ts`
- `foundry/packages/client/src/backend-client.ts`
- `foundry/packages/client/src/remote/app-client.ts`
- `foundry/packages/shared/src/app-shell.ts`
- app shell frontend consumers
Acceptance criteria
- No app shell polling loop remains.
- Selecting an org returns quickly and the UI updates from actor events.
- App shell refresh cost is bounded by visible state, not every eligible organization on every poll.
3. Workspace summary becomes a projection, not a full snapshot
The task list should read a workspace-owned summary projection instead of calling into every task actor on each refresh.
Changes
- Define a durable workspace summary model with only list-screen fields:
- repo summary
- project summary
- task summary
- selected/open task ids
- unread/session status summary
- coarse git/PR state summary
- Update workspace actor workflows so task/project changes incrementally update this projection.
- Change `getWorkbench` to return the projection only.
- Change `workbenchUpdated` from "invalidate and refetch everything" to "here is the updated projection version or changed entity ids".
- Remove task-actor fan-out from the default list read path.
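A minimal sketch of the projection model, assuming a much-reduced summary shape (the real model would carry the repo/project/git fields listed above): mutations patch the summary in place and bump a version, and `getWorkbench` just returns it without touching any task actor.

```typescript
// Illustrative workspace summary projection (field names are assumptions).
type TaskSummary = { id: string; title: string; open: boolean };
type WorkspaceSummary = { version: number; tasks: Map<string, TaskSummary> };

const summary: WorkspaceSummary = { version: 0, tasks: new Map() };

// Each task change patches the projection and bumps its version, so clients
// can diff cheaply instead of refetching everything.
function applyTaskChange(task: TaskSummary): number {
  summary.tasks.set(task.id, task);
  summary.version++;
  return summary.version;
}

// getWorkbench under this model: return the projection, call no task actors.
function getWorkbench(): WorkspaceSummary {
  return summary;
}
```

A `workbenchUpdated` event would then carry the new version (or changed ids), letting clients skip refreshes they have already applied.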
Likely files
- `foundry/packages/backend/src/actors/workspace/actions.ts`
- `foundry/packages/backend/src/actors/project/actions.ts`
- `foundry/packages/backend/src/actors/task/index.ts`
- `foundry/packages/backend/src/actors/task/workbench.ts`
- task/workspace DB schema and migrations
- `foundry/packages/client/src/remote/workbench-client.ts`
Acceptance criteria
- Workbench list refresh does not call every task actor.
- A websocket event does not force a full cross-actor rebuild.
- Initial task-list load time scales roughly with workspace summary size, not repo count times task count times detail reads.
4. Task detail moves to direct actor reads and events
Heavy task detail should move out of the workspace summary and into the selected task actor.
Changes
- Split task detail into focused reads/subscriptions:
- task header/meta
- tabs/session summary
- transcript stream
- diff/file tree
- sandbox process state
- Open a task actor connection only for the selected task.
- Open sandbox/session subscriptions only for the active tab/pane.
- Dispose those subscriptions when the user changes selection.
- Keep expensive derived state cached in actor-owned tables and update it from background jobs or event ingestion.
Acceptance criteria
- Opening the task list does not open connections to every task actor.
- Opening a task shows staged loading for heavy panes instead of blocking the whole workbench snapshot.
- Transcript, diff, and file-tree reads are not recomputed for unrelated tasks.
5. Finish moving long-running mutations to background workflows
This extends and completes the existing async-action briefs in this folder.
Existing briefs to implement under this workstream
- `01-task-creation-bootstrap-only.md`
- `02-repo-overview-from-cached-projection.md`
- `03-repo-actions-via-background-workflow.md`
- `04-workbench-session-creation-without-inline-provisioning.md`
- `05-workbench-snapshot-from-derived-state.md`
- `06-daytona-provisioning-staged-background-flow.md`
Additional rule
Every workflow-backed mutation should leave behind durable status rows or events that realtime clients can observe without polling.
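For example, a workflow step recorder might look like this sketch (row shape and names are assumptions): each step writes a durable row for bootstrap reads and emits the same payload as an event for connected clients.

```typescript
// Hypothetical durable status row left behind by a workflow-backed mutation.
type WorkflowStatusRow = {
  workflowId: string;
  step: "queued" | "provisioning" | "ready" | "failed";
  updatedAt: number;
};

const statusRows = new Map<string, WorkflowStatusRow>(); // stand-in for a DB table
const emitted: WorkflowStatusRow[] = [];                  // stand-in for an event bus

function recordStep(workflowId: string, step: WorkflowStatusRow["step"]): void {
  const row: WorkflowStatusRow = { workflowId, step, updatedAt: Date.now() };
  statusRows.set(workflowId, row);  // durable row: late subscribers bootstrap from it
  emitted.push(row);                // event: live subscribers observe it immediately
}
```

A client that connects mid-workflow reads the latest row once, then follows events, so no polling path is needed.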
6. Subscription lifecycle and reconnect behavior need one shared model
The current client-side connection pattern is too ad hoc. It needs a single lifecycle policy so sockets are long-lived and bounded.
Changes
- Create one shared subscription manager in the client for:
- reference counting
- connection reuse
- reconnect backoff
- connection state events
- clean disposal
- Make invalidation optional. Prefer payload-bearing events or projection version updates.
- Add structured logs/metrics in the client for:
- connection created/disposed
- reconnect attempts
- subscription count per actor key
- refresh triggered by event vs bootstrap vs mutation
- Stop calling full `refresh()` after every mutation when the mutation result or follow-up event already contains enough state to update locally.
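The shared manager could look like this sketch (all names assumed): connections are reference-counted per actor key and reused, disposal is idempotent, and reconnects use capped exponential backoff.

```typescript
// Sketch of a shared, reference-counted subscription manager.
class SubscriptionManager {
  private refs = new Map<string, number>();
  connectionsOpened = 0;

  subscribe(actorKey: string): () => void {
    const count = this.refs.get(actorKey) ?? 0;
    if (count === 0) this.connectionsOpened++;   // first subscriber opens the socket
    this.refs.set(actorKey, count + 1);
    let disposed = false;
    return () => {                               // clean, idempotent disposal
      if (disposed) return;
      disposed = true;
      const remaining = (this.refs.get(actorKey) ?? 1) - 1;
      if (remaining === 0) this.refs.delete(actorKey); // last subscriber closes it
      else this.refs.set(actorKey, remaining);
    };
  }

  refCount(actorKey: string): number {
    return this.refs.get(actorKey) ?? 0;
  }

  // Capped exponential backoff for reconnect attempt n (0-based).
  backoffMs(attempt: number, baseMs = 250, capMs = 10_000): number {
    return Math.min(capMs, baseMs * 2 ** attempt);
  }
}
```

With one manager owning all sockets, the structured logs listed above (created/disposed, reconnect attempts, per-key subscription counts) have a single natural home.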
Acceptance criteria
- Idle screens maintain stable websocket counts.
- Transient socket failures do not create refresh storms.
- The client can explain why any given refresh happened.
7. Clean up HTTP surface after realtime migration
Do not delete bootstrap endpoints first. Shrink them after the subscription model is working.
Changes
- Keep one-shot bootstrap/read endpoints only where they still add value:
- initial app load
- initial workbench load
- deep-link fallback
- Remove or de-emphasize monolithic snapshot endpoints for steady-state use.
- Keep HTTP for control-plane and external integrations.
Acceptance criteria
- Main interactive screens do not depend on polling.
- Snapshot endpoints are bootstrap/fallback paths, not the primary UI contract.
Suggested Implementation Order
- Runtime hardening in RivetKit
- `01-task-creation-bootstrap-only.md`
- `03-repo-actions-via-background-workflow.md`
- `06-daytona-provisioning-staged-background-flow.md`
- App shell realtime subscription model
- `02-repo-overview-from-cached-projection.md`
- Workspace summary projection
- `04-workbench-session-creation-without-inline-provisioning.md`
- `05-workbench-snapshot-from-derived-state.md`
- Task-detail direct actor reads/subscriptions
- Client subscription lifecycle cleanup
- `07-auth-identity-simplification.md`
Why This Order
- Runtime hardening removes the most dangerous correctness bug before more UI load shifts onto actor connections.
- The first async workflow items reduce the biggest user-visible stalls quickly.
- App shell realtime is smaller and lower-risk than the workbench migration, and it removes the current polling loop.
- Workspace summary and task-detail split should happen after the async workflow moves so the projection model does not encode old synchronous assumptions.
- Auth simplification is valuable but not required to remove the current refresh/polling/runtime problems.
Observability Requirements
Before or alongside implementation, add metrics/logs for:
- app snapshot bootstrap duration
- workbench bootstrap duration
- actor connection count by actor type and view
- reconnect count by actor key
- projection rebuild/update duration
- workflow queue latency
- actor drain duration and active-action counts during stop
Each log line should include a request id or actor/event correlation id where possible.
Rollout Strategy
- Ship runtime hardening and observability first.
- Ship app-shell realtime behind a client flag while keeping snapshot bootstrap.
- Ship workspace summary projection behind a separate flag.
- Migrate one heavy detail pane at a time off the monolithic workbench payload.
- Remove polling once the matching event path is proven stable.
- Only then remove or demote the old snapshot-heavy steady-state flows.
Done Means
This initiative is done when all of the following are true:
- no user-visible screen depends on `500ms` polling
- no list view recomputes deep task/session/diff state inline on every refresh
- long-running repo/provider/sandbox work always runs in durable background workflows
- the client connects only to actors relevant to the current view and disposes them when the view changes
- websocket counts stay stable on idle screens
- actor shutdown cannot invalidate `c.db` underneath active actions