feat(foundry): task owner git auth + manual owner change UI (#263)

* Add task owner git auth proposal and sandbox architecture docs

- Add proposal for primary user per task with OAuth token injection
  for sandbox git operations (.context/proposal-task-owner-git-auth.md)
- Document sandbox architecture constraints in CLAUDE.md: single sandbox
  per task assumption, OAuth token security implications, git auto-auth
  requirement, and git error surfacing rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add proposals for reverting to queues and rivetkit sandbox resilience

- proposal-revert-actions-to-queues.md: Detailed plan for reverting the
  actions-only pattern back to queues/workflows now that the RivetKit
  queue.iter() bug is fixed. Lists what to keep (lazy tasks, resolveTaskRepoId,
  sync override threading, E2B fixes, frontend fixes) vs what to revert
  (communication pattern only).

- proposal-rivetkit-sandbox-resilience.md: Rivetkit sandbox actor changes for
  handling destroyed/paused sandboxes, keep-alive, and the UNIQUE constraint
  crash fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(foundry): add manual task owner change via UI dropdown

Add an owner dropdown to the Overview tab that lets users reassign task
ownership to any organization member. The owner's GitHub credentials are
used for git operations in the sandbox.

Full-stack implementation:
- Backend: changeTaskOwnerManually action on task actor, routed through
  org actor's changeWorkspaceTaskOwner action, with primaryUser schema
  columns on both task and org index tables
- Client: changeOwner method on workspace client (mock + remote)
- Frontend: owner dropdown in right sidebar Overview tab showing org
  members, with avatar and role display
- Shared: TaskWorkspaceChangeOwnerInput type and primaryUser fields on
  workspace snapshot types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Nathan Flurry 2026-03-16 17:05:11 -07:00 committed by GitHub
parent 167712ace7
commit 4111aebfce
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
22 changed files with 1114 additions and 11 deletions

View file

@ -0,0 +1,165 @@
# Proposal: Revert Actions-Only Pattern Back to Queues/Workflows
## Background
We converted all actors from queue/workflow-based communication to direct actions as a workaround for a RivetKit bug where `c.queue.iter()` deadlocked for actors created from another actor's context. That bug has since been fixed in RivetKit. We want to revert to queues/workflows because they provide better observability (workflow history in the inspector), replay/recovery semantics, and are the idiomatic RivetKit pattern.
## Reference branches
- **`main`** at commit `32f3c6c3` — the original queue/workflow code BEFORE the actions refactor
- **`queues-to-actions`** — the current refactored code using direct actions
- **`task-owner-git-auth`** at commit `f45a4674` — the merged PR #262 that introduced the actions pattern
Use `main` as the reference for the queue/workflow communication patterns. Use `queues-to-actions` as the reference for bug fixes and new features that MUST be preserved.
## What to KEEP (do NOT revert these)
These are bug fixes and improvements made during the actions refactor that are independent of the communication pattern:
### 1. Lazy task actor creation
- Virtual task entries in org's `taskIndex` + `taskSummaries` tables (no actor fan-out during PR sync)
- `refreshTaskSummaryForBranchMutation` writes directly to org tables instead of spawning task actors
- Task actors self-initialize in `getCurrentRecord()` from `getTaskIndexEntry` when lazily created
- `getTaskIndexEntry` action on org actor
- See CLAUDE.md "Lazy Task Actor Creation" section
### 2. `resolveTaskRepoId` replacing `requireRepoExists`
- `requireRepoExists` was removed — it did a cross-actor call from org to github-data that was fragile
- Replaced with `resolveTaskRepoId` which reads from the org's local `taskIndex` table
- `getTask` action resolves `repoId` from task index when not provided (sandbox actor only has taskId)
### 3. `getOrganizationContext` overrides threaded through sync phases
- `fullSyncBranchBatch`, `fullSyncMembers`, `fullSyncPullRequestBatch` now pass `connectedAccount`, `installationStatus`, `installationId` overrides from `FullSyncConfig`
- Without this, phases 2-4 fail with "Organization not initialized" when the org profile doesn't exist yet (webhook-triggered sync before user sign-in)
### 4. E2B sandbox fixes
- `timeoutMs: 60 * 60 * 1000` in E2B create options (TEMPORARY until rivetkit autoPause lands)
- Sandbox repo path uses `/home/user/repo` for E2B compatibility
- `listProcesses` error handling for expired E2B sandboxes
### 5. Frontend fixes
- React `useEffect` dependency stability in `mock-layout.tsx` and `organization-dashboard.tsx` (prevents infinite re-render loops)
- Terminal pane ref handling
### 6. Process crash protection
- `process.on("uncaughtException")` and `process.on("unhandledRejection")` handlers in `foundry/packages/backend/src/index.ts`
### 7. CLAUDE.md updates
- All new sections: lazy task creation rules, no-silent-catch policy, React hook dependency safety, dev workflow instructions, debugging section
### 8. `requireWorkspaceTask` uses `getOrCreate`
- User-initiated actions (createSession, sendMessage, etc.) use `getOrCreate` to lazily materialize virtual tasks
- The `getOrCreate` call passes `{ organizationId, repoId, taskId }` as `createWithInput`
### 9. `getTask` uses `getOrCreate` with `resolveTaskRepoId`
- When `repoId` is not provided (sandbox actor), resolves from task index
- Uses `getOrCreate` since the task may be virtual
### 10. Audit log deleted workflow file
- `foundry/packages/backend/src/actors/audit-log/workflow.ts` was deleted
- The audit-log actor was simplified to a single `append` action
- Keep this simplification — audit-log doesn't need a workflow
## What to REVERT (communication pattern only)
For each actor, revert from direct action calls back to queue sends with `expectQueueResponse` / fire-and-forget patterns. The reference for the queue patterns is `main` at `32f3c6c3`.
### 1. Organization actor (`foundry/packages/backend/src/actors/organization/`)
**`index.ts`:**
- Revert from actions-only to `run: workflow(runOrganizationWorkflow)`
- Keep the actions that are pure reads (getAppSnapshot, getOrganizationSummarySnapshot, etc.)
- Mutations should go through the workflow queue command loop
**`workflow.ts`:**
- Restore `runOrganizationWorkflow` with the `ctx.loop("organization-command-loop", ...)` that dispatches queue names to mutation handlers
- Restore `ORGANIZATION_QUEUE_NAMES` and `COMMAND_HANDLERS`
- Restore `organizationWorkflowQueueName()` helper
**`app-shell.ts`:**
- Revert direct action calls back to queue sends: `sendOrganizationCommand(org, "organization.command.X", body)` pattern
- Revert `githubData.syncRepos(...)``githubData.send(githubDataWorkflowQueueName("syncRepos"), ...)`
- But KEEP the `getOrganizationContext` override threading fix
**`actions/tasks.ts`:**
- Keep `resolveTaskRepoId` (replacing `requireRepoExists`)
- Keep `requireWorkspaceTask` using `getOrCreate`
- Keep `getTask` using `getOrCreate` with `resolveTaskRepoId`
- Keep `getTaskIndexEntry`
- Revert task actor calls from direct actions to queue sends where applicable
**`actions/task-mutations.ts`:**
- Keep lazy task creation (virtual entries in org tables)
- Revert `taskHandle.initialize(...)``taskHandle.send(taskWorkflowQueueName("task.command.initialize"), ...)`
- Revert `task.pullRequestSync(...)``task.send(taskWorkflowQueueName("task.command.pullRequestSync"), ...)`
- Revert `auditLog.append(...)``auditLog.send("auditLog.command.append", ...)`
**`actions/organization.ts`:**
- Revert direct calls to org workflow back to queue sends
**`actions/github.ts`:**
- Revert direct calls back to queue sends
### 2. Task actor (`foundry/packages/backend/src/actors/task/`)
**`index.ts`:**
- Revert from actions-only to `run: workflow(runTaskWorkflow)` (or plain `run` with queue iteration)
- Keep read actions: `get`, `getTaskSummary`, `getTaskDetail`, `getSessionDetail`
**`workflow/index.ts`:**
- Restore `taskCommandActions` as queue handlers in the workflow command loop
- Restore `TASK_QUEUE_NAMES` and dispatch map
**`workspace.ts`:**
- Revert sandbox/org action calls back to queue sends where they were queue-based before
### 3. User actor (`foundry/packages/backend/src/actors/user/`)
**`index.ts`:**
- Revert from actions-only to `run: workflow(runUserWorkflow)` (or plain run with queue iteration)
**`workflow.ts`:**
- Restore queue command loop dispatching to mutation functions
### 4. GitHub-data actor (`foundry/packages/backend/src/actors/github-data/`)
**`index.ts`:**
- Revert from actions-only to having a run handler with queue iteration
- Keep the `getOrganizationContext` override threading fix
- Keep the `actionTimeout: 10 * 60_000` for long sync operations
### 5. Audit-log actor
- Keep as actions-only (simplified). No need to revert — it's simpler with just `append`.
### 6. Callers
**`foundry/packages/backend/src/services/better-auth.ts`:**
- Revert direct user actor action calls back to queue sends
**`foundry/packages/backend/src/actors/sandbox/index.ts`:**
- Revert `organization.getTask(...)` → queue send if it was queue-based before
- Keep the E2B timeout fix and listProcesses error handling
## Step-by-step procedure
1. Create a new branch from `task-owner-git-auth` (current HEAD)
2. For each actor, open a 3-way comparison: `main` (original queues), `queues-to-actions` (current), and your working copy
3. Restore queue/workflow run handlers and command loops from `main`
4. Restore queue name helpers and constants from `main`
5. Restore caller sites to use queue sends from `main`
6. Carefully preserve all items in the "KEEP" list above
7. Test: `cd foundry && docker compose -f compose.dev.yaml up -d`, sign in, verify GitHub sync completes, verify tasks show in sidebar, verify session creation works
8. Nuke RivetKit data between test runs: `docker volume rm foundry_foundry_rivetkit_storage`
## Verification checklist
- [ ] GitHub sync completes (160 repos for rivet-dev)
- [ ] Tasks show in sidebar (from PR sync, lazy/virtual entries)
- [ ] No task actors spawned during sync (check RivetKit inspector — should see 0 task actors until user clicks one)
- [ ] Clicking a task materializes the actor (lazy creation via getOrCreate)
- [ ] Session creation works on sandbox-agent-testing repo
- [ ] E2B sandbox provisions and connects
- [ ] Agent responds to messages
- [ ] No 500 errors in backend logs (except expected E2B sandbox expiry)
- [ ] Workflow history visible in RivetKit inspector for org, task, user actors
- [ ] CLAUDE.md constraints still documented and respected

View file

@ -0,0 +1,94 @@
# Proposal: RivetKit Sandbox Actor Resilience
## Context
The rivetkit sandbox actor (`src/sandbox/actor.ts`) does not handle the case where the underlying cloud sandbox (e.g. E2B VM) is destroyed while the actor is still alive. This causes cascading 500 errors when the actor tries to call the dead sandbox. Additionally, a UNIQUE constraint bug in event persistence crashes the host process.
The sandbox-agent repo (which defines the E2B provider) will be updated separately to use `autoPause` and expose `pause()`/typed errors. This proposal covers the rivetkit-side changes needed to handle those signals.
## Changes
### 1. Fix `persistObservedEnvelope` UNIQUE constraint crash
**File:** `insertEvent` in the sandbox actor's SQLite persistence layer
The `sandbox_agent_events` table has a UNIQUE constraint on `(session_id, event_index)`. When the same event is observed twice (reconnection, replay, duplicate WebSocket delivery), the insert throws and crashes the host process as an unhandled rejection.
**Fix:** Change the INSERT to `INSERT OR IGNORE` / `ON CONFLICT DO NOTHING`. Duplicate events are expected and harmless — they should be silently deduplicated at the persistence layer.
### 2. Handle destroyed sandbox in `ensureAgent()`
**File:** `src/sandbox/actor.ts``ensureAgent()` function
When the provider's `start()` is called with an existing `sandboxId` and the sandbox no longer exists, the provider throws a typed `SandboxDestroyedError` (defined in the sandbox-agent provider contract).
`ensureAgent()` should catch this error and check the `onSandboxExpired` config option:
```typescript
// New config option on sandboxActor()
onSandboxExpired?: "destroy" | "recreate"; // default: "destroy"
```
**`"destroy"` (default):**
- Set `state.sandboxDestroyed = true`
- Emit `sandboxExpired` event to all connected clients
- All subsequent action calls (runProcess, createSession, etc.) return a clear error: "Sandbox has expired. Create a new task to continue."
- The sandbox actor stays alive (preserves session history, audit log) but rejects new work
**`"recreate"`:**
- Call provider `create()` to provision a fresh sandbox
- Store new `sandboxId` in state
- Emit `sandboxRecreated` event to connected clients with a notice that sessions are lost (new VM, no prior state)
- Resume normal operation with the new sandbox
### 3. Expose `pause` action
**File:** `src/sandbox/actor.ts` — actions
Add a `pause` action that delegates to the provider's `pause()` method. This is user-initiated only (e.g. user clicks "Pause sandbox" in UI to save credits). The sandbox actor should never auto-pause.
```typescript
async pause(c) {
await c.provider.pause();
state.sandboxPaused = true;
c.broadcast("sandboxPaused", {});
}
```
### 4. Expose `resume` action
**File:** `src/sandbox/actor.ts` — actions
Add a `resume` action for explicit recovery. Calls `provider.start({ sandboxId: state.sandboxId })` which auto-resumes if paused.
```typescript
async resume(c) {
await ensureAgent(c); // handles reconnect internally
state.sandboxPaused = false;
c.broadcast("sandboxResumed", {});
}
```
### 5. Keep-alive while sessions are active
**File:** `src/sandbox/actor.ts`
While the sandbox actor has connected WebSocket clients, periodically extend the underlying sandbox TTL to prevent it from being garbage collected mid-session.
- On first client connect: start a keep-alive interval (e.g. every 2 minutes)
- Each tick: call `provider.extendTimeout(extensionMs)` (the provider maps this to `sandbox.setTimeout()` for E2B)
- On last client disconnect: clear the interval, let the sandbox idle toward its natural timeout
This prevents the common case where a user is actively working but the sandbox expires because the E2B default timeout (5 min) is too short. The `timeoutMs` in create options is the initial TTL; keep-alive extends it dynamically.
## Key invariant
**Never silently fail.** Every destroyed/expired/error state must be surfaced to connected clients via events. The actor must always tell the UI what happened so the user can act on it. See CLAUDE.md "never silently catch errors" rule.
## Dependencies
These changes depend on the sandbox-agent provider contract exposing:
- `pause()` method
- `extendTimeout(ms)` method
- Typed `SandboxDestroyedError` thrown from `start()` when sandbox is gone
- `start()` auto-resuming paused sandboxes via `Sandbox.connect(sandboxId)`

View file

@ -0,0 +1,200 @@
# Proposal: Task Primary Owner & Git Authentication
## Problem
Sandbox git operations (commit, push, PR creation) require authentication.
Currently, the sandbox has no user-scoped credentials. The E2B sandbox
clones repos using the GitHub App installation token, but push operations
need user-scoped auth so commits are attributed correctly and branch
protection rules are enforced.
## Design
### Concept: Primary User per Task
Each task has a **primary user** (the "owner"). This is the last user who
sent a message on the task. Their GitHub OAuth credentials are injected
into the sandbox for git operations. When the owner changes, the sandbox
git config and credentials swap to the new user.
### Data Model
**Task actor DB** -- new `task_owner` single-row table:
- `primaryUserId` (text) -- better-auth user ID
- `primaryGithubLogin` (text) -- GitHub username (for `git config user.name`)
- `primaryGithubEmail` (text) -- GitHub email (for `git config user.email`)
- `primaryGithubAvatarUrl` (text) -- avatar for UI display
- `updatedAt` (integer)
**Org coordinator** -- add to `taskSummaries` table:
- `primaryUserLogin` (text, nullable)
- `primaryUserAvatarUrl` (text, nullable)
### Owner Swap Flow
Triggered when `sendWorkspaceMessage` is called with a different user than
the current primary:
1. `sendWorkspaceMessage(authSessionId, ...)` resolves user from auth session
2. Look up user's GitHub identity from auth account table (`providerId = "github"`)
3. Compare `primaryUserId` with current owner. If different:
a. Update `task_owner` row in task actor DB
b. Get user's OAuth `accessToken` from auth account
c. Push into sandbox via `runProcess`:
- `git config user.name "{login}"`
- `git config user.email "{email}"`
- Write token to `/home/user/.git-token` (or equivalent)
d. Push updated task summary to org coordinator (includes `primaryUserLogin`)
e. Broadcast `taskUpdated` to connected clients
4. If same user, no-op (token is still valid)
### Token Injection
The user's GitHub OAuth token (stored in better-auth account table) has
`repo` scope (verified -- see `better-auth.ts` line 480: `scope: ["read:org", "repo"]`).
This is a standard **OAuth App** flow (not GitHub App OAuth). OAuth App
tokens do not expire unless explicitly revoked. No refresh logic is needed.
**Injection method:**
On first sandbox repo setup (`ensureSandboxRepo`), configure:
```bash
# Write token file
echo "{token}" > /home/user/.git-token
chmod 600 /home/user/.git-token
# Configure git to use it
git config --global credential.helper 'store --file=/home/user/.git-token'
# Format: https://{login}:{token}@github.com
echo "https://{login}:{token}@github.com" > /home/user/.git-token
```
On owner swap, overwrite `/home/user/.git-token` with new user's credentials.
**Important: git should never prompt for credentials.** The credential
store file ensures all git operations are auto-authenticated. No
`GIT_ASKPASS` prompts, no interactive auth.
**Race condition (expected behavior):** If User A sends a message and the
agent starts a long git operation, then User B sends a message and triggers
an owner swap, the in-flight git process still has User A's credentials
(already read from the credential store). The next git operation uses
User B's credentials. This is expected behavior -- document in comments.
### Token Validity
OAuth App tokens (our flow) do not expire. They persist until the user
revokes them or the OAuth App is deauthorized. No periodic refresh needed.
If a token becomes invalid (user revokes), git operations will fail with
a 401. The error surfaces through the standard `ensureSandboxRepo` /
`runProcess` error path and is displayed in the UI.
### User Removal
When a user is removed from the organization:
1. Org actor queries active tasks with that user as primary owner
2. For each, clear the `task_owner` row
3. Task actor clears the sandbox git credentials (overwrite credential file)
4. Push updated task summaries to org coordinator
5. Subsequent git operations fail with "No active owner -- assign an owner to enable git operations"
### UI Changes
**Right sidebar -- new "Overview" tab:**
- Add as a new tab alongside "Changes" and "All Files"
- Shows current primary user: avatar, name, login
- Click on the user -> dropdown of all workspace users (from org member list)
- Select a user -> triggers explicit owner swap (same flow as message-triggered)
- Also shows task metadata: branch, repo, created date
**Left sidebar -- task items:**
- Show primary user's GitHub login in green text next to task name
- Only shown when there is an active owner
**Task detail header:**
- Show small avatar of primary user next to task title
### Org Coordinator
`commandApplyTaskSummaryUpdate` already receives the full task summary
from the task actor. Add `primaryUserLogin` and `primaryUserAvatarUrl`
to the summary payload. The org writes it to `taskSummaries`. The sidebar
reads it from the org snapshot.
### Sandbox Architecture Note
Structurally, the system supports multiple sandboxes per task, but in
practice there is exactly one active sandbox per task. Design the owner
injection assuming one sandbox. The token is injected into the active
sandbox only. If multi-sandbox support is needed in the future, extend
the injection to target specific sandbox IDs.
## Security Considerations
### OAuth Token Scope
The user's GitHub OAuth token has `repo` scope, which grants **full control
of all private repositories** the user has access to. When injected into
the sandbox:
- The agent can read/write ANY repo the user has access to, not just the
task's target repo
- The token persists in the sandbox filesystem until overwritten
- Any process running in the sandbox can read the credential file
**Mitigations:**
- Credential file has `chmod 600` (owner-read-only)
- Sandbox is isolated per-task (E2B VM boundary)
- Token is overwritten on owner swap (old user's token removed)
- Token is cleared on user removal from org
- Sandbox has a finite lifetime (E2B timeout + autoPause)
**Accepted risk:** This is the standard trade-off for OAuth-based git
integrations (same as GitHub Codespaces, Gitpod, etc.). The user consents
to `repo` scope at sign-in time. Document this in user-facing terms in
the product's security/privacy page.
### Future: Fine-grained tokens
GitHub supports fine-grained personal access tokens scoped to specific
repos. A future improvement could mint per-repo tokens instead of using
the user's full OAuth token. This requires the user to create and manage
fine-grained tokens, which adds friction. Evaluate based on user feedback.
## Implementation Order
1. Add `task_owner` table to task actor schema + migration
2. Add `primaryUserLogin` / `primaryUserAvatarUrl` to `taskSummaries` schema + migration
3. Implement owner swap in `sendWorkspaceMessage` flow
4. Implement credential injection in `ensureSandboxRepo`
5. Implement credential swap via `runProcess` on owner change
6. Implement user removal cleanup in org actor
7. Add "Overview" tab to right sidebar
8. Add owner display to left sidebar task items
9. Add owner picker dropdown in Overview tab
10. Update org coordinator to propagate owner in task summaries
## Files to Modify
### Backend
- `foundry/packages/backend/src/actors/task/db/schema.ts` -- add `task_owner` table
- `foundry/packages/backend/src/actors/task/db/migrations.ts` -- add migration
- `foundry/packages/backend/src/actors/organization/db/schema.ts` -- add owner columns to `taskSummaries`
- `foundry/packages/backend/src/actors/organization/db/migrations.ts` -- add migration
- `foundry/packages/backend/src/actors/task/workspace.ts` -- owner swap logic in `sendWorkspaceMessage`, credential injection in `ensureSandboxRepo`
- `foundry/packages/backend/src/actors/task/workflow/index.ts` -- wire owner swap action
- `foundry/packages/backend/src/actors/organization/actions/task-mutations.ts` -- propagate owner in summaries
- `foundry/packages/backend/src/actors/organization/actions/tasks.ts` -- `sendWorkspaceMessage` owner check
- `foundry/packages/backend/src/services/better-auth.ts` -- expose `getAccessTokenForSession` for owner lookup
### Shared
- `foundry/packages/shared/src/types.ts` -- add `primaryUserLogin` to `TaskSummary`
### Frontend
- `foundry/packages/frontend/src/components/mock-layout/right-sidebar.tsx` -- add Overview tab
- `foundry/packages/frontend/src/components/organization-dashboard.tsx` -- show owner in sidebar task items
- `foundry/packages/frontend/src/components/mock-layout.tsx` -- wire Overview tab state