mirror of
https://github.com/harivansh-afk/sandbox-agent.git
synced 2026-04-15 09:01:17 +00:00
# General Friction Log

## 2026-03-13 - uncommitted

### What I Was Working On

Debugging slow GitHub OAuth sign-in in production after deploying backend request logging (d0ed0a4).

### Friction / Issue

Production logs showed two separate HTTP requests (different request IDs, ~9s apart) hitting `GET /v1/auth/github/callback` with the same `code` and `state` parameters. The first request succeeded (`exchangeCode` returned a token) but took ~18s total because `syncGithubSessionFromToken` made multiple sequential GitHub API calls. The second request arrived while the first was still syncing, passed the OAuth state validation (the state was never cleared), and attempted `exchangeCode` with the already-consumed code, which GitHub rejected with `bad_verification_code`.

The root cause of the duplicate HTTP request is unknown. It is not `appWorkspaceAction` (no retry logic in the current version), not a Railway proxy retry (no such config), and not a frontend double-navigation (the SPA is not involved during the OAuth redirect chain). The best hypothesis is the user refreshing during the ~18s blank-page wait, but this is unconfirmed.

### Attempted Fix / Workaround

1. Made `completeAppGithubAuth` clear `oauthState`/`oauthStateExpiresAt` immediately after validation and before `exchangeCode`, so any duplicate request fails the state check instead of hitting GitHub with a consumed code.
2. Split `syncGithubSessionFromToken` into a fast path (`initGithubSession` — exchange code, get viewer, store token+identity) and a slow path (`syncGithubOrganizations` — list orgs, list installations, sync each organization).
3. `completeAppGithubAuth` now uses the fast path and enqueues the slow org sync to the organization workflow queue (`organization.command.syncGithubSession`, fire-and-forget). The HTTP callback returns a 302 redirect in ~2s instead of ~18s, eliminating the proxy timeout window.
4. The frontend already polls `getAppSnapshot` every 500ms when any org has `syncStatus === "syncing"`, so the deferred sync is transparent to the user.
5. `bootstrapAppGithubSession` (dev-only) still calls the full synchronous `syncGithubSessionFromToken`, since proxy timeouts are not a concern in dev and it needs the session fully populated before returning.
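
The consume-state-before-exchange check from step 1 can be sketched as follows. This is a minimal sketch with hypothetical store and helper names (`OAuthStateStore`, `consumeOauthState`), not the actual `completeAppGithubAuth` implementation:

```typescript
// Sketch of the state-clearing idempotency check (hypothetical names).
// The state row is cleared BEFORE the code exchange, so a duplicate
// callback carrying the same `state` fails fast at the state check
// instead of replaying the already-consumed `code` against GitHub.
type OAuthStateStore = {
  oauthState?: string;
  oauthStateExpiresAt?: number; // epoch millis
};

function consumeOauthState(store: OAuthStateStore, state: string, now: number): boolean {
  const valid =
    store.oauthState === state &&
    store.oauthStateExpiresAt !== undefined &&
    store.oauthStateExpiresAt > now;
  if (valid) {
    // Clear immediately on a successful match so a second request
    // with the same state cannot pass validation.
    store.oauthState = undefined;
    store.oauthStateExpiresAt = undefined;
  }
  return valid;
}
```

The key property is that validation and invalidation happen in one step, before any slow network call.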

### Outcome

- OAuth callback responds in ~2s (`exchangeCode` + `getViewer`) instead of ~18s.
- The proxy retry window is eliminated, so no duplicate requests should occur.
- Any duplicate request that does arrive is still guarded by the state-clearing idempotency check.
- Organization data populates asynchronously via the workflow queue; the frontend shows a loading state and polls until complete.
- The root cause of the duplicate HTTP request (possibly an upstream proxy retry on the slow GET) remains uninvestigated but is no longer a practical problem.

## 2026-03-05 - uncommitted

### What I Was Working On

Verifying the BaseUI frontend against the real `rivet-dev/sandbox-agent-testing` repo, creating live PR-backed tasks, and driving the flow through the browser.

### Friction / Issue

Three separate issues stacked together during live verification:

1. A half-created task actor remained in repository indexes after earlier runtime failures. The actor state existed, but its durable task row did not, so repo overview polling spammed `Task not found` and kept trying to load an orphaned task.
2. Rebuilding the backend container outside `just dev` dropped injected GitHub auth, which made repo overview fall back to `Open PRs 0` until `GITHUB_TOKEN`/`GH_TOKEN` were passed back into `docker compose`.
3. In the create-task modal, the BaseUI-controlled form looked populated in the browser, but submit gating/click behavior was unreliable under browser automation, making it hard to distinguish frontend state bugs from backend failures.

### Attempted Fix / Workaround

1. Updated repository-actor stale task pruning to treat `Task not found:` the same as actor-not-found and rebuilt the backend image.
2. Recovered the orphaned task by forcing an initialize attempt, which surfaced a missing `body?.providerId` guard in the task init workflow and led to pruning the stale repository index row.
3. Recreated the backend with `GITHUB_TOKEN="$(gh auth token)" GH_TOKEN="$(gh auth token)" docker compose ... up -d --build backend` so PR sync could see live GitHub data again.
4. Used `agent-browser` plus screenshots to separate working paths (repo overview + PR visibility) from the remaining broken path (modal submit / task creation UI).

### Outcome

- Live repo overview now shows the real `sandbox-agent-testing` PRs again.
- The stale task actor no longer blocks repo overview polling.
- The remaining blocker is narrowed to the frontend create-task interaction path, plus missing agent API credentials for exercising real agent messaging end to end.

## 2026-03-06 - uncommitted

### What I Was Working On

Exercising the live selected-task UI end to end, including session creation, prompt send, and agent response rendering.

### Friction / Issue

The Docker dev backend container was starting on Bun `1.2.23` and accepting TCP connections on `7741`/`7750`, but every HTTP request stalled indefinitely. The same backend code responded immediately when started directly on the host with Bun `1.3.5`, so the hang was specific to the older Bun runtime in `docker/backend.dev.Dockerfile`.

### Attempted Fix / Workaround

1. Verified the stall both from the host and from inside the backend container with `curl`/`fetch`.
2. Started the backend directly on the host on an alternate port to confirm the code path itself was healthy.
3. Updated the dev backend image base from `oven/bun:1.2` to `oven/bun:1.3` so `docker compose` uses the working Bun line.

### Outcome

- Dev-runtime debugging is narrowed from "backend/UI path is broken" to a concrete Docker Bun version issue.
- After rebuild, the next verification step is the real selected-task transcript flow with agent messaging.

## 2026-02-17 - uncommitted

### What I Was Working On

Implementing Daytona snapshot-based sandbox creation and running the required workspace validation.

### Friction / Issue

The workspace `node_modules` tree is partially root-owned in this environment. `pnpm install`/cleanup failed with `EACCES` and left missing local tool entrypoints (for example `turbo`/`typescript`), which blocked `pnpm -w typecheck/build/test` from running end to end.

### Attempted Fix / Workaround

1. Attempted a workspace reinstall (`pnpm install`, `CI=true pnpm install`) and a package-level reinstall.
2. Attempted cleanup/recreation of `node_modules`, but the root-owned files could not be removed.
3. Added temporary local shims for missing tool entrypoints to continue targeted validation.

### Outcome

- Daytona-specific changes and backend tests were validated.
- Full workspace validation remains blocked until `node_modules` ownership is repaired (or the container is recreated).

## 2026-02-16 - uncommitted

### What I Was Working On

Implementing git-spice-backed stack actions and repo overview in the frontend/actors.

### Friction / Issue

The `gs` binary on this environment resolves to Ghostscript (`/usr/bin/gs`), not git-spice. Relying on `gs` directly would execute the wrong tool and silently break stack actions.

### Attempted Fix / Workaround

1. Added git-spice command resolution that tries:
   - the `HF_GIT_SPICE_BIN` override
   - `git-spice`
   - `git spice` (git plugin form)
2. Avoided `gs` as a default executable.
3. Added explicit unavailability messaging when git-spice is not installed.
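
The resolution order above can be sketched as follows. The helper name `resolveGitSpiceCommand` and the injected probe are assumptions for illustration, not the actual backend code:

```typescript
// Sketch of git-spice command resolution (hypothetical helper names).
// Tries, in order: the HF_GIT_SPICE_BIN override, a standalone
// `git-spice` binary, then the `git spice` plugin form. It never
// falls back to bare `gs`, which may resolve to Ghostscript.
type CommandProbe = (argv: string[]) => boolean;

function resolveGitSpiceCommand(
  env: Record<string, string | undefined>,
  canRun: CommandProbe,
): string[] | undefined {
  const override = env.HF_GIT_SPICE_BIN;
  if (override && canRun([override, "--version"])) return [override];
  if (canRun(["git-spice", "--version"])) return ["git-spice"];
  if (canRun(["git", "spice", "--version"])) return ["git", "spice"];
  // Explicit unavailability: callers turn this into a clear
  // "git-spice is not installed" message instead of running `gs`.
  return undefined;
}
```

Injecting the probe keeps the order testable without spawning processes; the real probe would shell out (e.g. with `Bun.spawnSync`).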

### Outcome

- Stack actions no longer depend on ambiguous `gs` resolution.
- Backend behavior is predictable across environments with/without git-spice installed.

## 2026-02-12 - c2517f2

### What I Was Working On

Fixing Daytona `hf create` failures where `task.attach` would exhaust retries with `Task not found`.

### Friction / Issue

Foundry was using RivetKit's KV-backed durable SQLite VFS via `rivetkit/db/drizzle`, which opens the SQLite DB keyed by `ctx.actorId`. Since actor instances can be rescheduled (new `actorId`) between requests, DB writes from initialization were not visible to later actions (e.g. `attach`), causing `Task not found` errors and action timeouts.

Separately, importing `bun:sqlite` directly broke:

- `tsup` builds (esbuild can't resolve `bun:sqlite` unless externalized)
- `vitest` runs (the Vite resolver can't resolve `bun:` specifiers)

### Attempted Fix / Workaround

- Switched the backend actor DB provider to a shared on-disk SQLite database at `config.backend.dbPath` using Bun's `bun:sqlite` + Drizzle, with inline migrations and per-connection PRAGMAs.
- Hid Bun-only module resolution behind dynamic imports so `vitest` can load the modules.
- Used the KV-backed DB provider only for Node/Vitest environments (tests), while the Bun runtime uses the shared on-disk DB.
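
The Bun-only resolution guard can be sketched like this. The function names and the stubbed fallback are assumptions; the real provider wiring (Drizzle, inline migrations) is more involved:

```typescript
// Sketch: hide the Bun-only module behind a runtime check plus a dynamic
// import so bundlers (tsup/esbuild) and the Vite/vitest resolver never see
// a static `bun:sqlite` specifier. (Hypothetical provider names.)
function isBunRuntime(): boolean {
  return typeof (process.versions as { bun?: string }).bun === "string";
}

async function openActorDb(dbPath: string): Promise<unknown> {
  if (isBunRuntime()) {
    // A variable specifier keeps static module graphs (tsup, Vite) from
    // trying to resolve the Bun-only module at build time.
    const specifier: string = "bun" + ":sqlite";
    const mod = (await import(specifier)) as {
      Database: new (path: string) => { exec(sql: string): void };
    };
    const db = new mod.Database(dbPath);
    db.exec("PRAGMA journal_mode = WAL;"); // per-connection PRAGMA on the shared on-disk DB
    return db;
  }
  // Node/vitest path: a stand-in for the KV-backed test provider.
  return { kind: "kv-backed-test-provider", dbPath };
}
```

The important property is that no `bun:` specifier appears as a static import anywhere in the module graph.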

### Outcome

- Daytona `hf create` now completes and returns a valid session and `daytona://...` target.
- `pnpm -w typecheck`, `pnpm -w build`, and `pnpm -w test` are green.

## 2026-02-09 - uncommitted

### What I Was Working On

Making `hf`/backend Bun-native and integrating OpenTUI without a Node fallback path.

### Friction / Issue

OpenTUI (`@opentui/core`) could not run under Node due to Bun-specific imports/assets (`bun:ffi`, `.scm` module loading), which broke `hf`'s default interactive mode.

### Attempted Fix / Workaround

1. Removed runtime assumptions that the backend/CLI would execute under Node.
2. Switched the CLI entrypoint and backend launch commands to Bun.
3. Updated docs and tooling guidance to require Bun for runtime execution.

### Outcome

- OpenTUI remains the single TUI path.
- Runtime expectations are explicit: Bun is required for `hf` interactive execution.

## 2026-02-09 - uncommitted

### What I Was Working On

Implementing `hf` backend auto-ensure/auto-restart-on-outdated behavior and adding CLI tests for backend lifecycle logic.

### Friction / Issue

Vitest ESM module namespace exports are non-configurable, so `vi.spyOn(childProcess, "spawn")` failed when testing backend launch behavior.

### Attempted Fix / Workaround

1. Replaced direct `spyOn` with a hoisted `vi.mock("node:child_process", ...)`.
2. Injected mocked `spawn`/`execFileSync` via the module mock.
3. Updated tests to assert lifecycle behavior through the mocked module functions.
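
The mock shape can be sketched as a plain factory, then registered through `vi.mock`. The factory name and the recorded-call shape are assumptions; only the vitest APIs (`vi.mock`, `vi.hoisted`) are real:

```typescript
// Sketch of the module-mock approach (hypothetical factory name).
// ESM namespace exports are frozen, so vi.spyOn(childProcess, "spawn")
// throws; replacing the whole module at load time works instead.
type SpawnCall = { command: string; args: string[] };

function createChildProcessMock() {
  const spawnCalls: SpawnCall[] = [];
  return {
    spawnCalls,
    spawn(command: string, args: string[] = []) {
      spawnCalls.push({ command, args });
      // Minimal child-process shape a backend manager typically touches.
      return { pid: 12345, unref() {} };
    },
    execFileSync(_command: string, _args: string[] = []): string {
      return ""; // e.g. a version-probe result
    },
  };
}

// In the vitest file, the factory is hoisted above the imports of the
// code under test:
//   const mock = vi.hoisted(() => createChildProcessMock());
//   vi.mock("node:child_process", () => ({
//     spawn: mock.spawn,
//     execFileSync: mock.execFileSync,
//   }));
```

Recording calls in the factory lets the tests assert lifecycle behavior (what was spawned, with which args) without touching real processes.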

### Outcome

- Backend manager tests are stable under ESM.
- Full workspace tests pass, with lifecycle coverage for outdated-backend restart behavior.

## 2026-02-08 - uncommitted

### What I Was Working On

Finalizing migration implementation and validation across code, docs, and tests.

### Friction / Issue

The environment did not provide `rg`, and docs/policy files still described Rust-era workflows after the runtime migration.

### Attempted Fix / Workaround

1. Switched repository discovery to `find`/`grep`.
2. Rewrote repository guidance files (`CLAUDE.md`, `skills/SKILL.md`, docs, `SPEC.md`) to match the TypeScript architecture.
3. Added missing TUI test coverage so monorepo-wide test runs no longer fail on packages without tests.

### Outcome

- The full workflow is now documented around TypeScript + pnpm + Turborepo + RivetKit actors.
- The validation pipeline is runnable with one consistent command set.

## 2026-02-08 - uncommitted

### What I Was Working On

Running full workspace test validation (`pnpm -w test`) for the migrated monorepo.

### Friction / Issue

Backend integration tests depend on native `better-sqlite3` bindings, which were unavailable in this environment.

### Attempted Fix / Workaround

1. Attempted `pnpm --filter @sandbox-agent/foundry-backend rebuild better-sqlite3`.
2. Added runtime capability detection in DB-backed backend tests.
3. Marked DB-backed tests with `it.skipIf(!hasBetterSqliteBinding)` so tests run when native bindings exist and skip cleanly otherwise.
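
The capability probe can be sketched as follows (the helper name `hasNativeModule` is an assumption; `it.skipIf` is real vitest API). Probing with `createRequire` turns a missing native binding into a boolean instead of a hard import failure:

```typescript
// Sketch of native-binding capability detection (hypothetical helper name).
import { createRequire } from "node:module";
import { join } from "node:path";

// A require function anchored at the working directory, usable from ESM.
const nativeRequire = createRequire(join(process.cwd(), "probe.js"));

function hasNativeModule(moduleId: string): boolean {
  try {
    nativeRequire(moduleId); // throws if the module or its binding is absent
    return true;
  } catch {
    return false;
  }
}

const hasBetterSqliteBinding = hasNativeModule("better-sqlite3");

// In the vitest file:
//   it.skipIf(!hasBetterSqliteBinding)("persists rows", () => { /* ... */ });
```

This keeps the suite green everywhere: environments with the binding run the DB tests, others skip them explicitly rather than erroring.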

### Outcome

- The full workspace test suite passes consistently.
- Backend unit coverage always runs; DB integration tests run automatically on environments with native bindings.

## 2026-02-09 - aab1012 (working tree)

### What I Was Working On

Cleaning up CLI UX noise while validating `hf` flows repeatedly.

### Friction / Issue

Bun emitted a warning on every `hf` invocation due to unsupported wildcard `sideEffects` patterns in the vendored RivetKit `package.json`.

### Attempted Fix / Workaround

1. Replaced the wildcard `sideEffects` array in `packages/rivetkit-vendor/rivetkit/package.json` with `false`.

### Outcome

- Per-command warning spam is gone.
- `hf` command output is now readable during normal usage and smoke testing.

## 2026-02-09 - aab1012 (working tree)

### What I Was Working On

Fixing `hf` launch behavior after `just install` when OpenTUI assets were loaded under Node.

### Friction / Issue

Global launcher resolution depended on the pnpm global bin + shell PATH state. In environments where Bun was not on PATH (or where another `hf` shim was used), the CLI could execute under Node and fail with:

- `Unknown file extension ".scm"` from `@opentui/core/assets/...`

### Attempted Fix / Workaround

1. Updated `just install` to install a deterministic launcher at `~/.local/bin/hf`.
2. The launcher explicitly resolves Bun from `$HF_BUN` or `~/.bun/bin/bun` (with a `command -v bun` fallback).
3. The launcher exits with a clear Bun-required error if Bun is unavailable.

### Outcome

- `hf` runs through Bun consistently after install, independent of pnpm global-bin PATH quirks.
- OpenTUI `.scm` asset load no longer goes through Node.

## 2026-02-09 - aab1012 (working tree)

### What I Was Working On

Eliminating `.scm` loader failures when `hf` is accidentally launched via Node.

### Friction / Issue

Even with Bun-first install scripts, user shells can still invoke `hf` through stale shell hashes, aliases, or other Node-based launch paths, causing an OpenTUI asset load failure:

- `ERR_UNKNOWN_FILE_EXTENSION .scm`

### Attempted Fix / Workaround

1. Added a CLI bootstrap guard in `packages/cli/src/index.ts`:
   - If the runtime is not Bun, re-exec with Bun (`$HF_BUN`, then `~/.bun/bin/bun`, then `bun` on PATH).
2. Deferred the OpenTUI import to a dynamic import (`import("./tui.js")`) so Node can reach the bootstrap guard before loading OpenTUI assets.
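
The guard can be sketched as below. The helper names (`resolveBunBinary`, `reexecUnderBunIfNeeded`) are assumptions about the shape of the real `packages/cli/src/index.ts` code; the resolution order matches the log:

```typescript
// Sketch of the Node-to-Bun re-exec guard (hypothetical helper names).
// It must run before any OpenTUI import, which is why the TUI module is
// only loaded afterwards via a dynamic import("./tui.js").
import { spawnSync } from "node:child_process";
import { existsSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

function isBunRuntime(): boolean {
  // Bun exposes process.versions.bun; Node does not.
  return typeof (process.versions as { bun?: string }).bun === "string";
}

function resolveBunBinary(env: Record<string, string | undefined>): string {
  if (env.HF_BUN) return env.HF_BUN; // explicit override wins
  const homeBun = join(homedir(), ".bun", "bin", "bun");
  if (existsSync(homeBun)) return homeBun; // default Bun install location
  return "bun"; // last resort: PATH lookup
}

function reexecUnderBunIfNeeded(): void {
  if (isBunRuntime()) return;
  const bun = resolveBunBinary(process.env);
  // Re-run the same script and argv under Bun, forwarding stdio, then
  // exit with the child's status so the wrapper is transparent.
  const result = spawnSync(bun, process.argv.slice(1), { stdio: "inherit" });
  process.exit(result.status ?? 1);
}
```

Keeping the guard dependency-free (only `node:` builtins) is what lets Node execute it before anything Bun-specific is touched.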

### Outcome

- `node packages/cli/dist/index.js --help` now works (auto re-execs to Bun).
- The `.scm` extension crash path is eliminated even when the launcher is Node-based.

## 2026-02-17 - uncommitted

### What I Was Working On

Validating new git-spice stack integration tests under `HF_ENABLE_ACTOR_INTEGRATION_TESTS=1`.

### Friction / Issue

Running backend tests with the integration flag enabled triggered unrelated actor integration suites and produced long, noisy failures (`Failed query ...`, `memory access out of bounds`) unrelated to the stack changes, making targeted validation difficult.

### Attempted Fix / Workaround

1. Switched to package-targeted test runs for deterministic coverage (`@sandbox-agent/foundry-backend` + `@sandbox-agent/foundry-frontend`).
2. Relied on the required workspace validation (`pnpm -w typecheck`, `pnpm -w build`, `pnpm -w test`) plus targeted stack test files.
3. Stopped the runaway integration run and recorded this friction for follow-up.

### Outcome

- New stack-focused tests pass in deterministic targeted runs.
- Full required workspace checks pass.
- The integration-gated suite remains noisy and needs separate stabilization.

## 2026-03-05 - uncommitted

### What I Was Working On

Reviewing architecture for simplification opportunities.

### Friction / Issue

Considered merging `repositoryPrSync` (30s) and `repositoryBranchSync` (5s) into a single `repositorySync` actor that polls at the faster cadence and does PR fetches every Nth tick. This would reduce actor count by one per repo but violates the single-responsibility-per-actor pattern established in the codebase. Mixed cadences within one actor add conditional tick logic, make the polling intervals harder to reason about independently, and couple two unrelated data sources (git branches vs the GitHub API) into one failure domain.

### Attempted Fix / Workaround

None — rejected the idea during review.

### Outcome

- Keep `repositoryPrSync` and `repositoryBranchSync` as separate actors.
- Single-responsibility-per-sync-actor is the right pattern for this codebase.

## 2026-03-06 - 77341ff

### What I Was Working On

Bringing up the Docker-based local dev stack with `just dev` after the BaseUI frontend migration.

### Friction / Issue

Docker Desktop recovered, but the frontend container failed immediately with `Cannot find module @rollup/rollup-linux-arm64-gnu`. The dev compose setup bind-mounted the host workspace into `/app`, so the Linux container picked up macOS `node_modules` and missed Rollup's Linux optional package.

### Attempted Fix / Workaround

1. Confirmed Docker itself was healthy again by checking the Unix socket, `docker version`, and the backend health endpoint.
2. Reproduced the frontend crash inside `docker compose`.
3. Changed the frontend dev service to use named volumes for the workspace `node_modules` and the pnpm store, and to run `pnpm install --frozen-lockfile` inside the container before starting Vite.

### Outcome

- Docker engine startup was restored.
- The compose stack no longer depends on host-architecture frontend dependencies.
- `just dev` can proceed to start the backend and Linux-native frontend services cleanly.

## 2026-03-06 - uncommitted

### What I Was Working On

Verifying the selected-task UI flow end to end in the browser: create repo, create task, select the task, start an agent session, and send a follow-up message.

### Friction / Issue

Local dev hit three stacked runtime issues during live UI verification:

1. The frontend's Vite proxy and the backend/manager startup order were brittle enough that `/api/rivet/metadata` or the manager port `7750` could briefly hang or refuse connections during restarts, which made browser verification look flaky even when the backend eventually came up.
2. The new local sandbox provider initially persisted only the sandbox-agent endpoint, not its bearer token, so ACP session creation later failed with `401 Token Invalid`.
3. The exported local `OPENAI_API_KEY` / `CODEX_API_KEY` credentials came from local ChatGPT/Codex auth state but did not include the `api.responses.write` scope required by Codex ACP, so the agent session could start but failed when the model tried to answer.

### Attempted Fix / Workaround

1. Added permissive CORS on the backend wrapper and iterated on live browser verification until the wrapper + manager startup sequence was stable again.
2. Updated the local provider to return both the sandbox-agent `endpoint` and `token`.
3. Updated `sandbox-instance` to refresh local-provider agent credentials instead of trusting stale persisted metadata across backend restarts.
4. Stopped injecting `OPENAI_API_KEY` / `CODEX_API_KEY` into the host-local sandbox-agent process so local Codex can fall back to machine-native auth instead of the under-scoped exported token.
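
The provider fix from step 2 can be sketched as follows. The type and function names (`LocalAgentCredentials`, `describeLocalAgent`, `authHeaders`) are hypothetical, chosen to show the invariant rather than mirror the real provider code:

```typescript
// Sketch of the credentials fix (hypothetical names). The local provider
// must persist BOTH the sandbox-agent endpoint and its bearer token;
// returning only the endpoint means later ACP session creation sends no
// Authorization header and fails with 401 Token Invalid.
type LocalAgentCredentials = {
  endpoint: string;
  token: string;
};

function describeLocalAgent(endpoint: string, token: string): LocalAgentCredentials {
  if (!token) {
    // Fail loudly at provision time instead of with a confusing 401 later.
    throw new Error("local provider must supply a sandbox-agent bearer token");
  }
  return { endpoint, token };
}

function authHeaders(creds: LocalAgentCredentials): Record<string, string> {
  return { Authorization: `Bearer ${creds.token}` };
}
```

Validating the token at the provider boundary moves the failure to where the bug actually is, which is the point of step 3's refresh-on-restart change as well.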

### Outcome

- The browser flow now reaches the real selected-task transcript screen.
- Task creation and initial session creation work in the UI against the local provider.
- A remaining upstream auth/runtime blocker still prevents a clean, verified assistant text response in the final follow-up-message step, so that part of the end-to-end flow cannot yet be claimed complete.