sandbox-agent/foundry/research/memory-investigation.md
Nathan Flurry ee99d0b318 feat(foundry): memory investigation tooling and VFS pool spec
Add memory monitoring instrumentation, investigation findings, and
SQLite VFS pool design spec for addressing WASM SQLite memory spikes.

- Add /debug/memory endpoint and periodic memory logging (dev only)
- Add mem-monitor.sh script for continuous memory profiling with
  automatic heap snapshot capture on spike detection
- Add configureRunnerPool to registry setup for engine driver support
- Document memory investigation findings (per-actor cost, spike behavior)
- Write SQLite VFS pool spec for bin-packing actors onto shared WASM instances
- Add foundry-mem-monitor and foundry-dev-engine justfile recipes
- Add compose.dev.yaml engine driver and platform support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 23:46:03 -07:00


Foundry Backend Memory Investigation

Date: 2026-03-17

Problem

Production Railway deployment shows memory spikes from near-zero to 40+ GB when users interact with the app. Local reproduction shows spikes from ~300 MB to ~2.1 GB when opening a task workspace.

Architecture

Each actor in the system has two SQLite instances:

  1. WASM SQLite (16.6 MB per actor) - Runs Drizzle ORM queries for actor-specific tables (task data, session transcripts, etc.). Each actor gets its own SqliteVfs which instantiates a full WebAssembly.Instance with 16.6 MB linear memory.

  2. Native bun:sqlite (~4-8 MB per actor) - Backs the KV store that the WASM SQLite's VFS reads/writes to. This is the persistence layer. Not visible in JS heap snapshots (native C memory).
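The 16.6 MB per instance is the WASM linear memory itself. A minimal sketch of why each `WebAssembly.Instance` shows up as its own multi-megabyte ArrayBuffer in heap snapshots (the ~254-page figure is an illustration chosen to match 16.6 MB; the real page count is fixed by the SQLite WASM build):

```typescript
// Each WebAssembly.Memory page is 64 KiB. ~254 pages corresponds to the
// 16.6 MB linear memory observed per SqliteVfs instance; 254 is an
// illustrative figure, not taken from the actual SQLite WASM build.
const PAGE_SIZE = 64 * 1024;
const memory = new WebAssembly.Memory({ initial: 254 });

// The linear memory is one contiguous ArrayBuffer, which is why N live
// instances appear as N separate ~16.6 MB ArrayBuffers in heap snapshots.
console.log(memory.buffer.byteLength);             // 16,646,144 bytes
console.log(memory.buffer.byteLength / PAGE_SIZE); // 254 pages
```

The buffer is allocated up front at instantiation, so the cost is paid even before the actor runs a single query.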

Findings

Memory breakdown (steady state, 14 active WASM instances)

| Category | Size | % of RSS | Description |
|---|---|---|---|
| WASM SQLite heaps | 232 MB | 46% | 14 x 16.6 MB ArrayBuffers (WASM linear memory) |
| Bun native (bun:sqlite + runtime) | 225 MB | 44% | KV backing store page caches, mmap'd WAL files, Bun runtime |
| JS application objects | 27 MB | 5% | Closures, actor state, plain objects |
| Module graph | 20 MB | 4% | Compiled code, FunctionCodeBlocks, ModuleRecords |
| ArrayBuffer intermediates | 4 MB | 1% | Non-WASM buffers |
| KV data in transit | ~0 MB | 0% | 4 KB chunks copied and freed immediately |

Spike behavior

When opening a task workspace, many actors wake simultaneously:

| State | WASM Instances | SqliteVfs | WASM Heap | Actors (task) | RSS |
|---|---|---|---|---|---|
| Baseline | 7-9 | 6-8 | 116-149 MB | 14 | 289-309 MB |
| Spike | 32 | 32 | 531 MB | 25 | 2,118 MB |
| Post-sleep | 14 | 13 | 232 MB | 25 (23 sleeping) | 509 MB |

Per-actor memory cost

Each actor that wakes up and accesses its database costs:

  • 16.6 MB for WASM SQLite linear memory
  • ~4-8 MB for native bun:sqlite KV backing store
  • Total: ~20-25 MB per actor
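The arithmetic can be put in a tiny cost model. `estimateRssMb` is a hypothetical helper, not part of the codebase; the constants are taken from the figures above:

```typescript
// Back-of-envelope per-actor cost model using the investigation numbers.
const WASM_HEAP_MB = 16.6; // WASM SQLite linear memory per actor
const NATIVE_KV_MB = 6;    // midpoint of the ~4-8 MB bun:sqlite estimate
const BASELINE_MB = 300;   // approximate steady-state RSS before wake-up

function estimateRssMb(awakeActors: number): number {
  return BASELINE_MB + awakeActors * (WASM_HEAP_MB + NATIVE_KV_MB);
}

console.log(estimateRssMb(32));  // ~1,023 MB for the 32-instance spike
console.log(estimateRssMb(200)); // ~4,820 MB, matching the "5 GB minimum"
```

Note the model under-predicts the observed 2.1 GB spike at 32 instances; the gap is consistent with the transient JS garbage and GC pressure called out below.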

No per-actor WASM leak

Controlled testing (3 wake/sleep cycles on a single actor) confirmed WASM is properly freed on sleep:

  • Wake: +1 SqliteVfs, +17 MB
  • Sleep: -1 SqliteVfs, -17 MB
  • No accumulation across cycles

Production impact

With 200+ PRs in production, if something wakes all task actors simultaneously:

  • 200 actors x 25 MB = 5 GB minimum
  • Plus JS garbage from git operations, sandbox bootstraps, etc.
  • Explains the 40 GB spike seen on Railway (multiple replicas, plus GC pressure)

The double-SQLite problem

The current file-system driver architecture means every actor runs SQLite-in-WASM on top of SQLite-native:

Actor Drizzle queries
    -> WASM SQLite (16.6 MB heap)
        -> VFS layer (copies 4KB chunks)
            -> KV store API
                -> bun:sqlite (native, ~4-8 MB page cache)
                    -> disk (.db files)

The engine driver eliminates the WASM layer entirely, using the Rust engine's native SQLite directly.
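For concreteness, a toy sketch of the read path in this stack. This is not the actual foundry VFS: the KV store is an in-memory `Map` standing in for the bun:sqlite-backed store, and `vfsRead`/`kvKey` are illustrative names.

```typescript
// Illustrative VFS-style read: fill `out` from a KV store one 4 KB
// chunk at a time. Every read makes short-lived copies -- the
// "KV data in transit" row in the breakdown, freed almost immediately.
const CHUNK_SIZE = 4 * 1024;

const kv = new Map<string, Uint8Array>(); // stand-in for the bun:sqlite KV API

function kvKey(file: string, chunkIndex: number): string {
  return `${file}:${chunkIndex}`;
}

function vfsRead(file: string, offset: number, out: Uint8Array): void {
  let written = 0;
  while (written < out.length) {
    const pos = offset + written;
    const chunk = kv.get(kvKey(file, Math.floor(pos / CHUNK_SIZE)));
    const within = pos % CHUNK_SIZE;
    const n = Math.min(CHUNK_SIZE - within, out.length - written);
    if (chunk) out.set(chunk.subarray(within, within + n), written);
    written += n; // missing chunks read as zeroes, like a sparse file
  }
}

// demo: store one chunk and read 4 bytes back from offset 8
const chunk0 = new Uint8Array(CHUNK_SIZE);
chunk0.set([1, 2, 3, 4], 8);
kv.set(kvKey("task.db", 0), chunk0);

const out = new Uint8Array(4);
vfsRead("task.db", 8, out);
console.log(Array.from(out)); // [1, 2, 3, 4]
```

Every layer in the diagram adds either a heap (WASM linear memory, native page cache) or a copy (the 4 KB chunks), which is what the engine driver collapses.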

Root causes of mass actor wake-up

  1. maybeScheduleWorkspaceRefreshes() is called twice per getTaskDetail() (once directly, once via buildTaskSummary())
  2. getWorkspace() fetches ALL task details in parallel, waking every task actor. (Dead code, removed 2026-03-17: the frontend uses the subscription system exclusively; getWorkspaceCompat and RemoteWorkspaceStore had zero callers.)
  3. Frontend retry interval is 1 second with no backoff
  4. No deduplication of concurrent collectWorkspaceGitState() calls
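A sketch of a fix for cause 4: collapse concurrent calls into a single in-flight promise per key. The `dedup` wrapper and key scheme are hypothetical; `collectWorkspaceGitState` is the existing function named above.

```typescript
// Hypothetical deduplication wrapper: concurrent callers with the same
// key share one in-flight promise instead of each triggering their own
// expensive run. The entry is cleared once the promise settles, so the
// next call after completion starts a fresh run.
const inFlight = new Map<string, Promise<unknown>>();

function dedup<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;
  const p = fn().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}

// usage sketch:
//   const state = await dedup(`git:${workspaceId}`, () =>
//     collectWorkspaceGitState(workspaceId));
```

The same pattern pairs naturally with adding exponential backoff to the frontend retry loop (cause 3).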

Next steps

  • Test with engine driver enabled to measure WASM elimination impact
  • Investigate what triggers mass actor wake-up in production (the getWorkspace fan-out was dead code; the actual trigger is still unknown)
  • Consider sharing a single WASM module across actors (mutex around non-reentrant init)
  • Enable periodic memory logging in production to capture state before OOM kills
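The shared-WASM-module idea could look roughly like this: compile the module once and serialize the non-reentrant init behind a simple promise-chain mutex. This is a sketch, not the planned implementation, and all names are illustrative. Note that sharing a module saves compiled-code memory; each `WebAssembly.Instance` would still carry its own linear memory unless instances are shared too.

```typescript
// Compile the SQLite WASM bytes once; all actors reuse the same
// WebAssembly.Module (shared compiled code, per-instance heaps remain).
let modulePromise: Promise<WebAssembly.Module> | null = null;

function getSharedModule(bytes: BufferSource): Promise<WebAssembly.Module> {
  modulePromise ??= WebAssembly.compile(bytes);
  return modulePromise;
}

// Minimal async mutex for the non-reentrant init path: callers queue on
// the tail of a promise chain, so at most one critical section runs.
let tail: Promise<void> = Promise.resolve();

function withLock<T>(fn: () => Promise<T>): Promise<T> {
  const result = tail.then(fn);
  // Swallow errors in the chain so one failed init doesn't wedge the lock.
  tail = result.then(() => undefined, () => undefined);
  return result;
}

// usage sketch:
//   const mod = await getSharedModule(sqliteWasmBytes); // hypothetical bytes
//   await withLock(() => initInstanceForActor(mod));    // hypothetical init
```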