# Foundry Backend Memory Investigation

Date: 2026-03-17

## Problem

Production Railway deployment shows memory spikes from near-zero to 40+ GB when users interact with the app. Local reproduction shows spikes from ~300 MB to ~2.1 GB when opening a task workspace.

## Architecture

Each actor in the system has **two SQLite instances**:

1. **WASM SQLite** (16.6 MB per actor) - Runs Drizzle ORM queries for actor-specific tables (task data, session transcripts, etc.). Each actor gets its own `SqliteVfs`, which instantiates a full `WebAssembly.Instance` with 16.6 MB of linear memory.
2. **Native bun:sqlite** (~4-8 MB per actor) - Backs the KV store that the WASM SQLite's VFS reads/writes to. This is the persistence layer. Not visible in JS heap snapshots (native C memory).

## Findings

### Memory breakdown (steady state, 14 active WASM instances)

| Category | Size | % of RSS | Description |
|----------|------|----------|-------------|
| WASM SQLite heaps | 232 MB | 46% | 14 x 16.6 MB ArrayBuffers (WASM linear memory) |
| Bun native (bun:sqlite + runtime) | 225 MB | 44% | KV backing store page caches, mmap'd WAL files, Bun runtime |
| JS application objects | 27 MB | 5% | Closures, actor state, plain objects |
| Module graph | 20 MB | 4% | Compiled code, FunctionCodeBlocks, ModuleRecords |
| ArrayBuffer intermediates | 4 MB | 1% | Non-WASM buffers |
| KV data in transit | ~0 MB | 0% | 4 KB chunks copied and freed immediately |

### Spike behavior

When opening a task workspace, many actors wake simultaneously:

| State | WASM instances | SqliteVfs | WASM heap | Actors (task) | RSS |
|-------|---------------|-----------|-----------|---------------|-----|
| Baseline | 7-9 | 6-8 | 116-149 MB | 14 | 289-309 MB |
| Spike | 32 | 32 | 531 MB | 25 | 2,118 MB |
| Post-sleep | 14 | 13 | 232 MB | 25 (23 sleeping) | 509 MB |

### Per-actor memory cost

Each actor that wakes up and accesses its database costs:

- 16.6 MB for WASM SQLite linear memory
- ~4-8 MB for the native bun:sqlite KV backing store
- **Total: ~20-25 MB per actor**

### No per-actor WASM leak

Controlled testing (3 wake/sleep cycles on a single actor) confirmed that WASM memory is properly freed on sleep:

- Wake: +1 SqliteVfs, +17 MB
- Sleep: -1 SqliteVfs, -17 MB
- No accumulation across cycles

### Production impact

With 200+ PRs in production, if something wakes all task actors simultaneously:

- 200 actors x 25 MB = 5 GB minimum
- Plus JS garbage from git operations, sandbox bootstraps, etc.
- Explains the 40 GB spike seen on Railway (multiple replicas, plus GC pressure)

### The double-SQLite problem

The current file-system driver architecture means every actor runs SQLite-in-WASM on top of native SQLite:

```
Actor Drizzle queries
  -> WASM SQLite (16.6 MB heap)
  -> VFS layer (copies 4 KB chunks)
  -> KV store API
  -> bun:sqlite (native, ~4-8 MB page cache)
  -> disk (.db files)
```

The engine driver eliminates the WASM layer entirely, using the Rust engine's native SQLite directly.

## Root causes of mass actor wake-up

1. `maybeScheduleWorkspaceRefreshes()` is called twice per `getTaskDetail()` (once directly, once via `buildTaskSummary()`)
2. ~~`getWorkspace()` fetches ALL task details in parallel, waking all task actors~~ **Dead code; removed 2026-03-17.** The frontend uses the subscription system exclusively; `getWorkspaceCompat` and `RemoteWorkspaceStore` had zero callers.
3. The frontend retry interval is 1 second with no backoff
4. No deduplication of concurrent `collectWorkspaceGitState()` calls

## Next steps

- [ ] Test with the engine driver enabled to measure the impact of eliminating WASM
- [ ] Investigate what triggers mass actor wake-up in production (the `getWorkspace` fan-out was dead code; the actual trigger is still unknown)
- [ ] Consider sharing a single WASM module across actors (mutex around non-reentrant init)
- [ ] Enable periodic memory logging in production to capture state before OOM kills
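
Root cause 4 above (no deduplication of concurrent `collectWorkspaceGitState()` calls) is a standard single-flight problem: concurrent callers with the same key should share one in-flight promise rather than each triggering its own expensive operation. A minimal sketch, assuming callers can be keyed by a workspace id; `dedupe` and the key scheme are hypothetical, not the actual codebase API:

```typescript
// Single-flight sketch: concurrent callers with the same key share one
// in-flight promise instead of each triggering their own expensive call.
// `dedupe` is a hypothetical helper, not an existing function in the codebase.
const inFlight = new Map<string, Promise<unknown>>();

function dedupe<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  const pending = fn().finally(() => {
    // Clear the slot once settled so a later call re-runs fn fresh.
    inFlight.delete(key);
  });
  inFlight.set(key, pending);
  return pending;
}
```

A call site would then wrap the expensive operation, e.g. `dedupe(workspaceId, () => collectWorkspaceGitState(workspaceId))`, so a burst of concurrent requests runs the git state collection at most once per workspace while it is in flight.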