sandbox-agent/foundry/research/memory-investigation.md
Nathan Flurry ee99d0b318 feat(foundry): memory investigation tooling and VFS pool spec
Add memory monitoring instrumentation, investigation findings, and
SQLite VFS pool design spec for addressing WASM SQLite memory spikes.

- Add /debug/memory endpoint and periodic memory logging (dev only)
- Add mem-monitor.sh script for continuous memory profiling with
  automatic heap snapshot capture on spike detection
- Add configureRunnerPool to registry setup for engine driver support
- Document memory investigation findings (per-actor cost, spike behavior)
- Write SQLite VFS pool spec for bin-packing actors onto shared WASM instances
- Add foundry-mem-monitor and foundry-dev-engine justfile recipes
- Add compose.dev.yaml engine driver and platform support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 23:46:03 -07:00


Foundry Backend Memory Investigation

Date: 2026-03-17

Problem

Production Railway deployment shows memory spikes from near-zero to 40+ GB when users interact with the app. Local reproduction shows spikes from ~300 MB to ~2.1 GB when opening a task workspace.

Architecture

Each actor in the system has two SQLite instances:

  1. WASM SQLite (16.6 MB per actor) - Runs Drizzle ORM queries for actor-specific tables (task data, session transcripts, etc.). Each actor gets its own SqliteVfs which instantiates a full WebAssembly.Instance with 16.6 MB linear memory.

  2. Native bun:sqlite (~4-8 MB per actor) - Backs the KV store that the WASM SQLite's VFS reads/writes to. This is the persistence layer. Not visible in JS heap snapshots (native C memory).
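The 16.6 MB per instance is the WASM linear memory itself. A minimal sketch of why each `WebAssembly.Instance` shows up as its own multi-megabyte ArrayBuffer in heap snapshots (the ~254-page figure is an illustration chosen to match 16.6 MB; the real page count is fixed by the SQLite WASM build):

```typescript
// Each WebAssembly.Memory page is 64 KiB. ~254 pages corresponds to the
// 16.6 MB linear memory observed per SqliteVfs instance; 254 is an
// illustrative figure, not taken from the actual SQLite WASM build.
const PAGE_SIZE = 64 * 1024;
const memory = new WebAssembly.Memory({ initial: 254 });

// The linear memory is one contiguous ArrayBuffer, which is why N live
// instances appear as N separate ~16.6 MB ArrayBuffers in heap snapshots.
console.log(memory.buffer.byteLength);             // 16,646,144 bytes
console.log(memory.buffer.byteLength / PAGE_SIZE); // 254 pages
```

The buffer is allocated up front at instantiation, so the cost is paid even before the actor runs a single query.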

Findings

Memory breakdown (steady state, 14 active WASM instances)

| Category | Size | % of RSS | Description |
|---|---|---|---|
| WASM SQLite heaps | 232 MB | 46% | 14 x 16.6 MB ArrayBuffers (WASM linear memory) |
| Bun native (bun:sqlite + runtime) | 225 MB | 44% | KV backing store page caches, mmap'd WAL files, Bun runtime |
| JS application objects | 27 MB | 5% | Closures, actor state, plain objects |
| Module graph | 20 MB | 4% | Compiled code, FunctionCodeBlocks, ModuleRecords |
| ArrayBuffer intermediates | 4 MB | 1% | Non-WASM buffers |
| KV data in transit | ~0 MB | 0% | 4 KB chunks copied and freed immediately |

Spike behavior

When opening a task workspace, many actors wake simultaneously:

| State | WASM Instances | SqliteVfs | WASM Heap | Actors (task) | RSS |
|---|---|---|---|---|---|
| Baseline | 7-9 | 6-8 | 116-149 MB | 14 | 289-309 MB |
| Spike | 32 | 32 | 531 MB | 25 | 2,118 MB |
| Post-sleep | 14 | 13 | 232 MB | 25 (23 sleeping) | 509 MB |

Per-actor memory cost

Each actor that wakes up and accesses its database costs:

  • 16.6 MB for WASM SQLite linear memory
  • ~4-8 MB for native bun:sqlite KV backing store
  • Total: ~20-25 MB per actor
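The arithmetic can be put in a tiny cost model. `estimateRssMb` is a hypothetical helper, not part of the codebase; the constants are taken from the figures above:

```typescript
// Back-of-envelope per-actor cost model using the investigation numbers.
const WASM_HEAP_MB = 16.6; // WASM SQLite linear memory per actor
const NATIVE_KV_MB = 6;    // midpoint of the ~4-8 MB bun:sqlite estimate
const BASELINE_MB = 300;   // approximate steady-state RSS before wake-up

function estimateRssMb(awakeActors: number): number {
  return BASELINE_MB + awakeActors * (WASM_HEAP_MB + NATIVE_KV_MB);
}

console.log(estimateRssMb(32));  // ~1,023 MB for the 32-instance spike
console.log(estimateRssMb(200)); // ~4,820 MB, matching the "5 GB minimum"
```

Note the model under-predicts the observed 2.1 GB spike at 32 instances; the gap is consistent with the transient JS garbage and GC pressure called out below.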

No per-actor WASM leak

Controlled testing (3 wake/sleep cycles on a single actor) confirmed WASM is properly freed on sleep:

  • Wake: +1 SqliteVfs, +17 MB
  • Sleep: -1 SqliteVfs, -17 MB
  • No accumulation across cycles

Production impact

With 200+ PRs in production, if something wakes all task actors simultaneously:

  • 200 actors x 25 MB = 5 GB minimum
  • Plus JS garbage from git operations, sandbox bootstraps, etc.
  • Explains the 40 GB spike seen on Railway (multiple replicas, plus GC pressure)

The double-SQLite problem

The current file-system driver architecture means every actor runs SQLite-in-WASM on top of SQLite-native:

Actor Drizzle queries
    -> WASM SQLite (16.6 MB heap)
        -> VFS layer (copies 4KB chunks)
            -> KV store API
                -> bun:sqlite (native, ~4-8 MB page cache)
                    -> disk (.db files)

The engine driver eliminates the WASM layer entirely, using the Rust engine's native SQLite directly.
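For concreteness, a toy sketch of the read path in this stack. This is not the actual foundry VFS: the KV store is an in-memory `Map` standing in for the bun:sqlite-backed store, and `vfsRead`/`kvKey` are illustrative names.

```typescript
// Illustrative VFS-style read: fill `out` from a KV store one 4 KB
// chunk at a time. Every read makes short-lived copies -- the
// "KV data in transit" row in the breakdown, freed almost immediately.
const CHUNK_SIZE = 4 * 1024;

const kv = new Map<string, Uint8Array>(); // stand-in for the bun:sqlite KV API

function kvKey(file: string, chunkIndex: number): string {
  return `${file}:${chunkIndex}`;
}

function vfsRead(file: string, offset: number, out: Uint8Array): void {
  let written = 0;
  while (written < out.length) {
    const pos = offset + written;
    const chunk = kv.get(kvKey(file, Math.floor(pos / CHUNK_SIZE)));
    const within = pos % CHUNK_SIZE;
    const n = Math.min(CHUNK_SIZE - within, out.length - written);
    if (chunk) out.set(chunk.subarray(within, within + n), written);
    written += n; // missing chunks read as zeroes, like a sparse file
  }
}

// demo: store one chunk and read 4 bytes back from offset 8
const chunk0 = new Uint8Array(CHUNK_SIZE);
chunk0.set([1, 2, 3, 4], 8);
kv.set(kvKey("task.db", 0), chunk0);

const out = new Uint8Array(4);
vfsRead("task.db", 8, out);
console.log(Array.from(out)); // [1, 2, 3, 4]
```

Every layer in the diagram adds either a heap (WASM linear memory, native page cache) or a copy (the 4 KB chunks), which is what the engine driver collapses.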

Root causes of mass actor wake-up

  1. maybeScheduleWorkspaceRefreshes() is called twice per getTaskDetail() (once directly, once via buildTaskSummary())
  2. getWorkspace() fetches ALL task details in parallel, waking every task actor. (Dead code, removed 2026-03-17: the frontend uses the subscription system exclusively; getWorkspaceCompat and RemoteWorkspaceStore had zero callers.)
  3. Frontend retry interval is 1 second with no backoff
  4. No deduplication of concurrent collectWorkspaceGitState() calls
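A sketch of a fix for cause 4: collapse concurrent calls into a single in-flight promise per key. The `dedup` wrapper and key scheme are hypothetical; `collectWorkspaceGitState` is the existing function named above.

```typescript
// Hypothetical deduplication wrapper: concurrent callers with the same
// key share one in-flight promise instead of each triggering their own
// expensive run. The entry is cleared once the promise settles, so the
// next call after completion starts a fresh run.
const inFlight = new Map<string, Promise<unknown>>();

function dedup<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;
  const p = fn().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}

// usage sketch:
//   const state = await dedup(`git:${workspaceId}`, () =>
//     collectWorkspaceGitState(workspaceId));
```

The same pattern pairs naturally with adding exponential backoff to the frontend retry loop (cause 3).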

Next steps

  • Test with engine driver enabled to measure WASM elimination impact
  • Investigate what triggers mass actor wake-up in production (the getWorkspace fan-out was dead code; the actual trigger is still unknown)
  • Consider sharing a single WASM module across actors (mutex around non-reentrant init)
  • Enable periodic memory logging in production to capture state before OOM kills
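The shared-WASM-module idea could look roughly like this: compile the module once and serialize the non-reentrant init behind a simple promise-chain mutex. This is a sketch, not the planned implementation, and all names are illustrative. Note that sharing a module saves compiled-code memory; each `WebAssembly.Instance` would still carry its own linear memory unless instances are shared too.

```typescript
// Compile the SQLite WASM bytes once; all actors reuse the same
// WebAssembly.Module (shared compiled code, per-instance heaps remain).
let modulePromise: Promise<WebAssembly.Module> | null = null;

function getSharedModule(bytes: BufferSource): Promise<WebAssembly.Module> {
  modulePromise ??= WebAssembly.compile(bytes);
  return modulePromise;
}

// Minimal async mutex for the non-reentrant init path: callers queue on
// the tail of a promise chain, so at most one critical section runs.
let tail: Promise<void> = Promise.resolve();

function withLock<T>(fn: () => Promise<T>): Promise<T> {
  const result = tail.then(fn);
  // Swallow errors in the chain so one failed init doesn't wedge the lock.
  tail = result.then(() => undefined, () => undefined);
  return result;
}

// usage sketch:
//   const mod = await getSharedModule(sqliteWasmBytes); // hypothetical bytes
//   await withLock(() => initInstanceForActor(mod));    // hypothetical init
```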