mirror of
https://github.com/harivansh-afk/sandbox-agent.git
synced 2026-04-15 09:01:17 +00:00
Add memory monitoring instrumentation, investigation findings, and SQLite VFS pool design spec for addressing WASM SQLite memory spikes.

- Add /debug/memory endpoint and periodic memory logging (dev only)
- Add mem-monitor.sh script for continuous memory profiling with automatic heap snapshot capture on spike detection
- Add configureRunnerPool to registry setup for engine driver support
- Document memory investigation findings (per-actor cost, spike behavior)
- Write SQLite VFS pool spec for bin-packing actors onto shared WASM instances
- Add foundry-mem-monitor and foundry-dev-engine justfile recipes
- Add compose.dev.yaml engine driver and platform support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
88 lines
3.9 KiB
Markdown
# Foundry Backend Memory Investigation

Date: 2026-03-17

## Problem

Production Railway deployment shows memory spikes from near-zero to 40+ GB when users interact with the app. Local reproduction shows spikes from ~300 MB to ~2.1 GB when opening a task workspace.

## Architecture

Each actor in the system has **two SQLite instances**:

1. **WASM SQLite** (16.6 MB per actor) - Runs Drizzle ORM queries for actor-specific tables (task data, session transcripts, etc.). Each actor gets its own `SqliteVfs` which instantiates a full `WebAssembly.Instance` with 16.6 MB linear memory.
2. **Native bun:sqlite** (~4-8 MB per actor) - Backs the KV store that the WASM SQLite's VFS reads/writes to. This is the persistence layer. Not visible in JS heap snapshots (native C memory).

## Findings

### Memory breakdown (steady state, 14 active WASM instances)
| Category | Size | % of RSS | Description |
|----------|------|----------|-------------|
| WASM SQLite heaps | 232 MB | 46% | 14 x 16.6 MB ArrayBuffers (WASM linear memory) |
| Bun native (bun:sqlite + runtime) | 225 MB | 44% | KV backing store page caches, mmap'd WAL files, Bun runtime |
| JS application objects | 27 MB | 5% | Closures, actor state, plain objects |
| Module graph | 20 MB | 4% | Compiled code, FunctionCodeBlocks, ModuleRecords |
| ArrayBuffer intermediates | 4 MB | 1% | Non-WASM buffers |
| KV data in transit | ~0 MB | 0% | 4KB chunks copied and freed immediately |

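As a quick arithmetic cross-check (not from the original measurements, just re-deriving the "% of RSS" column from the sizes above):

```typescript
// Measured steady-state breakdown from the table above (MB).
const breakdownMb: Record<string, number> = {
  wasmSqliteHeaps: 232,
  bunNative: 225,
  jsApplicationObjects: 27,
  moduleGraph: 20,
  arrayBufferIntermediates: 4,
};

// Sum the categories and recompute each row's share of the total.
const totalMb = Object.values(breakdownMb).reduce((sum, mb) => sum + mb, 0);
const pct = (mb: number) => Math.round((mb / totalMb) * 100);

console.log(totalMb); // 508
console.log(pct(breakdownMb.wasmSqliteHeaps)); // 46
```

The categories sum to 508 MB, which is consistent with the ~509 MB post-sleep RSS reported in the spike table below.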
### Spike behavior
When opening a task workspace, many actors wake simultaneously:
| State | WASM Instances | SqliteVfs | WASM Heap | Actors (task) | RSS |
|-------|----------------|-----------|-----------|---------------|-----|
| Baseline | 7-9 | 6-8 | 116-149 MB | 14 | 289-309 MB |
| Spike | 32 | 32 | 531 MB | 25 | 2,118 MB |
| Post-sleep | 14 | 13 | 232 MB | 25 (23 sleeping) | 509 MB |

### Per-actor memory cost
Each actor that wakes up and accesses its database costs:

- 16.6 MB for WASM SQLite linear memory
- ~4-8 MB for native bun:sqlite KV backing store
- **Total: ~20-25 MB per actor**

### No per-actor WASM leak
Controlled testing (3 wake/sleep cycles on a single actor) confirmed WASM is properly freed on sleep:

- Wake: +1 SqliteVfs, +17 MB
- Sleep: -1 SqliteVfs, -17 MB
- No accumulation across cycles

### Production impact
With 200+ PRs in production, if something wakes all task actors simultaneously:

- 200 actors x 25 MB = 5 GB minimum
- Plus JS garbage from git operations, sandbox bootstraps, etc.
- Explains the 40 GB spike seen on Railway (multiple replicas, plus GC pressure)

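The projection above can be reproduced with a small estimator. The constants come from this document's measurements; the function name is illustrative, not part of the codebase:

```typescript
// Per-actor costs measured in this investigation (MB).
const WASM_LINEAR_MEMORY_MB = 16.6; // one WebAssembly.Instance linear memory
const NATIVE_KV_MB_HIGH = 8;        // upper bound for bun:sqlite backing store

// Estimated RSS growth (MB) if `actorCount` actors wake at once.
// Intentionally a floor: JS garbage, git operations, and sandbox
// bootstraps come on top of this.
function estimateWakeSpikeMb(actorCount: number): number {
  return actorCount * (WASM_LINEAR_MEMORY_MB + NATIVE_KV_MB_HIGH);
}

console.log(estimateWakeSpikeMb(200)); // roughly 4920 MB, i.e. ~5 GB
```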
### The double-SQLite problem

The current file-system driver architecture means every actor runs SQLite-in-WASM on top of SQLite-native:

```
Actor Drizzle queries
  -> WASM SQLite (16.6 MB heap)
    -> VFS layer (copies 4KB chunks)
      -> KV store API
        -> bun:sqlite (native, ~4-8 MB page cache)
          -> disk (.db files)
```

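To make the "copies 4KB chunks" step concrete, here is a minimal sketch of what a VFS-style read over a KV store can look like. The `KvStore` interface, key scheme, and chunk size are illustrative assumptions; the real `SqliteVfs` implementation is not shown in this document:

```typescript
// Illustrative only: assemble a read request from fixed-size chunks held
// in a KV layer. Every read copies bytes out of the KV store (in the real
// system, into the WASM instance's linear memory).
const CHUNK_SIZE = 4096; // 4KB chunks, as in the diagram above

interface KvStore {
  get(key: string): Uint8Array | undefined;
}

function vfsRead(kv: KvStore, file: string, offset: number, length: number): Uint8Array {
  const out = new Uint8Array(length);
  let written = 0;
  while (written < length) {
    const pos = offset + written;
    const chunkIndex = Math.floor(pos / CHUNK_SIZE);
    // Missing chunks read as zeroes, like a sparse file.
    const chunk = kv.get(`${file}:${chunkIndex}`) ?? new Uint8Array(CHUNK_SIZE);
    const start = pos % CHUNK_SIZE;
    const take = Math.min(CHUNK_SIZE - start, length - written);
    out.set(chunk.subarray(start, start + take), written); // copy, not a view
    written += take;
  }
  return out;
}
```

The copies are transient (the "KV data in transit" row above is ~0 MB at steady state); the dominant cost is the fixed 16.6 MB linear memory each WASM instance holds regardless of workload.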
The engine driver eliminates the WASM layer entirely, using the Rust engine's native SQLite directly.

## Root causes of mass actor wake-up

1. `maybeScheduleWorkspaceRefreshes()` is called twice per `getTaskDetail()` (once directly, once via `buildTaskSummary()`)
2. ~~`getWorkspace()` fetches ALL task details in parallel, waking all task actors~~ **Dead code — removed 2026-03-17.** The frontend uses the subscription system exclusively; `getWorkspaceCompat` and `RemoteWorkspaceStore` had zero callers.
3. Frontend retry interval is 1 second with no backoff
4. No deduplication of concurrent `collectWorkspaceGitState()` calls

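Cause 4 is a classic single-flight problem. A hedged sketch of one possible fix — the `dedupe` helper and keying scheme below are assumptions for illustration, not the codebase's actual API:

```typescript
// Illustrative single-flight wrapper: concurrent callers with the same key
// share one in-flight promise instead of each triggering its own expensive
// operation (and its own actor wake-ups).
const inFlight = new Map<string, Promise<unknown>>();

function dedupe<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;
  const p = fn().finally(() => inFlight.delete(key)); // clear on settle
  inFlight.set(key, p);
  return p;
}

// Hypothetical usage: all concurrent callers share one git scan.
// const state = await dedupe(`git:${workspaceId}`, () => collectWorkspaceGitState(workspaceId));
```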
## Next steps
- [ ] Test with engine driver enabled to measure WASM elimination impact
- [ ] Investigate what triggers mass actor wake-up in production (the `getWorkspace` fan-out was dead code; the actual trigger is still unknown)
- [ ] Consider sharing a single WASM module across actors (mutex around non-reentrant init)
- [ ] Enable periodic memory logging in production to capture state before OOM kills