feat(foundry): memory investigation tooling and VFS pool spec

Add memory monitoring instrumentation, investigation findings, and
SQLite VFS pool design spec for addressing WASM SQLite memory spikes.

- Add /debug/memory endpoint and periodic memory logging (dev only)
- Add mem-monitor.sh script for continuous memory profiling with
  automatic heap snapshot capture on spike detection
- Add configureRunnerPool to registry setup for engine driver support
- Document memory investigation findings (per-actor cost, spike behavior)
- Write SQLite VFS pool spec for bin-packing actors onto shared WASM instances
- Add foundry-mem-monitor and foundry-dev-engine justfile recipes
- Add compose.dev.yaml engine driver and platform support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nathan Flurry 2026-03-17 23:46:03 -07:00
parent 7b23e519c2
commit ee99d0b318
18 changed files with 888 additions and 496 deletions

# Foundry Backend Memory Investigation
Date: 2026-03-17
## Problem
Production Railway deployment shows memory spikes from near-zero to 40+ GB when users interact with the app. Local reproduction shows spikes from ~300 MB to ~2.1 GB when opening a task workspace.
## Architecture
Each actor in the system has **two SQLite instances**:
1. **WASM SQLite** (16.6 MB per actor) - Runs Drizzle ORM queries for actor-specific tables (task data, session transcripts, etc.). Each actor gets its own `SqliteVfs` which instantiates a full `WebAssembly.Instance` with 16.6 MB linear memory.
2. **Native bun:sqlite** (~4-8 MB per actor) - Backs the KV store that the WASM SQLite's VFS reads/writes to. This is the persistence layer. Not visible in JS heap snapshots (native C memory).
## Findings
### Memory breakdown (steady state, 14 active WASM instances)
| Category | Size | % of RSS | Description |
|----------|------|----------|-------------|
| WASM SQLite heaps | 232 MB | 46% | 14 x 16.6 MB ArrayBuffers (WASM linear memory) |
| Bun native (bun:sqlite + runtime) | 225 MB | 44% | KV backing store page caches, mmap'd WAL files, Bun runtime |
| JS application objects | 27 MB | 5% | Closures, actor state, plain objects |
| Module graph | 20 MB | 4% | Compiled code, FunctionCodeBlocks, ModuleRecords |
| ArrayBuffer intermediates | 4 MB | 1% | Non-WASM buffers |
| KV data in transit | ~0 MB | 0% | 4KB chunks copied and freed immediately |
### Spike behavior
When opening a task workspace, many actors wake simultaneously:

| State | WASM Instances | SqliteVfs | WASM Heap | Actors (task) | RSS |
|-------|---------------|-----------|-----------|---------------|-----|
| Baseline | 7-9 | 6-8 | 116-149 MB | 14 | 289-309 MB |
| Spike | 32 | 32 | 531 MB | 25 | 2,118 MB |
| Post-sleep | 14 | 13 | 232 MB | 25 (23 sleeping) | 509 MB |
### Per-actor memory cost
Each actor that wakes up and accesses its database costs:
- 16.6 MB for WASM SQLite linear memory
- ~4-8 MB for native bun:sqlite KV backing store
- **Total: ~20-25 MB per actor**
### No per-actor WASM leak
Controlled testing (3 wake/sleep cycles on a single actor) confirmed WASM is properly freed on sleep:
- Wake: +1 SqliteVfs, +17 MB
- Sleep: -1 SqliteVfs, -17 MB
- No accumulation across cycles
### Production impact
With 200+ PRs in production, if something wakes all task actors simultaneously:
- 200 actors x 25 MB = 5 GB minimum
- Plus JS garbage from git operations, sandbox bootstraps, etc.
- Explains the 40 GB spike seen on Railway (multiple replicas, plus GC pressure)
### The double-SQLite problem
The current file-system driver architecture means every actor runs SQLite-in-WASM on top of SQLite-native:
```
Actor Drizzle queries
-> WASM SQLite (16.6 MB heap)
-> VFS layer (copies 4KB chunks)
-> KV store API
-> bun:sqlite (native, ~4-8 MB page cache)
-> disk (.db files)
```
The engine driver eliminates the WASM layer entirely, using the Rust engine's native SQLite directly.
## Root causes of mass actor wake-up
1. `maybeScheduleWorkspaceRefreshes()` is called twice per `getTaskDetail()` (once directly, once via `buildTaskSummary()`)
2. ~~`getWorkspace()` fetches ALL task details in parallel, waking all task actors~~ **Dead code — removed 2026-03-17.** The frontend uses the subscription system exclusively; `getWorkspaceCompat` and `RemoteWorkspaceStore` had zero callers.
3. Frontend retry interval is 1 second with no backoff
4. No deduplication of concurrent `collectWorkspaceGitState()` calls (a dedup sketch follows this list)
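Item 4 could be addressed with in-flight promise memoization. A minimal sketch, assuming `collectWorkspaceGitState()` is keyed by a workspace id and that a `WorkspaceGitState` result type exists (both are assumptions of this sketch, not confirmed signatures):
```typescript
// Hypothetical types/signature: the real collectWorkspaceGitState() may differ.
type WorkspaceGitState = unknown;
declare function collectWorkspaceGitState(workspaceId: string): Promise<WorkspaceGitState>;

// Concurrent callers for the same workspace share a single in-flight call.
const inFlight = new Map<string, Promise<WorkspaceGitState>>();

function collectWorkspaceGitStateDeduped(workspaceId: string): Promise<WorkspaceGitState> {
  const existing = inFlight.get(workspaceId);
  if (existing) return existing;

  const pending = collectWorkspaceGitState(workspaceId).finally(() => {
    // Drop the cache entry once settled so later calls re-run the work.
    inFlight.delete(workspaceId);
  });
  inFlight.set(workspaceId, pending);
  return pending;
}
```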
## Next steps
- [ ] Test with engine driver enabled to measure WASM elimination impact
- [ ] Investigate what triggers mass actor wake-up in production (the `getWorkspace` fan-out was dead code; the actual trigger is still unknown)
- [ ] Consider sharing a single WASM module across actors (mutex around non-reentrant init)
- [ ] Enable periodic memory logging in production to capture state before OOM kills

# SQLite VFS Pool Spec
Date: 2026-03-17
Package: `@rivetkit/sqlite-vfs`
Scope: WASM SQLite only (not Cloudflare D1 driver)
## Problem
Each actor gets its own WASM SQLite instance via `SqliteVfs`, allocating 16.6 MB
of linear memory per instance. With 200+ actors waking simultaneously, this
causes multi-GB memory spikes (40 GB observed in production).
## Design
### Pool model
A `SqliteVfsPool` manages N WASM SQLite instances. Actors are bin-packed onto
instances via sticky assignment. The pool scales instances up to a configured
max as actors arrive, and scales down (after a grace period) when instances have
zero assigned actors.
### Configuration
```typescript
interface SqliteVfsPoolConfig {
  /** Max actors sharing one WASM instance. Default: 50. */
  actorsPerInstance: number;
  /** Max WASM instances the pool will create. Default: Infinity. */
  maxInstances?: number;
  /** Grace period before destroying an empty instance. Default: 30_000ms. */
  idleDestroyMs?: number;
}
```
**Sizing guide**: each WASM instance handles ~13 SQLite ops/sec at a 15 ms KV RTT
(≈66 KV ops/sec divided by ~5 KV ops per SQLite operation). For a target aggregate
throughput of X SQLite ops/sec across all actors, set
`actorsPerInstance = totalActors / ceil(X / 13)`.
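As a worked example of this formula (the 13 ops/sec per instance figure is the estimate above, not a measurement; the helper name is illustrative, not part of the package API):
```typescript
// Sketch of the sizing rule from the guide above.
function suggestActorsPerInstance(
  totalActors: number,
  targetOpsPerSec: number, // aggregate SQLite ops/sec across all actors
  opsPerInstance = 13, // ~66 KV ops/sec at 15 ms RTT / ~5 KV ops per SQLite op
): number {
  const instances = Math.ceil(targetOpsPerSec / opsPerInstance);
  return Math.max(1, Math.ceil(totalActors / instances));
}

// 200 actors needing ~52 SQLite ops/sec aggregate -> 4 instances -> 50 actors each.
suggestActorsPerInstance(200, 52); // 50
```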
### Actor-to-instance assignment
Sticky assignment: once an actor is assigned to an instance, it stays there
until it releases (actor sleep/destroy). Assignment uses bin-packing: pick the
instance with the most actors that still has capacity. If all instances are
full, create a new one (up to `maxInstances`).
```
acquire(actorId) -> PooledSqliteHandle
1. If actorId already assigned, return existing handle
2. Find instance with most actors that has capacity (< actorsPerInstance)
3. If none found and instanceCount < maxInstances, create new instance
4. If none found and at max, wait (queue)
5. Assign actorId to instance, return handle
release(actorId)
1. Remove actorId from instance's assignment set
2. If instance has zero actors, start idle timer
3. On idle timer expiry, destroy instance (reclaim 16.6 MB)
4. Cancel idle timer if a new actor is assigned before expiry
```
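A minimal TypeScript sketch of the acquire path under this design. `SqliteVfsPoolConfig`, `SqliteVfs`, and `PooledSqliteHandle` are the types named in this spec; everything else (field names, error handling) is illustrative, and the queueing branch of step 4 is elided:
```typescript
class SqliteVfsPool {
  #instances = new Map<number, { vfs: SqliteVfs; actors: Set<string> }>();
  #assignments = new Map<string, number>(); // actorId -> instanceId
  #nextInstanceId = 0;

  constructor(readonly config: SqliteVfsPoolConfig) {}

  acquire(actorId: string): PooledSqliteHandle {
    // 1. Sticky: reuse an existing assignment.
    const assigned = this.#assignments.get(actorId);
    if (assigned !== undefined) {
      return new PooledSqliteHandle(this, assigned, actorId);
    }

    // 2. Bin-pack: pick the fullest instance that still has capacity.
    let best: number | undefined;
    let bestCount = -1;
    for (const [id, inst] of this.#instances) {
      const count = inst.actors.size;
      if (count < this.config.actorsPerInstance && count > bestCount) {
        best = id;
        bestCount = count;
      }
    }

    // 3. All full (or none yet): create a new instance, up to maxInstances.
    if (best === undefined) {
      if (this.#instances.size >= (this.config.maxInstances ?? Infinity)) {
        throw new Error("pool at maxInstances (queueing elided in this sketch)");
      }
      best = this.#nextInstanceId++;
      this.#instances.set(best, { vfs: new SqliteVfs(), actors: new Set() });
    }

    // 4. Record the sticky assignment and hand back a handle.
    // (The full version would also cancel any pending idle-destroy timer here.)
    this.#instances.get(best)!.actors.add(actorId);
    this.#assignments.set(actorId, best);
    return new PooledSqliteHandle(this, best, actorId);
  }

  getInstance(instanceId: number): SqliteVfs {
    return this.#instances.get(instanceId)!.vfs;
  }
}
```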
### Locking mechanism
The existing `#sqliteMutex` on `SqliteVfs` already serializes SQLite operations
within one instance. This is the right level: each individual xRead/xWrite call
acquires the mutex, does its async KV operation, and releases. No change needed
to the mutex itself.
Multiple databases on the same instance share the mutex. This means if actor A
is doing an xRead (15ms), actor B on the same instance waits. This is the
intentional serialization — asyncify cannot handle concurrent suspensions on the
same WASM module.
The pool does NOT add a higher-level lock. The per-instance `#sqliteMutex`
handles all serialization. The pool only manages assignment and lifecycle.
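For reference, the kind of promise-chain mutex this describes could be sketched as follows (illustrative only; the real `#sqliteMutex` in `SqliteVfs` may be implemented differently, and the `xRead` call shapes in the usage comment are assumptions):
```typescript
// Minimal FIFO async mutex sketch; not the actual #sqliteMutex implementation.
class AsyncMutex {
  #tail: Promise<void> = Promise.resolve();

  /** Run fn exclusively; callers are serialized in arrival order. */
  run<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.#tail.then(fn);
    // Keep the chain alive even when fn rejects.
    this.#tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}

// Actor A's xRead (~15 ms against KV) and actor B's xRead on the same instance
// both go through the one mutex, so B simply waits for A to finish:
//   await mutex.run(() => vfs.xRead(handleA, bufA, offsetA));
//   await mutex.run(() => vfs.xRead(handleB, bufB, offsetB));
```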
### Multiple databases per instance
Currently `SqliteSystem.registerFile()` enforces one main database file per VFS.
This constraint must be lifted to allow multiple actors' databases to coexist.
**Change**: `SqliteSystem` tracks multiple registered files in a `Map<string, KvVfsOptions>`
instead of a single `#mainFileName`. The VFS callbacks (`xRead`, `xWrite`, etc.)
already receive the file handle, so they can look up the correct options per file.
Each actor opens its own database file (named by actorId) on the shared VFS.
Multiple databases can be open simultaneously on the same WASM instance. The
`#sqliteMutex` ensures only one SQLite call executes at a time.
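A sketch of the lifted constraint, assuming `KvVfsOptions` remains the per-file options type (stubbed here for self-containment); the helper names (`optionsFor`, `unregisterFile`) and the lookup-by-name detail are illustrative:
```typescript
// Placeholder for the real options type exported by @rivetkit/sqlite-vfs.
type KvVfsOptions = Record<string, unknown>;

class SqliteSystem {
  // Replaces the single #mainFileName / #mainFileOptions pair.
  #files = new Map<string, KvVfsOptions>();

  registerFile(fileName: string, options: KvVfsOptions): void {
    if (this.#files.has(fileName)) {
      throw new Error(`file already registered: ${fileName}`);
    }
    this.#files.set(fileName, options);
  }

  /** VFS callbacks (xRead, xWrite, ...) resolve the options for the file they were invoked on. */
  optionsFor(fileName: string): KvVfsOptions {
    const options = this.#files.get(fileName);
    if (options === undefined) {
      throw new Error(`unregistered file: ${fileName}`);
    }
    return options;
  }

  unregisterFile(fileName: string): void {
    this.#files.delete(fileName);
  }
}
```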
### PooledSqliteHandle
The handle returned to actors wraps a reference to the pool and its assigned
instance. It exposes the same `open()` interface as `SqliteVfs`.
```typescript
class PooledSqliteHandle {
  readonly #pool: SqliteVfsPool;
  readonly #instanceId: number;
  readonly #actorId: string;

  constructor(pool: SqliteVfsPool, instanceId: number, actorId: string) {
    this.#pool = pool;
    this.#instanceId = instanceId;
    this.#actorId = actorId;
  }

  /** Open a database on this handle's assigned WASM instance. */
  async open(fileName: string, options: KvVfsOptions): Promise<Database> {
    const vfs = this.#pool.getInstance(this.#instanceId);
    return vfs.open(fileName, options);
  }

  /** Release this handle back to the pool. */
  async destroy(): Promise<void> {
    this.#pool.release(this.#actorId);
  }
}
```
### Integration with drivers
The `ActorDriver.createSqliteVfs()` method currently returns `new SqliteVfs()`.
With pooling:
```typescript
// Before
async createSqliteVfs(): Promise<SqliteVfs> {
  return new SqliteVfs();
}

// After
async createSqliteVfs(actorId: string): Promise<PooledSqliteHandle> {
  return this.#vfsPool.acquire(actorId);
}
```
The `PooledSqliteHandle` must satisfy the same interface that actors expect from
`SqliteVfs` (specifically the `open()` and `destroy()` methods). Either:
- `PooledSqliteHandle` implements the `SqliteVfs` interface (duck typing)
- Or extract an interface type that both implement
The actor instance code in `mod.ts` calls `this.#sqliteVfs = await driver.createSqliteVfs()`.
It then passes `this.#sqliteVfs` to the DB provider which calls `.open()`. On
cleanup it calls `.destroy()`. The pooled handle supports both.
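If the interface-extraction option is chosen, it could look like this (the name `SqliteVfsLike` and the placeholder types are assumptions of this sketch, not existing exports):
```typescript
// Placeholder types; the real Database and KvVfsOptions come from @rivetkit/sqlite-vfs.
type Database = unknown;
type KvVfsOptions = Record<string, unknown>;

/** The subset of SqliteVfs that actor instances actually use. */
interface SqliteVfsLike {
  open(fileName: string, options: KvVfsOptions): Promise<Database>;
  destroy(): Promise<void>;
}

// Both SqliteVfs and PooledSqliteHandle satisfy this shape, so the actor field
// and the driver return type can both be typed against it:
//   #sqliteVfs: SqliteVfsLike
//   createSqliteVfs(actorId: string): Promise<SqliteVfsLike>
```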
### Scale-up and scale-down
**Scale-up**: new instance created lazily on `acquire()` when all existing
instances are at capacity. WASM module is loaded in `#ensureInitialized()` on
first `open()` call (existing lazy behavior). Cost: ~16.6 MB + WASM compile time.
**Scale-down**: when last actor releases from an instance, start a timer
(`idleDestroyMs`). If no new actor is assigned before the timer fires, call
`sqliteVfs.destroy()` to free the WASM module. This reclaims 16.6 MB.
If an actor is assigned to an instance that is in the idle-destroy grace period,
cancel the timer and reuse the instance.
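A self-contained sketch of the grace-period logic, assuming an instance record shaped like the one in the acquire sketch above (the record and function names here are illustrative):
```typescript
interface PoolInstance {
  vfs: { destroy(): Promise<void> }; // the SqliteVfs to tear down
  actors: Set<string>;
  idleTimer?: ReturnType<typeof setTimeout>;
}

function onLastActorReleased(inst: PoolInstance, idleDestroyMs = 30_000): void {
  // Start the grace period when the instance becomes empty.
  inst.idleTimer = setTimeout(() => {
    void inst.vfs.destroy(); // reclaims the 16.6 MB of WASM linear memory
  }, idleDestroyMs);
}

function onActorAssigned(inst: PoolInstance): void {
  // Reusing an instance that is still in its grace period: cancel the teardown.
  if (inst.idleTimer !== undefined) {
    clearTimeout(inst.idleTimer);
    inst.idleTimer = undefined;
  }
}
```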
### Memory budget examples
| Actors | actorsPerInstance | Instances | WASM Memory |
|--------|-------------------|-----------|-------------|
| 50 | 50 | 1 | 17 MB |
| 200 | 50 | 4 | 66 MB |
| 500 | 50 | 10 | 166 MB |
| 200 | 25 | 8 | 133 MB |
Compare to current: 200 actors = 200 instances = 3,320 MB.
## Changes required
### `@rivetkit/sqlite-vfs`
1. **`SqliteSystem`**: Remove single-main-file constraint. Replace
`#mainFileName`/`#mainFileOptions` with a `Map<string, KvVfsOptions>`.
Update `registerFile()` to insert into the map. Update VFS callbacks to look
up options by file handle.
2. **`SqliteVfs`**: Allow multiple `open()` calls with different filenames.
Each returns an independent `Database` handle. All share the same WASM
module and `#sqliteMutex`.
3. **New `SqliteVfsPool`**: Manages instance lifecycle, actor assignment, and
scale-up/scale-down. Exported from the package.
4. **New `PooledSqliteHandle`**: Returned by `pool.acquire()`. Implements the
subset of `SqliteVfs` that actors use (`open`, `destroy`).
### `rivetkit` (drivers)
5. **`ActorDriver` interface**: `createSqliteVfs()` signature adds `actorId`
parameter so the pool can do sticky assignment.
6. **File-system driver**: Create `SqliteVfsPool` once, call
`pool.acquire(actorId)` in `createSqliteVfs()`.
7. **Engine driver**: Same change as file-system driver.
8. **Actor instance (`mod.ts`)**: Pass `actorId` to `driver.createSqliteVfs(actorId)`.
No other changes needed — the handle quacks like `SqliteVfs`.
### Not changed
- Cloudflare driver (uses D1, no WASM)
- KV storage layer (unchanged)
- Drizzle integration (unchanged, still receives a `Database` from `open()`)
- `#sqliteMutex` behavior (unchanged, already serializes correctly)
## Risks
1. **Hot instance**: If one instance has 50 chatty actors, the mutex contention
increases latency for all of them. Mitigation: monitor mutex wait time, tune
`actorsPerInstance` down if needed.
2. **WASM memory growth**: SQLite can grow WASM linear memory via
`memory.grow()`. If one actor causes growth, all actors on that instance pay
the cost. In practice, SQLite's page cache is small and growth is rare.
3. **Database close ordering**: If actor A crashes without closing its DB, the
open file handle leaks inside the VFS. The pool must track open databases
   and force-close on `release()`; a tracking sketch follows this list.
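A sketch for risk 3; the `close()` method on the database handle and the tracking shape are assumptions here, not confirmed API:
```typescript
interface TrackedDatabase {
  close(): void;
}

const openDbs = new Map<string, TrackedDatabase[]>(); // actorId -> open handles

function trackOpen(actorId: string, db: TrackedDatabase): void {
  const list = openDbs.get(actorId) ?? [];
  list.push(db);
  openDbs.set(actorId, list);
}

// release(actorId) would call this before clearing the actor's assignment.
function forceCloseAll(actorId: string): void {
  for (const db of openDbs.get(actorId) ?? []) {
    try {
      db.close();
    } catch {
      // Already closed or the VFS was torn down; nothing left to leak.
    }
  }
  openDbs.delete(actorId);
}
```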