mirror of
https://github.com/harivansh-afk/sandbox-agent.git
synced 2026-04-16 20:01:27 +00:00
feat(foundry): memory investigation tooling and VFS pool spec
Add memory monitoring instrumentation, investigation findings, and SQLite VFS pool design spec for addressing WASM SQLite memory spikes.

- Add /debug/memory endpoint and periodic memory logging (dev only)
- Add mem-monitor.sh script for continuous memory profiling with automatic heap snapshot capture on spike detection
- Add configureRunnerPool to registry setup for engine driver support
- Document memory investigation findings (per-actor cost, spike behavior)
- Write SQLite VFS pool spec for bin-packing actors onto shared WASM instances
- Add foundry-mem-monitor and foundry-dev-engine justfile recipes
- Add compose.dev.yaml engine driver and platform support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 7b23e519c2
commit ee99d0b318
18 changed files with 888 additions and 496 deletions
88
foundry/research/memory-investigation.md
Normal file
# Foundry Backend Memory Investigation

Date: 2026-03-17

## Problem

Production Railway deployment shows memory spikes from near-zero to 40+ GB when users interact with the app. Local reproduction shows spikes from ~300 MB to ~2.1 GB when opening a task workspace.

## Architecture

Each actor in the system has **two SQLite instances**:

1. **WASM SQLite** (16.6 MB per actor) - Runs Drizzle ORM queries for actor-specific tables (task data, session transcripts, etc.). Each actor gets its own `SqliteVfs` which instantiates a full `WebAssembly.Instance` with 16.6 MB linear memory.

2. **Native bun:sqlite** (~4-8 MB per actor) - Backs the KV store that the WASM SQLite's VFS reads/writes to. This is the persistence layer. Not visible in JS heap snapshots (native C memory).

## Findings

### Memory breakdown (steady state, 14 active WASM instances)

| Category | Size | % of RSS | Description |
|----------|------|----------|-------------|
| WASM SQLite heaps | 232 MB | 46% | 14 x 16.6 MB ArrayBuffers (WASM linear memory) |
| Bun native (bun:sqlite + runtime) | 225 MB | 44% | KV backing store page caches, mmap'd WAL files, Bun runtime |
| JS application objects | 27 MB | 5% | Closures, actor state, plain objects |
| Module graph | 20 MB | 4% | Compiled code, FunctionCodeBlocks, ModuleRecords |
| ArrayBuffer intermediates | 4 MB | 1% | Non-WASM buffers |
| KV data in transit | ~0 MB | 0% | 4KB chunks copied and freed immediately |

### Spike behavior

When opening a task workspace, many actors wake simultaneously:

| State | WASM Instances | SqliteVfs | WASM Heap | Actors (task) | RSS |
|-------|----------------|-----------|-----------|---------------|-----|
| Baseline | 7-9 | 6-8 | 116-149 MB | 14 | 289-309 MB |
| Spike | 32 | 32 | 531 MB | 25 | 2,118 MB |
| Post-sleep | 14 | 13 | 232 MB | 25 (23 sleeping) | 509 MB |

### Per-actor memory cost

Each actor that wakes up and accesses its database costs:

- 16.6 MB for WASM SQLite linear memory
- ~4-8 MB for native bun:sqlite KV backing store
- **Total: ~20-25 MB per actor**

### No per-actor WASM leak

Controlled testing (3 wake/sleep cycles on a single actor) confirmed WASM is properly freed on sleep:

- Wake: +1 SqliteVfs, +17 MB
- Sleep: -1 SqliteVfs, -17 MB
- No accumulation across cycles

### Production impact

With 200+ PRs in production, if something wakes all task actors simultaneously:

- 200 actors x 25 MB = 5 GB minimum
- Plus JS garbage from git operations, sandbox bootstraps, etc.
- Explains the 40 GB spike seen on Railway (multiple replicas, plus GC pressure)

### The double-SQLite problem

The current file-system driver architecture means every actor runs SQLite-in-WASM on top of SQLite-native:

```
Actor Drizzle queries
  -> WASM SQLite (16.6 MB heap)
    -> VFS layer (copies 4KB chunks)
      -> KV store API
        -> bun:sqlite (native, ~4-8 MB page cache)
          -> disk (.db files)
```

The engine driver eliminates the WASM layer entirely, using the Rust engine's native SQLite directly.

## Root causes of mass actor wake-up

1. `maybeScheduleWorkspaceRefreshes()` is called twice per `getTaskDetail()` (once directly, once via `buildTaskSummary()`)
2. ~~`getWorkspace()` fetches ALL task details in parallel, waking all task actors~~ **Dead code — removed 2026-03-17.** The frontend uses the subscription system exclusively; `getWorkspaceCompat` and `RemoteWorkspaceStore` had zero callers.
3. Frontend retry interval is 1 second with no backoff
4. No deduplication of concurrent `collectWorkspaceGitState()` calls
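For cause 4, the usual fix is in-flight deduplication: concurrent callers for the same workspace share one pending promise instead of each triggering its own git scan. A minimal sketch, assuming a hypothetical `collect` callback standing in for the real `collectWorkspaceGitState()`:

```typescript
// Illustrative in-flight deduplication for cause 4. The `collect` parameter
// is a stand-in for the real collectWorkspaceGitState(); names here are
// hypothetical, not from the codebase.
const inFlight = new Map<string, Promise<string>>();

async function collectGitStateDeduped(
  workspaceId: string,
  collect: (id: string) => Promise<string>,
): Promise<string> {
  const pending = inFlight.get(workspaceId);
  if (pending) return pending; // join the call already in flight

  const p = collect(workspaceId).finally(() => inFlight.delete(workspaceId));
  inFlight.set(workspaceId, p);
  return p;
}
```

Once the first call settles, the map entry is cleared, so a later call starts a fresh scan.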

## Next steps

- [ ] Test with engine driver enabled to measure WASM elimination impact
- [ ] Investigate what triggers mass actor wake-up in production (the `getWorkspace` fan-out was dead code; the actual trigger is still unknown)
- [ ] Consider sharing a single WASM module across actors (mutex around non-reentrant init)
- [ ] Enable periodic memory logging in production to capture state before OOM kills
214
foundry/research/sqlite-vfs-pool-spec.md
Normal file
# SQLite VFS Pool Spec

Date: 2026-03-17
Package: `@rivetkit/sqlite-vfs`
Scope: WASM SQLite only (not Cloudflare D1 driver)

## Problem

Each actor gets its own WASM SQLite instance via `SqliteVfs`, allocating 16.6 MB
of linear memory per instance. With 200+ actors waking simultaneously, this
causes multi-GB memory spikes (40 GB observed in production).

## Design

### Pool model

A `SqliteVfsPool` manages N WASM SQLite instances. Actors are bin-packed onto
instances via sticky assignment. The pool scales instances up to a configured
max as actors arrive, and scales down (after a grace period) when instances have
zero assigned actors.

### Configuration

```typescript
interface SqliteVfsPoolConfig {
  /** Max actors sharing one WASM instance. Default: 50. */
  actorsPerInstance: number;
  /** Max WASM instances the pool will create. Default: Infinity. */
  maxInstances?: number;
  /** Grace period before destroying an empty instance. Default: 30_000ms. */
  idleDestroyMs?: number;
}
```

**Sizing guide**: each WASM instance handles ~13 SQLite ops/sec at 15ms KV RTT
(66 KV ops/sec / ~5 KV ops per SQLite operation). For a target of X ops/sec,
set `actorsPerInstance = totalActors / ceil(X / 13)`.
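The sizing arithmetic can be sketched as a small helper (the function name is illustrative, not part of the package; the outer `ceil` calls are added so the result is a whole number):

```typescript
// Sizing sketch for the guide above. The ~13 ops/sec/instance figure comes
// from the spec; actorsPerInstanceFor is a hypothetical helper name.
const OPS_PER_INSTANCE = 13;

function actorsPerInstanceFor(totalActors: number, targetOpsPerSec: number): number {
  // Instances needed to sustain the target aggregate throughput.
  const instances = Math.ceil(targetOpsPerSec / OPS_PER_INSTANCE);
  // Spread the actors evenly across those instances.
  return Math.ceil(totalActors / instances);
}
```

For example, 200 actors with a 52 ops/sec target needs 4 instances, i.e. `actorsPerInstance = 50`.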

### Actor-to-instance assignment

Sticky assignment: once an actor is assigned to an instance, it stays there
until it releases (actor sleep/destroy). Assignment uses bin-packing: pick the
instance with the most actors that still has capacity. If all instances are
full, create a new one (up to `maxInstances`).

```
acquire(actorId) -> PooledSqliteHandle
  1. If actorId already assigned, return existing handle
  2. Find instance with most actors that has capacity (< actorsPerInstance)
  3. If none found and instanceCount < maxInstances, create new instance
  4. If none found and at max, wait (queue)
  5. Assign actorId to instance, return handle

release(actorId)
  1. Remove actorId from instance's assignment set
  2. If instance has zero actors, start idle timer
  3. On idle timer expiry, destroy instance (reclaim 16.6 MB)
  4. Cancel idle timer if a new actor is assigned before expiry
```
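The acquire/release flow above can be sketched in TypeScript. This is a sketch under the spec's assumptions, not the real implementation: `Instance` stands in for a wrapped `SqliteVfs`, and the "wait (queue)" step is modeled as an error for brevity.

```typescript
// Sketch of the pool's bin-packing assignment and idle-destroy lifecycle.
interface Instance {
  id: number;
  actors: Set<string>;
  idleTimer?: ReturnType<typeof setTimeout>;
}

class SqliteVfsPool {
  #instances: Instance[] = [];
  #byActor = new Map<string, Instance>();
  #nextId = 0;
  #actorsPerInstance: number;
  #maxInstances: number;
  #idleDestroyMs: number;

  constructor(actorsPerInstance: number, maxInstances = Infinity, idleDestroyMs = 30_000) {
    this.#actorsPerInstance = actorsPerInstance;
    this.#maxInstances = maxInstances;
    this.#idleDestroyMs = idleDestroyMs;
  }

  acquire(actorId: string): number {
    // 1. Sticky: reuse an existing assignment.
    const existing = this.#byActor.get(actorId);
    if (existing) return existing.id;

    // 2. Bin-pack: fullest instance that still has capacity.
    const candidates = this.#instances.filter(
      (i) => i.actors.size < this.#actorsPerInstance,
    );
    let inst = candidates.sort((a, b) => b.actors.size - a.actors.size)[0];

    // 3. Scale up when everything is full.
    if (!inst) {
      if (this.#instances.length >= this.#maxInstances) {
        throw new Error("pool at capacity"); // real impl: queue the caller
      }
      inst = { id: this.#nextId++, actors: new Set() };
      this.#instances.push(inst);
    }

    // Reusing an empty instance cancels its pending idle-destroy.
    if (inst.idleTimer) {
      clearTimeout(inst.idleTimer);
      inst.idleTimer = undefined;
    }

    inst.actors.add(actorId);
    this.#byActor.set(actorId, inst);
    return inst.id;
  }

  release(actorId: string): void {
    const inst = this.#byActor.get(actorId);
    if (!inst) return;
    inst.actors.delete(actorId);
    this.#byActor.delete(actorId);
    // Empty instance: destroy after the grace period (reclaims ~16.6 MB).
    if (inst.actors.size === 0) {
      inst.idleTimer = setTimeout(() => {
        this.#instances = this.#instances.filter((i) => i !== inst);
      }, this.#idleDestroyMs);
    }
  }

  get instanceCount(): number {
    return this.#instances.length;
  }
}
```

Picking the fullest instance (rather than the emptiest) keeps instances packed, so idle instances empty out sooner and can be destroyed.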

### Locking mechanism

The existing `#sqliteMutex` on `SqliteVfs` already serializes SQLite operations
within one instance. This is the right level: each individual xRead/xWrite call
acquires the mutex, does its async KV operation, and releases. No change needed
to the mutex itself.

Multiple databases on the same instance share the mutex. This means if actor A
is doing an xRead (15ms), actor B on the same instance waits. This is the
intentional serialization — asyncify cannot handle concurrent suspensions on the
same WASM module.

The pool does NOT add a higher-level lock. The per-instance `#sqliteMutex`
handles all serialization. The pool only manages assignment and lifecycle.
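For illustration, a minimal async mutex of the kind the serialization above describes (the real `#sqliteMutex` implementation may differ):

```typescript
// Illustrative async mutex: callers chain onto a tail promise, so only one
// SQLite call runs at a time even when each call suspends on async KV I/O.
class AsyncMutex {
  #tail: Promise<void> = Promise.resolve();

  run<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.#tail.then(fn);
    // Keep the chain alive even if fn rejects.
    this.#tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}
```

Two actors issuing xRead on the same instance would each go through `mutex.run(...)`; the second call starts only after the first's KV round-trip completes.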

### Multiple databases per instance

Currently `SqliteSystem.registerFile()` enforces one main database file per VFS.
This constraint must be lifted to allow multiple actors' databases to coexist.

**Change**: `SqliteSystem` tracks multiple registered files in a `Map<string, KvVfsOptions>`
instead of a single `#mainFileName`. The VFS callbacks (`xRead`, `xWrite`, etc.)
already receive the file handle and look up the correct options per file.

Each actor opens its own database file (named by actorId) on the shared VFS.
Multiple databases can be open simultaneously on the same WASM instance. The
`#sqliteMutex` ensures only one SQLite call executes at a time.
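A sketch of that change, with the surrounding types stubbed (`KvVfsOptions` and the method names besides `registerFile` are simplified stand-ins; the real `SqliteSystem` carries more state):

```typescript
// Sketch: registered files tracked in a Map instead of a single main file.
// KvVfsOptions is stubbed; the real type lives in @rivetkit/sqlite-vfs.
interface KvVfsOptions {
  readKv(key: string): Promise<Uint8Array | undefined>;
  writeKv(key: string, value: Uint8Array): Promise<void>;
}

class SqliteSystem {
  // Before: #mainFileName / #mainFileOptions allowed exactly one database.
  #files = new Map<string, KvVfsOptions>();

  registerFile(fileName: string, options: KvVfsOptions): void {
    if (this.#files.has(fileName)) {
      throw new Error(`file already registered: ${fileName}`);
    }
    this.#files.set(fileName, options);
  }

  unregisterFile(fileName: string): void {
    this.#files.delete(fileName);
  }

  // VFS callbacks resolve per-file options by the name carried on the file
  // handle, so xRead/xWrite route to the correct actor's KV store.
  optionsFor(fileName: string): KvVfsOptions {
    const opts = this.#files.get(fileName);
    if (!opts) throw new Error(`unregistered file: ${fileName}`);
    return opts;
  }
}
```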

### PooledSqliteHandle

The handle returned to actors wraps a reference to the pool and its assigned
instance. It exposes the same `open()` interface as `SqliteVfs`.

```typescript
class PooledSqliteHandle {
  readonly #pool: SqliteVfsPool;
  readonly #instanceId: number;
  readonly #actorId: string;

  /** Open a database on this handle's assigned WASM instance. */
  async open(fileName: string, options: KvVfsOptions): Promise<Database> {
    const vfs = this.#pool.getInstance(this.#instanceId);
    return vfs.open(fileName, options);
  }

  /** Release this handle back to the pool. */
  async destroy(): Promise<void> {
    this.#pool.release(this.#actorId);
  }
}
```

### Integration with drivers

The `ActorDriver.createSqliteVfs()` method currently returns `new SqliteVfs()`.
With pooling:

```typescript
// Before
async createSqliteVfs(): Promise<SqliteVfs> {
  return new SqliteVfs();
}

// After
async createSqliteVfs(actorId: string): Promise<PooledSqliteHandle> {
  return this.#vfsPool.acquire(actorId);
}
```

The `PooledSqliteHandle` must satisfy the same interface that actors expect from
`SqliteVfs` (specifically the `open()` and `destroy()` methods). Either:
- `PooledSqliteHandle` implements the `SqliteVfs` interface (duck typing)
- Or extract an interface type that both implement

The actor instance code in `mod.ts` calls `this.#sqliteVfs = await driver.createSqliteVfs()`.
It then passes `this.#sqliteVfs` to the DB provider which calls `.open()`. On
cleanup it calls `.destroy()`. The pooled handle supports both.
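If the extracted-interface option is taken, it could look like the following (`SqliteVfsLike` and the stubbed `Database`/`KvVfsOptions` types are illustrative names, not from the codebase):

```typescript
// Illustrative shared interface: the subset of SqliteVfs that actor code
// actually touches. Database and KvVfsOptions are stubbed here.
type Database = unknown;
interface KvVfsOptions {}

interface SqliteVfsLike {
  open(fileName: string, options: KvVfsOptions): Promise<Database>;
  destroy(): Promise<void>;
}

// Both implementations would then declare:
//   class SqliteVfs implements SqliteVfsLike { ... }
//   class PooledSqliteHandle implements SqliteVfsLike { ... }
```

The actor instance and DB provider would depend only on `SqliteVfsLike`, so swapping the pooled handle in requires no change beyond the `actorId` parameter.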

### Scale-up and scale-down

**Scale-up**: new instance created lazily on `acquire()` when all existing
instances are at capacity. WASM module is loaded in `#ensureInitialized()` on
first `open()` call (existing lazy behavior). Cost: ~16.6 MB + WASM compile time.

**Scale-down**: when the last actor releases from an instance, start a timer
(`idleDestroyMs`). If no new actor is assigned before the timer fires, call
`sqliteVfs.destroy()` to free the WASM module. This reclaims 16.6 MB.

If an actor is assigned to an instance that is in the idle-destroy grace period,
cancel the timer and reuse the instance.

### Memory budget examples

| Actors | actorsPerInstance | Instances | WASM Memory |
|--------|-------------------|-----------|-------------|
| 50 | 50 | 1 | 17 MB |
| 200 | 50 | 4 | 66 MB |
| 500 | 50 | 10 | 166 MB |
| 200 | 25 | 8 | 133 MB |

Compare to current: 200 actors = 200 instances = 3,320 MB.
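The table's arithmetic as a quick sketch (the helper name is illustrative; instance count is `ceil(actors / actorsPerInstance)`, each instance holding the measured ~16.6 MB of WASM linear memory):

```typescript
// Reproduces the memory-budget table above.
const WASM_MB_PER_INSTANCE = 16.6;

function wasmBudgetMB(actors: number, actorsPerInstance: number): number {
  const instances = Math.ceil(actors / actorsPerInstance);
  return Math.round(instances * WASM_MB_PER_INSTANCE);
}
```

Setting `actorsPerInstance = 1` recovers the current one-instance-per-actor cost (200 actors -> 3,320 MB).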

## Changes required

### `@rivetkit/sqlite-vfs`

1. **`SqliteSystem`**: Remove single-main-file constraint. Replace
   `#mainFileName`/`#mainFileOptions` with a `Map<string, KvVfsOptions>`.
   Update `registerFile()` to insert into the map. Update VFS callbacks to look
   up options by file handle.

2. **`SqliteVfs`**: Allow multiple `open()` calls with different filenames.
   Each returns an independent `Database` handle. All share the same WASM
   module and `#sqliteMutex`.

3. **New `SqliteVfsPool`**: Manages instance lifecycle, actor assignment, and
   scale-up/scale-down. Exported from the package.

4. **New `PooledSqliteHandle`**: Returned by `pool.acquire()`. Implements the
   subset of `SqliteVfs` that actors use (`open`, `destroy`).

### `rivetkit` (drivers)

5. **`ActorDriver` interface**: `createSqliteVfs()` signature adds `actorId`
   parameter so the pool can do sticky assignment.

6. **File-system driver**: Create `SqliteVfsPool` once, call
   `pool.acquire(actorId)` in `createSqliteVfs()`.

7. **Engine driver**: Same change as file-system driver.

8. **Actor instance (`mod.ts`)**: Pass `actorId` to `driver.createSqliteVfs(actorId)`.
   No other changes needed — the handle quacks like `SqliteVfs`.

### Not changed

- Cloudflare driver (uses D1, no WASM)
- KV storage layer (unchanged)
- Drizzle integration (unchanged, still receives a `Database` from `open()`)
- `#sqliteMutex` behavior (unchanged, already serializes correctly)

## Risks

1. **Hot instance**: If one instance has 50 chatty actors, the mutex contention
   increases latency for all of them. Mitigation: monitor mutex wait time, tune
   `actorsPerInstance` down if needed.

2. **WASM memory growth**: SQLite can grow WASM linear memory via
   `memory.grow()`. If one actor causes growth, all actors on that instance pay
   the cost. In practice, SQLite's page cache is small and growth is rare.

3. **Database close ordering**: If actor A crashes without closing its DB, the
   open file handle leaks inside the VFS. The pool must track open databases
   and force-close on `release()`.