diff --git a/packages/coding-agent/docs/undercompaction.md b/packages/coding-agent/docs/undercompaction.md
new file mode 100644
index 00000000..3de29719
--- /dev/null
+++ b/packages/coding-agent/docs/undercompaction.md
@@ -0,0 +1,313 @@

# Under-Compaction Analysis

## Problem Statement

Auto-compaction triggers too late, causing context window overflows that result in failed LLM calls with `stopReason == "length"`.

## Architecture Overview

### Event Flow

```
User prompt
  │
  ▼
agent.prompt()
  │
  ▼
agentLoop() in packages/ai/src/agent/agent-loop.ts
  │
  ├─► streamAssistantResponse()
  │     │
  │     ▼
  │   LLM provider (Anthropic, OpenAI, etc.)
  │     │
  │     ▼
  │   Events: message_start → message_update* → message_end
  │     │
  │     ▼
  │   AssistantMessage with usage stats (input, output, cacheRead, cacheWrite)
  │
  ├─► If assistant has tool calls:
  │     │
  │     ▼
  │   executeToolCalls()
  │     │
  │     ├─► tool_execution_start (toolCallId, toolName, args)
  │     │
  │     ├─► tool.execute() runs (read, bash, write, edit, etc.)
  │     │
  │     ├─► tool_execution_end (toolCallId, toolName, result, isError)
  │     │
  │     └─► message_start + message_end for ToolResultMessage
  │
  └─► Loop continues until no more tool calls
        │
        ▼
      agent_end
```

### Token Usage Reporting

Token usage is ONLY available in `AssistantMessage.usage` after the LLM responds:

```typescript
// From packages/ai/src/types.ts
export interface Usage {
  input: number;      // Tokens in the request
  output: number;     // Tokens generated
  cacheRead: number;  // Cached tokens read
  cacheWrite: number; // Cached tokens written
  cost: Cost;
}
```

The `input` field represents the total context size sent to the LLM, which includes:
- System prompt
- All conversation messages
- All tool results from previous calls

### Current Compaction Check

Both TUI (`tui-renderer.ts`) and RPC (`main.ts`) modes check compaction identically:

```typescript
// In agent.subscribe() callback:
if (event.type === "message_end") {
  // ...
  if (event.message.role === "assistant") {
    await checkAutoCompaction();
  }
}

async function checkAutoCompaction() {
  // Get the last non-aborted assistant message
  const messages = agent.state.messages;
  const lastAssistant = findLastNonAbortedAssistant(messages);
  if (!lastAssistant) return;

  const contextTokens = calculateContextTokens(lastAssistant.usage);
  const contextWindow = agent.state.model.contextWindow;

  if (!shouldCompact(contextTokens, contextWindow, settings)) return;

  // Trigger compaction...
}
```

**The check happens on `message_end` for assistant messages only.**

## The Under-Compaction Problem

### Failure Scenario

```
Context window: 200,000 tokens
Reserve tokens: 16,384 (default)
Threshold:      200,000 - 16,384 = 183,616

Turn N:
  1. Assistant message received, usage shows 180,000 tokens
  2. shouldCompact(180000, 200000, settings) → 180000 > 183616 → FALSE
  3. Tool executes: `cat large-file.txt` → outputs 100KB (~25,000 tokens)
  4. Context is now effectively 205,000 tokens, but we don't know this
  5. Next LLM call fails: context exceeds the 200,000 window
```

The problem occurs when:
1. Context is below the threshold (so compaction doesn't trigger)
2. A tool adds enough content to push it over the window limit
3. We only discover this when the next LLM call fails

### Root Cause

1. **Token counts are retrospective**: We only learn the context size AFTER the LLM processes it
2. **Tool results are blind spots**: When a tool executes and returns a large result, we don't know how many tokens it adds until the next LLM call
3. **No estimation before submission**: We submit the context and hope it fits

## Current Tool Output Limits

| Tool | Our Limit | Worst Case |
|------|-----------|------------|
| bash | 10MB per stream | 20MB (~5M tokens) |
| read | 2000 lines × 2000 chars | 4MB (~1M tokens) |
| write | Byte count only | Minimal |
| edit | Diff output | Variable |

## How Other Tools Handle This

### SST/OpenCode

**Tool Output Limits (during execution):**

| Tool | Limit | Details |
|------|-------|---------|
| bash | 30,000 chars | `MAX_OUTPUT_LENGTH = 30_000`, truncates with notice |
| read | 2000 lines × 2000 chars/line | No total cap, theoretically 4MB |
| grep | 100 matches, 2000 chars/line | Truncates with notice |
| ls | 100 files | Truncates with notice |
| glob | 100 results | Truncates with notice |
| webfetch | 5MB | `MAX_RESPONSE_SIZE` |

**Overflow Detection:**
- `isOverflow()` runs BEFORE each turn (not during)
- Uses the last LLM-reported token count: `tokens.input + tokens.cache.read + tokens.output`
- Triggers if `count > context - maxOutput`
- Does NOT detect overflow from tool results in the current turn

**Recovery - Pruning:**
- `prune()` runs AFTER each turn completes
- Walks backwards through completed tool results
- Keeps the last 40k tokens of tool outputs (`PRUNE_PROTECT`)
- Removes content from older tool results (marks `time.compacted`)
- Only prunes if savings > 20k tokens (`PRUNE_MINIMUM`)
- Token estimation: `chars / 4`

**Recovery - Compaction:**
- Triggered when `isOverflow()` returns true before a turn
- LLM generates a summary of the conversation
- Replaces old messages with the summary

**Gap:** No mid-turn protection. A single read returning 4MB would overflow. The 30KB bash limit is their primary practical protection.

### OpenAI/Codex

**Tool Output Limits (during execution):**

| Tool | Limit | Details |
|------|-------|---------|
| shell/exec | 10k tokens or 10k bytes | Per-model `TruncationPolicy`, user-configurable |
| read_file | 2000 lines, 500 chars/line | `MAX_LINE_LENGTH = 500`, ~1MB max |
| grep_files | 100 matches | Default limit |
| list_dir | Configurable | BFS with depth limits |

**Truncation Policy:**
- Per-model-family setting: `TruncationPolicy::Bytes(10_000)` or `TruncationPolicy::Tokens(10_000)`
- User can override via the `tool_output_token_limit` config
- Applied to ALL tool outputs uniformly via `truncate_function_output_items_with_policy()`
- Preserves the beginning and end, removes the middle with an `"…N tokens truncated…"` marker

**Overflow Detection:**
- After each successful turn: `if total_usage_tokens >= auto_compact_token_limit { compact() }`
- Per-model thresholds (e.g., 180k for a 200k context window)
- `ContextWindowExceeded` errors are caught and handled

**Recovery - Compaction:**
- If tokens exceed the threshold after a turn, triggers `run_inline_auto_compact_task()`
- During compaction, if `ContextWindowExceeded`: removes the oldest history item and retries
- Loop: `history.remove_first_item()` until it fits
- Notifies the user: "Trimmed N older conversation item(s)"

**Recovery - Turn Error:**
- On `ContextWindowExceeded` during a normal turn: marks tokens as full, returns the error to the user
- Does NOT auto-retry the failed turn
- User must manually continue

**Gap:** Still no mid-turn protection, but the aggressive 10k-token truncation on all tool outputs prevents most issues in practice.

### Comparison

| Feature | pi-coding-agent | OpenCode | Codex |
|---------|-----------------|----------|-------|
| Bash limit | 10MB | 30KB | ~40KB (10k tokens) |
| Read limit | 2000×2000 (4MB) | 2000×2000 (4MB) | 2000×500 (1MB) |
| Truncation policy | None | Per-tool | Per-model, uniform |
| Token estimation | None | chars/4 | chars/4 |
| Pre-turn check | No | Yes (last tokens) | Yes (threshold) |
| Mid-turn check | No | No | No |
| Post-turn pruning | No | Yes (removes old tool output) | No |
| Overflow recovery | No | Compaction | Trim oldest + compact |

**Key insight:** None of these tools protect against mid-turn overflow. Their practical protection is aggressive static limits on tool output, especially for bash. OpenCode's 30KB bash limit versus our 10MB is the critical difference.

## Recommended Solution

### Phase 1: Static Limits (immediate)

Add hard limits to tool outputs matching industry practice:

```typescript
// packages/coding-agent/src/tools/limits.ts
export const MAX_TOOL_OUTPUT_CHARS = 30_000; // ~7.5k tokens, matches OpenCode bash
export const MAX_TOOL_OUTPUT_NOTICE = "\n\n...(truncated, output exceeded limit)...";
```

Apply to all tools:
- bash: 10MB → 30KB
- read: Add a 100KB total output cap
- edit: Cap diff output

### Phase 2: Post-Tool Estimation

After `tool_execution_end`, estimate the tokens the tool result added and flag compaction for the end of the turn:

```typescript
let needsCompactionAfterTurn = false;

agent.subscribe(async (event) => {
  if (event.type === "tool_execution_end") {
    const resultChars = extractTextLength(event.result);
    const estimatedTokens = Math.ceil(resultChars / 4);

    const lastUsage = getLastAssistantUsage(agent.state.messages);
    if (lastUsage) {
      const current = calculateContextTokens(lastUsage);
      const projected = current + estimatedTokens;
      const threshold = agent.state.model.contextWindow - settings.reserveTokens;
      if (projected > threshold) {
        needsCompactionAfterTurn = true;
      }
    }
  }

  if (event.type === "turn_end" && needsCompactionAfterTurn) {
    needsCompactionAfterTurn = false;
    await triggerCompaction();
  }
});
```

### Phase 3: Overflow Recovery (like Codex)

Handle `stopReason === "length"` gracefully:

```typescript
if (event.type === "message_end" && event.message.role === "assistant") {
  if (event.message.stopReason === "length") {
    // Context overflow occurred
    await triggerCompaction();
    // Optionally: retry the turn
  }
}
```

During compaction, if the compaction call itself also overflows, trim the oldest messages:

```typescript
async function compactWithRetry() {
  while (true) {
    try {
      await compact();
      break;
    } catch (e) {
      if (isContextOverflow(e) && messages.length > 1) {
        messages.shift(); // Remove the oldest message
        continue;
      }
      throw e;
    }
  }
}
```

## Summary

The under-compaction problem occurs because:
1. We only check context size after assistant messages
2. Tool results can add arbitrary amounts of content
3. We discover overflows only when the next LLM call fails

The fix requires:
1. Aggressive static limits on tool output (immediate safety net)
2. Token estimation after tool execution (proactive detection)
3. Graceful handling of overflow errors (fallback recovery)