co-mono/packages/coding-agent/docs/undercompaction.md
Mario Zechner 10a1e1ef90 docs: add under-compaction analysis
Documents context window overflow scenarios, how OpenCode and Codex
handle them, and what fixes are needed.

Related to #128
2025-12-06 14:53:05 +01:00


Under-Compaction Analysis

Problem Statement

Auto-compaction triggers too late, causing context window overflows that result in failed LLM calls with stopReason == "length".

Architecture Overview

Event Flow

User prompt
    │
    ▼
agent.prompt()
    │
    ▼
agentLoop() in packages/ai/src/agent/agent-loop.ts
    │
    ├─► streamAssistantResponse()
    │       │
    │       ▼
    │   LLM provider (Anthropic, OpenAI, etc.)
    │       │
    │       ▼
    │   Events: message_start → message_update* → message_end
    │       │
    │       ▼
    │   AssistantMessage with usage stats (input, output, cacheRead, cacheWrite)
    │
    ├─► If assistant has tool calls:
    │       │
    │       ▼
    │   executeToolCalls()
    │       │
    │       ├─► tool_execution_start (toolCallId, toolName, args)
    │       │
    │       ├─► tool.execute() runs (read, bash, write, edit, etc.)
    │       │
    │       ├─► tool_execution_end (toolCallId, toolName, result, isError)
    │       │
    │       └─► message_start + message_end for ToolResultMessage
    │
    └─► Loop continues until no more tool calls
            │
            ▼
        agent_end

Token Usage Reporting

Token usage is ONLY available in AssistantMessage.usage after the LLM responds:

// From packages/ai/src/types.ts
export interface Usage {
    input: number;      // Tokens in the request
    output: number;     // Tokens generated
    cacheRead: number;  // Cached tokens read
    cacheWrite: number; // Cached tokens written
    cost: Cost;
}

The input field represents the total context size sent to the LLM, which includes:

  • System prompt
  • All conversation messages
  • All tool results from previous calls

Current Compaction Check

Both TUI (tui-renderer.ts) and RPC (main.ts) modes check compaction identically:

// In agent.subscribe() callback:
if (event.type === "message_end") {
    // ...
    if (event.message.role === "assistant") {
        await checkAutoCompaction();
    }
}

async function checkAutoCompaction() {
    // Get last non-aborted assistant message
    const messages = agent.state.messages;
    const lastAssistant = findLastNonAbortedAssistant(messages);
    if (!lastAssistant) return;

    const contextTokens = calculateContextTokens(lastAssistant.usage);
    const contextWindow = agent.state.model.contextWindow;

    if (!shouldCompact(contextTokens, contextWindow, settings)) return;

    // Trigger compaction...
}

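The check runs only on message_end for assistant messages. Tool results appended after that point are not measured until the next assistant message reports its usage.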
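The body of shouldCompact() is not shown above. A minimal sketch of the comparison it presumably performs, using the threshold formula that appears later in this document (context window minus reserve tokens). The CompactionSettings shape and its enabled flag are assumptions made for illustration; only reserveTokens is named in the source.

```typescript
// Hypothetical settings shape; only reserveTokens appears in this document.
interface CompactionSettings {
    enabled: boolean;
    reserveTokens: number;
}

// Compact once the last reported context size exceeds the window minus the reserve.
function shouldCompact(
    contextTokens: number,
    contextWindow: number,
    settings: CompactionSettings,
): boolean {
    if (!settings.enabled) return false;
    return contextTokens > contextWindow - settings.reserveTokens;
}

const settings: CompactionSettings = { enabled: true, reserveTokens: 16_384 };
console.log(shouldCompact(180_000, 200_000, settings)); // false
console.log(shouldCompact(184_000, 200_000, settings)); // true
```

Because the input is the usage of the last assistant message, anything added to the context after that message is invisible to this check.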

The Under-Compaction Problem

Failure Scenario

Context window: 200,000 tokens
Reserve tokens: 16,384 (default)
Threshold: 200,000 - 16,384 = 183,616

Turn N:
  1. Assistant message received, usage shows 180,000 tokens
  2. shouldCompact(180000, 200000, settings) → 180000 > 183616 → FALSE
  3. Tool executes: `cat large-file.txt` → outputs 100KB (~25,000 tokens)
  4. Context now effectively 205,000 tokens, but we don't know this
  5. Next LLM call fails: context exceeds 200,000 window

The problem occurs when:

  1. Context is below threshold (so compaction doesn't trigger)
  2. A tool adds enough content to push it over the window limit
  3. We only discover this when the next LLM call fails
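The arithmetic of the failure scenario can be reproduced in a few lines. This is an illustrative sketch only: the helper names are hypothetical, and chars / 4 is the rough estimation heuristic mentioned elsewhere in this document, not an exact tokenizer.

```typescript
// Rough heuristic: about four characters per token.
function estimateTokensFromChars(chars: number): number {
    return Math.ceil(chars / 4);
}

// Context size the next request would carry: last reported usage plus
// the estimated size of tool output appended since then.
function projectContextTokens(reportedTokens: number, toolOutputChars: number): number {
    return reportedTokens + estimateTokensFromChars(toolOutputChars);
}

const contextWindow = 200_000;
const threshold = contextWindow - 16_384;                  // 183616
const projected = projectContextTokens(180_000, 100_000);  // 205000

console.log(180_000 > threshold);        // false: compaction is skipped
console.log(projected > contextWindow);  // true: the next call overflows
```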

Root Cause

  1. Token counts are retrospective: We only learn the context size AFTER the LLM processes it
  2. Tool results are blind spots: When a tool executes and returns a large result, we don't know how many tokens it adds until the next LLM call
  3. No estimation before submission: We submit the context and hope it fits

Current Tool Output Limits

| Tool  | Our Limit               | Worst Case        |
|-------|-------------------------|-------------------|
| bash  | 10MB per stream         | 20MB (~5M tokens) |
| read  | 2000 lines × 2000 chars | 4MB (~1M tokens)  |
| write | Byte count only         | Minimal           |
| edit  | Diff output             | Variable          |

How Other Tools Handle This

SST/OpenCode

Tool Output Limits (during execution):

| Tool     | Limit                        | Details                                           |
|----------|------------------------------|---------------------------------------------------|
| bash     | 30KB chars                   | MAX_OUTPUT_LENGTH = 30_000, truncates with notice |
| read     | 2000 lines × 2000 chars/line | No total cap, theoretically 4MB                   |
| grep     | 100 matches, 2000 chars/line | Truncates with notice                             |
| ls       | 100 files                    | Truncates with notice                             |
| glob     | 100 results                  | Truncates with notice                             |
| webfetch | 5MB                          | MAX_RESPONSE_SIZE                                 |

Overflow Detection:

  • isOverflow() runs BEFORE each turn (not during)
  • Uses last LLM-reported token count: tokens.input + tokens.cache.read + tokens.output
  • Triggers if count > context - maxOutput
  • Does NOT detect overflow from tool results in current turn
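The overflow test described in the bullets above can be sketched as follows. This is not OpenCode's actual code: the type names and the function name are hypothetical, and only the formula (input + cache.read + output compared against context - maxOutput) comes from the description.

```typescript
// Hypothetical shapes mirroring the field names used in the bullets above.
interface TokenCounts {
    input: number;
    output: number;
    cache: { read: number; write: number };
}

interface ModelLimits {
    context: number;   // total context window
    maxOutput: number; // tokens reserved for the model's reply
}

// True when the last reported usage no longer leaves room for a full reply.
function isOverflowSketch(tokens: TokenCounts, limits: ModelLimits): boolean {
    const count = tokens.input + tokens.cache.read + tokens.output;
    return count > limits.context - limits.maxOutput;
}

const limits: ModelLimits = { context: 200_000, maxOutput: 32_000 };
const lastTurn: TokenCounts = { input: 150_000, output: 2_000, cache: { read: 20_000, write: 0 } };
console.log(isOverflowSketch(lastTurn, limits)); // true (172000 > 168000)
```

Since the inputs are the previous turn's reported counts, tool output produced during the current turn does not influence the result.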

Recovery - Pruning:

  • prune() runs AFTER each turn completes
  • Walks backwards through completed tool results
  • Keeps last 40k tokens of tool outputs (PRUNE_PROTECT)
  • Removes content from older tool results (marks time.compacted)
  • Only prunes if savings > 20k tokens (PRUNE_MINIMUM)
  • Token estimation: chars / 4
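A rough sketch of the pruning pass described above, under stated assumptions: the ToolResult shape is hypothetical, a boolean compacted flag stands in for the time.compacted marker, and a result that straddles the protection boundary is pruned whole. OpenCode's real implementation may differ on all of these points.

```typescript
const PRUNE_PROTECT = 40_000; // most recent tool-output tokens to keep
const PRUNE_MINIMUM = 20_000; // skip pruning unless it saves more than this

// Hypothetical shape of a completed tool result.
interface ToolResult {
    id: string;
    output: string;
    compacted: boolean;
}

const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Walks from newest to oldest, protects the newest PRUNE_PROTECT tokens,
// and clears older outputs only if the total savings are worthwhile.
function pruneSketch(results: ToolResult[]): number {
    let seen = 0;
    let savings = 0;
    const candidates: ToolResult[] = [];
    for (const result of [...results].reverse()) {
        if (result.compacted) continue;
        const tokens = estimateTokens(result.output);
        seen += tokens;
        if (seen > PRUNE_PROTECT) {
            savings += tokens;
            candidates.push(result);
        }
    }
    if (savings <= PRUNE_MINIMUM) return 0;
    for (const result of candidates) {
        result.output = "";
        result.compacted = true;
    }
    return savings;
}

const history: ToolResult[] = [
    { id: "oldest", output: "x".repeat(120_000), compacted: false }, // ~30k tokens
    { id: "middle", output: "x".repeat(80_000), compacted: false },  // ~20k tokens
    { id: "newest", output: "x".repeat(100_000), compacted: false }, // ~25k tokens
];
console.log(pruneSketch(history)); // 50000
```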

Recovery - Compaction:

  • Triggered when isOverflow() returns true before a turn
  • LLM generates summary of conversation
  • Replaces old messages with summary

Gap: No mid-turn protection. A single read returning 4MB would overflow. The 30KB bash limit is their primary practical protection.

OpenAI/Codex

Tool Output Limits (during execution):

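| Tool       | Limit                      | Details                                       |
|------------|----------------------------|-----------------------------------------------|
| shell/exec | 10k tokens or 10k bytes    | Per-model TruncationPolicy, user-configurable |
| read_file  | 2000 lines, 500 chars/line | MAX_LINE_LENGTH = 500, ~1MB max               |
| grep_files | 100 matches                | Default limit                                 |
| list_dir   | Configurable               | BFS with depth limits                         |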

Truncation Policy:

  • Per-model family setting: TruncationPolicy::Bytes(10_000) or TruncationPolicy::Tokens(10_000)
  • User can override via tool_output_token_limit config
  • Applied to ALL tool outputs uniformly via truncate_function_output_items_with_policy()
  • Preserves beginning and end, removes middle with "…N tokens truncated…" marker
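The head-and-tail truncation described in the last bullet can be illustrated with a short sketch. This is not Codex's implementation (which is written in Rust and budgets in bytes or tokens); truncateMiddle is a hypothetical helper that budgets in characters and estimates removed tokens as chars / 4.

```typescript
// Keeps the first and last halves of the budget and replaces the middle
// with a marker stating roughly how many tokens were dropped.
// Note: slicing by UTF-16 code units can split a surrogate pair at the cut.
function truncateMiddle(text: string, maxChars: number): string {
    if (text.length <= maxChars) return text;
    const head = Math.ceil(maxChars / 2);
    const tail = Math.floor(maxChars / 2);
    const removedChars = text.length - head - tail;
    const removedTokens = Math.ceil(removedChars / 4);
    // slice(-0) would return the whole string, so guard the zero case.
    const tailText = tail > 0 ? text.slice(-tail) : "";
    return `${text.slice(0, head)}…${removedTokens} tokens truncated…${tailText}`;
}

console.log(truncateMiddle("a".repeat(50) + "b".repeat(50), 20));
// aaaaaaaaaa…20 tokens truncated…bbbbbbbbbb
```

Keeping both ends is useful for command output, where the invocation context tends to sit at the top and errors at the bottom.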

Overflow Detection:

  • After each successful turn: if total_usage_tokens >= auto_compact_token_limit { compact() }
  • Per-model thresholds (e.g., 180k for 200k context window)
  • ContextWindowExceeded error caught and handled

Recovery - Compaction:

  • If tokens exceed threshold after turn, triggers run_inline_auto_compact_task()
  • During compaction, if ContextWindowExceeded: removes oldest history item and retries
  • Loop: history.remove_first_item() until it fits
  • Notifies user: "Trimmed N older conversation item(s)"

Recovery - Turn Error:

  • On ContextWindowExceeded during normal turn: marks tokens as full, returns error to user
  • Does NOT auto-retry the failed turn
  • User must manually continue

Gap: Still no mid-turn protection, but aggressive 10k token truncation on all tool outputs prevents most issues in practice.

Comparison

| Feature           | pi-coding-agent | OpenCode                      | Codex                 |
|-------------------|-----------------|-------------------------------|-----------------------|
| Bash limit        | 10MB            | 30KB                          | ~40KB (10k tokens)    |
| Read limit        | 2000×2000 (4MB) | 2000×2000 (4MB)               | 2000×500 (1MB)        |
| Truncation policy | None            | Per-tool                      | Per-model, uniform    |
| Token estimation  | None            | chars/4                       | chars/4               |
| Pre-turn check    | No              | Yes (last tokens)             | Yes (threshold)       |
| Mid-turn check    | No              | No                            | No                    |
| Post-turn pruning | No              | Yes (removes old tool output) | No                    |
| Overflow recovery | No              | Compaction                    | Trim oldest + compact |

Key insight: None of these tools protect against mid-turn overflow. Their practical protection is aggressive static limits on tool output, especially bash. OpenCode's 30KB bash limit vs our 10MB is the critical difference.

Phase 1: Static Limits (immediate)

Add hard limits to tool outputs matching industry practice:

// packages/coding-agent/src/tools/limits.ts
export const MAX_TOOL_OUTPUT_CHARS = 30_000; // ~7.5k tokens, matches OpenCode bash
export const MAX_TOOL_OUTPUT_NOTICE = "\n\n...(truncated, output exceeded limit)...";

Apply to all tools:

  • bash: 10MB → 30KB
  • read: Add 100KB total output cap
  • edit: Cap diff output
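A minimal sketch of how the cap could be applied at the end of a tool's execute(). The two constants are the ones proposed above; limitToolOutput is a hypothetical helper name, and keeping only the head of the output is an assumption (for bash, keeping the tail or both ends may be preferable because errors usually appear last).

```typescript
const MAX_TOOL_OUTPUT_CHARS = 30_000; // ~7.5k tokens at chars / 4
const MAX_TOOL_OUTPUT_NOTICE = "\n\n...(truncated, output exceeded limit)...";

// Returns the output unchanged when it fits, otherwise the first maxChars
// characters followed by a notice so the model knows content is missing.
function limitToolOutput(output: string, maxChars: number = MAX_TOOL_OUTPUT_CHARS): string {
    if (output.length <= maxChars) return output;
    return output.slice(0, maxChars) + MAX_TOOL_OUTPUT_NOTICE;
}

const huge = "x".repeat(10_000_000); // simulated 10MB bash output
const limited = limitToolOutput(huge);
console.log(limited.length === MAX_TOOL_OUTPUT_CHARS + MAX_TOOL_OUTPUT_NOTICE.length); // true
```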

Phase 2: Post-Tool Estimation

After tool_execution_end, estimate and flag:

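let needsCompactionAfterTurn = false;
// Estimated tokens added by tool results since the last assistant message.
// The last usage report cannot see them, so they are summed across the turn.
let pendingToolTokens = 0;

agent.subscribe(async (event) => {
    if (event.type === "message_end" && event.message.role === "assistant") {
        // A fresh usage report already accounts for earlier tool results.
        pendingToolTokens = 0;
    }

    if (event.type === "tool_execution_end") {
        const resultChars = extractTextLength(event.result);
        pendingToolTokens += Math.ceil(resultChars / 4); // chars / 4 heuristic

        const lastUsage = getLastAssistantUsage(agent.state.messages);
        if (lastUsage) {
            const current = calculateContextTokens(lastUsage);
            const projected = current + pendingToolTokens;
            const threshold = agent.state.model.contextWindow - settings.reserveTokens;
            if (projected > threshold) {
                needsCompactionAfterTurn = true;
            }
        }
    }

    if (event.type === "turn_end" && needsCompactionAfterTurn) {
        needsCompactionAfterTurn = false;
        await triggerCompaction();
    }
});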

Phase 3: Overflow Recovery (like Codex)

Handle stopReason === "length" gracefully:

if (event.type === "message_end" && event.message.role === "assistant") {
    if (event.message.stopReason === "length") {
        // Context overflow occurred
        await triggerCompaction();
        // Optionally: retry the turn
    }
}

During compaction, if it also overflows, trim oldest messages:

async function compactWithRetry() {
    while (true) {
        try {
            await compact();
            break;
        } catch (e) {
            if (isContextOverflow(e) && messages.length > 1) {
                messages.shift(); // Remove oldest
                continue;
            }
            throw e;
        }
    }
}

Summary

The under-compaction problem occurs because:

  1. We only check context size after assistant messages
  2. Tool results can add arbitrary amounts of content
  3. We discover overflows only when the next LLM call fails

The fix requires:

  1. Aggressive static limits on tool output (immediate safety net)
  2. Token estimation after tool execution (proactive detection)
  3. Graceful handling of overflow errors (fallback recovery)