# Evaluclaude Harness - Development Plan
> **Goal**: Zero-to-evals in one command. Install into any codebase, Claude analyzes it, generates real functional tests.
---
## The Vibe
This isn't just another test generator. It's a **collaborative eval system** where Claude actually understands your codebase and asks smart questions before generating tests. The key insights:
1. **Claude generates specs, we generate code** — Claude is great at understanding intent, bad at deterministic code gen. We use its strengths.
2. **Functional tests only** — Every test must invoke actual code. No syntax checks. No "output looks good" vibes. Real assertions that catch real bugs.
3. **Conversation, not commands** — During init, Claude asks questions like "I see 3 database models. Which is the core one?" This turns a dumb generator into a thinking partner.
4. **Full observability** — Every eval (deterministic or LLM-graded) has a trace. You can click into it and see exactly what Claude was thinking.
---
## Core Principles (Non-Negotiable)
These are foundational. Don't compromise on them.
### 🌳 Tree-Sitter Introspector
Claude should **never see raw code** for structure extraction. Use tree-sitter to parse Python/TypeScript and extract:
- Function signatures
- Class hierarchies
- Import graphs
- Public APIs
Then send Claude a **structured summary**, not the actual files. This saves tokens, runs faster, and is more reliable.
### 🔄 Git-Aware Incremental Generation
- `init` command → Full codebase analysis
- `generate` command → Only analyze git diff since last run
Don't re-analyze unchanged files. Massive time/cost savings.
### 🐛 Hooks for Debugging
You WILL need to debug why Claude generated bad specs. Log every tool call:
```typescript
hooks: {
  PostToolUse: [{
    hooks: [async (input) => {
      await trace.log({ tool: input.tool_name, input: input.tool_input });
      return {};
    }]
  }]
}
```
### 💬 AskUserQuestion is Gold
During init, Claude should ask clarifying questions:
- "I see 3 database models. Which is the core domain object?"
- "This API has no tests. Should I generate CRUD tests or skip it?"
- "Found a `config.example.py`. Should tests use these values?"
- "There are 47 utility functions. Want me to prioritize the 10 most-used?"
This transforms eval generation from "spray and pray" to thoughtful, targeted tests.
### 👁️ Full Observability
Every eval run (deterministic AND LLM-graded) must produce a trace:
- What files Claude read
- What questions it asked
- What specs it generated
- The thinking behind each decision
In the UI, you click an eval and see the entire reasoning chain. No black boxes.
### 🔒 Sandbox Mode for Test Execution
Generated tests might do unexpected things. Run them in isolation:
```typescript
sandbox: {
  enabled: true,
  autoAllowBashIfSandboxed: true,
  network: { allowLocalBinding: true }
}
```
---
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────┐
│                   evaluclaude-harness CLI                     │
├──────────────────────────────────────────────────────────────┤
│     init │ generate │ run │ view │ validate │ calibrate       │
└───────────────────────────────┬──────────────────────────────┘
┌───────────────────────────────┴──────────────────────────────┐
│                Claude Agent SDK (Local Auth)                  │
│       Uses locally authenticated Claude Code instance         │
└───────────────────────────────┬──────────────────────────────┘
                  ┌─────────────┴──────────────┐
                  ▼                            ▼
       ┌─────────────────────┐      ┌─────────────────────┐
       │   Analyzer Agent    │      │    Grader Agent     │
       │  (Spec Generator)   │      │    (LLM Rubrics)    │
       └──────────┬──────────┘      └──────────┬──────────┘
                  │                            │
                  ▼                            ▼
       ┌─────────────────────┐      ┌─────────────────────┐
       │    EvalSpec JSON    │      │    Rubrics JSON     │
       │   (Deterministic)   │      │    (Subjective)     │
       └──────────┬──────────┘      └──────────┬──────────┘
                  │                            │
                  ▼                            ▼
┌──────────────────────────────────────────────────────────────┐
│                        Test Renderers                         │
│ Python (pytest) │ TypeScript (Vitest/Jest) │ Grader Scripts   │
└───────────────────────────────┬──────────────────────────────┘
┌───────────────────────────────┴──────────────────────────────┐
│                          Promptfoo                            │
│         Orchestration │ Execution │ Results │ Web UI          │
└──────────────────────────────────────────────────────────────┘
```
---
## Iteration Areas (Split Work)
The project has **6 major iteration areas** that can be developed in parallel by different agents/sessions. Each requires significant thought and refinement.
> **Note for agents**: Each area is self-contained. You can work on one without deep knowledge of the others. The interfaces between them are defined by TypeScript types (EvalSpec, RepoSummary, TraceEvent, etc.).
---
### 0. Tree-Sitter Introspector (`/src/introspector/`)
**Complexity**: Medium
**Iteration Required**: Moderate
**Priority**: 🔴 FOUNDATIONAL — Build this first
#### What It Does
Parses Python and TypeScript codebases using tree-sitter to extract structured information WITHOUT using LLM tokens. This is the foundation — Claude never sees raw code for structure extraction.
#### Key Outputs
```typescript
interface RepoSummary {
  languages: ('python' | 'typescript')[];
  root: string;

  // File inventory (no content, just structure)
  files: {
    path: string;
    lang: 'python' | 'typescript' | 'other';
    role: 'source' | 'test' | 'config' | 'docs';
    size: number;
  }[];

  // Extracted structure (from tree-sitter, NOT from reading files)
  modules: {
    path: string;
    exports: {
      name: string;
      kind: 'function' | 'class' | 'constant' | 'type';
      signature?: string;  // e.g., "(user_id: int, include_deleted: bool = False) -> User"
      docstring?: string;  // First line only
    }[];
    imports: string[];     // What this module depends on
  }[];

  // Config detection
  config: {
    python?: {
      entryPoints: string[];
      testFramework: 'pytest' | 'unittest' | 'none';
      hasTyping: boolean;
    };
    typescript?: {
      entryPoints: string[];
      testFramework: 'vitest' | 'jest' | 'none';
      hasTypes: boolean;
    };
  };

  // Git info for incremental
  git?: {
    lastAnalyzedCommit: string;
    changedSince: string[];  // Files changed since last analysis
  };
}
```
#### Tree-Sitter Integration
```typescript
import Parser from 'tree-sitter';
import Python from 'tree-sitter-python';
import TypeScript from 'tree-sitter-typescript';

const parser = new Parser();
parser.setLanguage(Python);
const tree = parser.parse(sourceCode);

// Extract function signatures
const query = new Parser.Query(Python, `
  (function_definition
    name: (identifier) @name
    parameters: (parameters) @params
    return_type: (type)? @return
  ) @func
`);
const matches = query.matches(tree.rootNode);
```
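To connect this back to `RepoSummary`, here is a minimal sketch of the mapping `summarizer.ts` could perform. It assumes the `@name`/`@params`/`@return` captures from the query above; the output shape follows the `exports` entries in the schema.
```typescript
// Sketch: map tree-sitter query matches to RepoSummary export entries.
// `matches` comes from the query above; capture names are assumptions.
const moduleExports = matches.map(({ captures }) => {
  const byName = Object.fromEntries(captures.map(c => [c.name, c.node]));
  return {
    name: byName.name.text,
    kind: 'function' as const,
    signature: byName.params.text + (byName.return ? ` -> ${byName.return.text}` : ''),
  };
});
```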
#### Git-Aware Incremental
```typescript
// On `generate` command, only re-analyze changed files
import { exec as execCb } from 'node:child_process';
import { promisify } from 'node:util';
const exec = promisify(execCb);

async function getChangedFiles(since: string): Promise<string[]> {
  const result = await exec(`git diff --name-only ${since}`);
  return result.stdout.split('\n').filter(f => isSourceFile(f));
}

// Skip unchanged modules in RepoSummary
const incrementalSummary = await introspector.analyze({
  onlyFiles: await getChangedFiles(lastCommit)
});
```
#### Iteration Focus
- Parse accuracy across different Python/TS coding styles
- Handle edge cases: decorators, async functions, generics
- Performance on large codebases (>1000 files)
- Incremental updates without full re-parse
#### Files to Create
```
src/introspector/
├── tree-sitter.ts        # Core parser wrapper
├── python-parser.ts      # Python-specific queries
├── typescript-parser.ts  # TS-specific queries
├── git-diff.ts           # Incremental change detection
└── summarizer.ts         # Combine into RepoSummary
```
---
### 1. Codebase Analyzer Prompt (`/prompts/analyzer.md`)
**Complexity**: High
**Iteration Required**: Extensive
#### What It Does
The core LLM prompt that analyzes any Python/TypeScript codebase and outputs a structured `EvalSpec` JSON. Must be language-agnostic in design but produce language-specific output.
#### Key Challenges
- Must work across diverse project structures (monorepos, microservices, simple scripts)
- Must identify *functional behaviors* not just code structure
- Must produce deterministic, unambiguous test specifications
- Must avoid proposing tests that depend on network, time, randomness, or external state
#### EvalSpec Schema (Output Target)
```typescript
interface EvalSpec {
  functional_tests: FunctionalTestSpec[];
  rubric_graders: RubricGraderSpec[];
}

interface FunctionalTestSpec {
  id: string;
  description: string;
  target: {
    runtime: 'python' | 'typescript';
    module_path: string;  // e.g., "src/utils/math.py"
    import_path: string;  // e.g., "project.utils.math"
    callable: string;     // function/method name
    kind: 'function' | 'method' | 'cli' | 'http_handler';
  };
  setup?: {
    env?: Record<string, string>;
    files?: { path: string; content: string }[];
  };
  cases: {
    id: string;
    description: string;
    inputs: unknown;
    expected: {
      return_value?: unknown;
      stdout_contains?: string[];
      stderr_contains?: string[];
      json_schema?: unknown;
      raises_exception?: string;
      requires_mocks?: boolean;
    };
  }[];
}

interface RubricGraderSpec {
  id: string;
  description: string;
  target: {
    kind: 'cli_output' | 'http_response' | 'error_message' | 'doc_page';
    location: string;
  };
  rubric: {
    dimensions: {
      name: string;
      weight: number;
      scale: {
        min: number;
        max: number;
        definitions: { score: number; description: string }[];
      };
    }[];
    overall_scoring: 'weighted_average';
  };
}
```
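For concreteness, here is a hypothetical spec the analyzer might emit. The module, function, and values are made up purely to illustrate the schema above.
```typescript
// Hypothetical example only; module, callable, and values are illustrative.
const exampleSpec: FunctionalTestSpec = {
  id: 'pricing_add_tax',
  description: 'add_tax applies the given tax rate to a price',
  target: {
    runtime: 'python',
    module_path: 'src/pricing.py',
    import_path: 'project.pricing',
    callable: 'add_tax',
    kind: 'function',
  },
  cases: [
    {
      id: 'basic_rate',
      description: '10% tax on 100.0',
      inputs: { price: 100.0, rate: 0.1 },
      expected: { return_value: 110.0 },
    },
  ],
};
```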
#### Prompt Structure (Multi-Part)
1. **System Prompt**: Role definition, constraints, JSON-only output
2. **Developer Prompt**: Schema definition, formatting rules
3. **User Prompt**: `RepoSummary` JSON + specific instructions
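One way `spec-generator.ts` might stitch these parts together (a sketch, not the final implementation): the prompt file names follow the layout below, the `{{REPO_SUMMARY}}` placeholder is an assumed convention, and the `query()` usage mirrors the grader provider example in area 5.
```typescript
// Sketch: assemble the three prompt parts for one analysis run.
import { readFile } from 'node:fs/promises';
import { query } from 'claude-agent-sdk';

async function analyze(summary: RepoSummary) {
  const [system, developer, user] = await Promise.all([
    readFile('prompts/analyzer-system.md', 'utf8'),
    readFile('prompts/analyzer-developer.md', 'utf8'),
    readFile('prompts/analyzer-user.md', 'utf8'),
  ]);
  // Assumed placeholder convention for injecting the RepoSummary JSON
  const prompt = developer + '\n\n' + user.replace('{{REPO_SUMMARY}}', JSON.stringify(summary));

  const messages = [];
  for await (const msg of query({ prompt, options: { system_prompt: system } })) {
    messages.push(msg);
  }
  return messages;
}
```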
#### Iteration Focus
- Test against diverse repos: CLI tools, web apps, libraries, ML projects
- Refine heuristics for identifying "testable" vs "non-testable" code
- Tune specificity of test case generation
- Handle edge cases: no tests exist, tests exist but are bad, etc.
#### Files to Create
```
prompts/
├── analyzer-system.md # Core identity + constraints
├── analyzer-developer.md # Schema + formatting
└── analyzer-user.md # Template for RepoSummary injection
```
---
### 2. Synchronous Claude Session with Questions (`/src/session/`)
**Complexity**: Medium-High
**Iteration Required**: Moderate
#### What It Does
Runs the Claude Agent SDK synchronously, handles `AskUserQuestion` tool calls, collects user input via the CLI, and feeds the answers back so the agent can continue.
#### Key Technical Details
**Claude Agent SDK Patterns**:
```typescript
// Using ClaudeSDKClient for multi-turn with questions
import { ClaudeSDKClient, ClaudeAgentOptions } from 'claude-agent-sdk';
const options: ClaudeAgentOptions = {
  allowed_tools: ['Read', 'Glob', 'Grep', 'AskUserQuestion'],
  permission_mode: 'acceptEdits',
  can_use_tool: async (toolName, input, context) => {
    if (toolName === 'AskUserQuestion') {
      // Display questions to user
      const answers = await promptUserForAnswers(input.questions);
      return {
        behavior: 'allow',
        updatedInput: { ...input, answers }
      };
    }
    return { behavior: 'allow', updatedInput: input };
  }
};
```
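The `promptUserForAnswers` helper referenced above is ours to write, not part of the SDK. A minimal CLI sketch using Node's readline, with an assumed question shape:
```typescript
// Hypothetical sketch of promptUserForAnswers; the question shape is assumed.
import * as readline from 'node:readline/promises';

interface PendingQuestion { question: string; options?: string[] }

async function promptUserForAnswers(questions: PendingQuestion[]): Promise<string[]> {
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  const answers: string[] = [];
  for (const q of questions) {
    if (q.options?.length) {
      q.options.forEach((opt, i) => console.log(`  ${i + 1}. ${opt}`));
      const raw = await rl.question(`${q.question} [1-${q.options.length}]: `);
      answers.push(q.options[Number(raw) - 1] ?? raw);
    } else {
      answers.push(await rl.question(`${q.question}: `));
    }
  }
  rl.close();
  return answers;
}
```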
**Two Operating Modes**:
| Mode | Behavior | Use Case |
|------|----------|----------|
| `--interactive` | Questions allowed, waits for user | Local development |
| `--non-interactive` | Questions forbidden, best-effort | CI/CD, automation |
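A sketch of how `modes.ts` might enforce the non-interactive mode: simply withhold the `AskUserQuestion` tool so Claude must proceed best-effort. Option names follow the snippet above; the exact fields are assumptions.
```typescript
// Sketch: derive SDK options per mode. Non-interactive runs never offer AskUserQuestion.
function optionsForMode(interactive: boolean): ClaudeAgentOptions {
  const baseTools = ['Read', 'Glob', 'Grep'];
  return {
    allowed_tools: interactive ? [...baseTools, 'AskUserQuestion'] : baseTools,
    permission_mode: 'acceptEdits',
  };
}
```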
#### Iteration Focus
- Clean CLI UX for question display (multi-choice, free text)
- Timeout handling (60s limit per question)
- Graceful fallback when questions are disabled
- Session persistence for resuming interrupted generation
#### Files to Create
```
src/session/
├── client.ts # ClaudeSDKClient wrapper
├── question-handler.ts # AskUserQuestion UI
├── modes.ts # Interactive vs non-interactive
└── persistence.ts # Session save/resume
```
---
### 3. Test Renderers (Deterministic Code Gen) (`/src/renderers/`)
**Complexity**: Medium
**Iteration Required**: Moderate
#### What It Does
Transforms `EvalSpec` JSON into actual runnable test files. **Claude generates specs, we generate code** — this is critical for reliability.
#### Python Renderer (pytest)
**Input**: `FunctionalTestSpec`
**Output**: `.py` file in `.evaluclaude/tests/`
```python
# Generated: .evaluclaude/tests/test_{id}.py
import importlib
import pytest
module = importlib.import_module("{import_path}")
target = getattr(module, "{callable}")
@pytest.mark.evaluclaude
def test_{id}_{case_id}():
    result = target(**{inputs})
    assert result == {expected.return_value}
```
**Features**:
- `pytest` fixtures for env vars (`monkeypatch.setenv`)
- `tmp_path` for file setup
- `capsys` for stdout/stderr assertions
- Parameterized tests for multiple cases
#### TypeScript Renderer (Vitest/Jest)
**Input**: `FunctionalTestSpec`
**Output**: `.test.ts` file in `.evaluclaude/tests/`
```typescript
// Generated: .evaluclaude/tests/{id}.test.ts
import { describe, it, expect } from 'vitest';
import { callable } from '{import_path}';
describe('{description}', () => {
  it('{case.description}', () => {
    const result = callable({inputs});
    expect(result).toEqual({expected.return_value});
  });
});
```
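The rendering step itself is deliberately plain string templating, no LLM involved. A minimal sketch of a hypothetical `renderVitestCase` helper inside `vitest-renderer.ts`:
```typescript
// Sketch: deterministically render one spec case into Vitest source.
// renderVitestCase is a hypothetical helper name, not a fixed interface.
function renderVitestCase(spec: FunctionalTestSpec, c: FunctionalTestSpec['cases'][number]): string {
  return [
    `import { describe, it, expect } from 'vitest';`,
    `import { ${spec.target.callable} } from '${spec.target.import_path}';`,
    ``,
    `describe(${JSON.stringify(spec.description)}, () => {`,
    `  it(${JSON.stringify(c.description)}, () => {`,
    `    const result = ${spec.target.callable}(${JSON.stringify(c.inputs)});`,
    `    expect(result).toEqual(${JSON.stringify(c.expected.return_value)});`,
    `  });`,
    `});`,
  ].join('\n');
}
```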
#### Iteration Focus
- Handle all `FunctionalTestSpec.target.kind` variants (function, method, CLI, HTTP)
- Proper import path resolution (relative, absolute, aliased)
- Mock scaffolding for `requires_mocks: true` cases
- Error handling for invalid specs
#### Files to Create
```
src/renderers/
├── python/
│ ├── pytest-renderer.ts
│ ├── fixtures.ts
│ └── templates/
├── typescript/
│ ├── vitest-renderer.ts
│ ├── jest-renderer.ts
│ └── templates/
└── common/
├── spec-validator.ts
└── path-resolver.ts
```
---
### 4. Functional Test Execution & Grading (`/src/runners/`)
**Complexity**: Medium-High
**Iteration Required**: Moderate
#### What It Does
Executes generated tests and produces structured results for Promptfoo. **Tests must be functional, never just syntax checks.**
#### Pytest Execution
```bash
pytest .evaluclaude/tests/ \
  --json-report \
  --json-report-file=.evaluclaude/results/pytest.json
```
**Key Packages & APIs**:
- `pytest-json-report`: Structured JSON output with per-test pass/fail
- `pytest.main()`: Programmatic invocation from Python
**JSON Report Structure**:
```json
{
  "summary": { "passed": 5, "failed": 1, "total": 6 },
  "tests": [
    {
      "nodeid": "test_auth.py::test_login_success",
      "outcome": "passed",
      "duration": 0.023
    }
  ]
}
```
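A sketch of how `result-parser.ts` might normalize that report into the shape the grader adapter consumes (file path and field names follow the example above; the `TestResults` type is our own):
```typescript
// Sketch: read the pytest JSON report and normalize it for the grader adapter.
import { readFile } from 'node:fs/promises';

interface TestResults {
  passed: number; failed: number; total: number;
  tests: { nodeid: string; outcome: string }[];
}

async function parsePytestReport(path = '.evaluclaude/results/pytest.json'): Promise<TestResults> {
  const report = JSON.parse(await readFile(path, 'utf8'));
  return {
    passed: report.summary.passed ?? 0,
    failed: report.summary.failed ?? 0,
    total: report.summary.total ?? 0,
    tests: report.tests.map((t: any) => ({ nodeid: t.nodeid, outcome: t.outcome })),
  };
}
```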
#### Vitest/Jest Execution
```bash
vitest run .evaluclaude/tests/ --reporter=json --outputFile=.evaluclaude/results/vitest.json
```
**Programmatic (Vitest)**:
```typescript
import { startVitest } from 'vitest/node';
const vitest = await startVitest('test', ['.evaluclaude/tests/']);
const results = vitest.state.getFiles();
```
#### Grader Interface for Promptfoo
```typescript
// graders/deterministic/run-tests.ts
export async function getAssert(output: string, context: AssertContext): Promise<GradingResult> {
  const testResults = await runTests(context.vars.runtime);
  return {
    pass: testResults.failed === 0,
    score: testResults.passed / testResults.total,
    reason: testResults.failed > 0
      ? `${testResults.failed} tests failed`
      : 'All tests passed',
    componentResults: testResults.tests.map(t => ({
      pass: t.outcome === 'passed',
      score: t.outcome === 'passed' ? 1 : 0,
      reason: t.outcome,
      assertion: t.nodeid
    }))
  };
}
```
#### Iteration Focus
- Isolated test environments (clean state per run)
- Timeout handling for long-running tests
- Parallel execution where safe
- Rich error reporting with stack traces
#### Files to Create
```
src/runners/
├── pytest-runner.ts
├── vitest-runner.ts
├── result-parser.ts
└── grader-adapter.ts # Promptfoo GradingResult format
graders/
├── deterministic/
│ ├── run-tests.py # Python grader entry
│ └── run-tests.ts # TypeScript grader entry
└── templates/
└── grader-wrapper.ts
```
---
### 5. LLM Rubric Graders (`/src/graders/llm/`)
**Complexity**: Medium
**Iteration Required**: High (calibration)
#### What It Does
Uses Claude to grade subjective qualities (code clarity, error message helpfulness, documentation completeness) via structured rubrics.
#### Custom Promptfoo Provider
```typescript
// providers/evaluclaude-grader.ts
import { query, ClaudeAgentOptions } from 'claude-agent-sdk';
export async function call(
  input: string,  // Candidate output to grade
  context: { vars?: Record<string, any>; config?: { rubricId: string } }
) {
  const rubric = loadRubric(context.config.rubricId);
  const systemPrompt = `You are a deterministic grader.
Use the rubric to assign scores. Output JSON only.
Do not use randomness. If unsure, choose the lower score.`;

  const messages = [];
  for await (const msg of query({
    prompt: JSON.stringify({ rubric, candidateOutput: input }),
    options: {
      system_prompt: systemPrompt,
      allowed_tools: [],  // No tools needed for grading
    }
  })) {
    messages.push(msg);
  }
  return { output: extractGradeFromMessages(messages) };
}
```
#### Rubric Structure
```yaml
# rubrics/code-quality.yaml
id: code_quality
description: Evaluates code changes for quality
dimensions:
  - name: correctness
    weight: 0.4
    scale:
      min: 1
      max: 5
      definitions:
        - score: 5
          description: "Code is completely correct, handles all edge cases"
        - score: 3
          description: "Code works for common cases, misses some edge cases"
        - score: 1
          description: "Code has significant bugs or doesn't work"
  - name: clarity
    weight: 0.3
    scale:
      min: 1
      max: 5
      definitions:
        - score: 5
          description: "Code is self-documenting, easy to understand"
        - score: 3
          description: "Code is understandable with some effort"
        - score: 1
          description: "Code is confusing or poorly organized"
  - name: efficiency
    weight: 0.3
    scale:
      min: 1
      max: 5
      definitions:
        - score: 5
          description: "Optimal algorithm and implementation"
        - score: 3
          description: "Reasonable performance, not optimal"
        - score: 1
          description: "Significant performance issues"
overall_scoring: weighted_average
```
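The `weighted_average` aggregation itself is simple. A sketch of what `grade-parser.ts` might compute once the grader has returned per-dimension scores (the `Dimension` type and score map are assumed shapes):
```typescript
// Sketch: combine 1-5 dimension scores into a single weighted score.
interface Dimension { name: string; weight: number }

function weightedAverage(dimensions: Dimension[], scores: Record<string, number>): number {
  const totalWeight = dimensions.reduce((sum, d) => sum + d.weight, 0);
  const weighted = dimensions.reduce((sum, d) => sum + d.weight * (scores[d.name] ?? 0), 0);
  return weighted / totalWeight;
}
```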
#### Iteration Focus
- Calibration against human judgments
- Consistency across runs (temperature=0, fixed seeds if available)
- Rubric design for different eval types
- Version control for rubrics (drift detection)
#### Files to Create
```
src/graders/
├── llm/
│ ├── provider.ts # Promptfoo custom provider
│ ├── rubric-loader.ts
│ └── grade-parser.ts
└── calibration/
├── benchmark-cases/ # Known good/bad examples
└── calibrator.ts # Compare LLM grades to human
rubrics/
├── code-quality.yaml
├── error-messages.yaml
├── documentation.yaml
└── api-design.yaml
```
---
### 6. Observability & Tracing (`/src/observability/`)
**Complexity**: Medium
**Iteration Required**: Moderate
**Priority**: 🟡 IMPORTANT — Debuggability depends on this
#### What It Does
Captures a complete trace of every eval run so you can click into any result and see exactly what happened. No black boxes.
#### What Gets Traced
Every eval (deterministic OR LLM-graded) produces a trace:
```typescript
interface EvalTrace {
  id: string;
  timestamp: Date;
  evalId: string;  // Links to Promptfoo test case

  // What the introspector found
  introspection: {
    filesScanned: number;
    modulesExtracted: number;
    duration: number;
  };

  // Claude's analysis session
  analysis: {
    // Every tool Claude called
    toolCalls: {
      tool: string;
      input: unknown;
      output: unknown;
      duration: number;
    }[];
    // Questions asked (if interactive)
    questions: {
      question: string;
      options: string[];
      userAnswer: string;
    }[];
    // Claude's reasoning (from thinking blocks if available)
    reasoning?: string;
    // Final spec generated
    specsGenerated: string[];  // IDs of FunctionalTestSpecs
  };

  // Test execution
  execution: {
    testsRun: number;
    passed: number;
    failed: number;
    sandboxed: boolean;
    duration: number;
  };

  // For LLM-graded evals
  grading?: {
    rubricUsed: string;
    dimensionScores: Record<string, number>;
    finalScore: number;
    reasoning: string;
  };
}
```
#### Hook-Based Collection
```typescript
// Automatically collect traces via SDK hooks
const traceCollector = new TraceCollector();

const options: ClaudeAgentOptions = {
  hooks: {
    PreToolUse: [{
      hooks: [async (input, toolUseId) => {
        traceCollector.startToolCall(toolUseId, input.tool_name, input.tool_input);
        return {};
      }]
    }],
    PostToolUse: [{
      hooks: [async (input, toolUseId) => {
        traceCollector.endToolCall(toolUseId, input.tool_response);
        return {};
      }]
    }],
    UserPromptSubmit: [{
      hooks: [async (input) => {
        traceCollector.recordPrompt(input.prompt);
        return {};
      }]
    }]
  }
};
```
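`TraceCollector` is ours to write. A minimal sketch of the shape it might take, persisting partial traces to `.evaluclaude/traces/` as described below; method names match the hook usage above, everything else is an assumption:
```typescript
// Sketch: accumulate tool calls keyed by toolUseId and persist a partial
// trace JSON per eval (not the full EvalTrace; fields here are illustrative).
import { writeFile, mkdir } from 'node:fs/promises';

class TraceCollector {
  private toolCalls = new Map<string, { tool: string; input: unknown; output?: unknown; start: number; duration?: number }>();
  private prompts: string[] = [];

  startToolCall(id: string, tool: string, input: unknown) {
    this.toolCalls.set(id, { tool, input, start: Date.now() });
  }

  endToolCall(id: string, output: unknown) {
    const call = this.toolCalls.get(id);
    if (call) {
      call.output = output;
      call.duration = Date.now() - call.start;
    }
  }

  recordPrompt(prompt: string) {
    this.prompts.push(prompt);
  }

  async save(evalId: string) {
    await mkdir('.evaluclaude/traces', { recursive: true });
    const trace = { evalId, prompts: this.prompts, toolCalls: [...this.toolCalls.values()] };
    await writeFile(`.evaluclaude/traces/${evalId}.json`, JSON.stringify(trace, null, 2));
  }
}
```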
#### UI Integration
Traces are stored as JSON and surfaced in the Promptfoo UI:
```yaml
# In generated promptfooconfig.yaml
defaultTest:
  metadata:
    traceFile: .evaluclaude/traces/{{evalId}}.json
```
When you click an eval in Promptfoo's web UI, you see:
1. **Overview**: Pass/fail, duration, cost
2. **Introspection**: What files were analyzed
3. **Claude's Journey**: Every tool call, every question asked
4. **Reasoning**: Why Claude made the decisions it did
5. **Execution**: Which tests ran, which failed
#### Iteration Focus
- Efficient storage (traces can get large)
- Clean UI formatting (collapsible sections, syntax highlighting)
- Linking traces to specific test failures
- Diff view for comparing traces between runs
#### Files to Create
```
src/observability/
├── tracer.ts # Hook-based collection
├── trace-store.ts # Persist to .evaluclaude/traces/
├── trace-viewer.ts # Format for display
└── types.ts # EvalTrace interface
templates/
└── trace-ui/ # Custom Promptfoo view components
```
---
## Technology Reference
### Claude Agent SDK
| Feature | Usage |
|---------|-------|
| `query()` | One-off tasks, stateless |
| `ClaudeSDKClient` | Multi-turn, sessions, questions |
| `AskUserQuestion` | Clarifying questions during generation |
| `can_use_tool` | Permission callback for questions |
| Local Auth | Uses `claude` CLI authentication |
**Key Flags**:
- `permission_mode: 'acceptEdits'` — Auto-approve file changes
- `allowed_tools: [...]` — Restrict tool access
- `setting_sources: ['project']` — Load CLAUDE.md
### Promptfoo
| Feature | Usage |
|---------|-------|
| Python Provider | `file://providers/agent.py` |
| Python Assertions | `file://graders/check.py` |
| LLM Rubrics | `llm-rubric:` assertion type |
| Custom Provider | For Claude Agent SDK integration |
| JSON Reports | `promptfoo eval -o results.json` |
**Python Grader Return Types**:
```python
# Boolean
return True   # pass
return False  # fail

# Score (0-1)
return 0.85

# GradingResult
return {
    'pass': True,
    'score': 0.85,
    'reason': 'All checks passed',
    'componentResults': [...]
}
```
### Test Runners
| Runner | JSON Output | Programmatic API |
|--------|-------------|------------------|
| pytest | `pytest-json-report` | `pytest.main([...])` |
| Vitest | `--reporter=json` | `startVitest('test', [...])` |
| Jest | `jest-ctrf-json-reporter` | `runCLI({...})` |
---
## Directory Structure (Final)
```
evaluclaude-harness/
├── src/
│ ├── cli/ # Commander.js CLI
│ │ ├── index.ts
│ │ ├── commands/
│ │ │ ├── init.ts # Full analysis + questions
│ │ │ ├── generate.ts # Incremental (git diff only)
│ │ │ ├── run.ts # Execute evals
│ │ │ └── view.ts # Open Promptfoo UI
│ │ └── utils/
│ │
│ ├── introspector/ # 🌳 NON-LLM codebase parsing
│ │ ├── tree-sitter.ts # Multi-language AST parsing
│ │ ├── python-parser.ts # Python-specific extraction
│ │ ├── typescript-parser.ts # TS-specific extraction
│ │ ├── git-diff.ts # 🔄 Incremental change detection
│ │ └── summarizer.ts # RepoSummary generation
│ │
│ ├── session/ # Claude SDK wrapper
│ │ ├── client.ts # ClaudeSDKClient wrapper
│ │ ├── question-handler.ts # 💬 AskUserQuestion UI
│ │ ├── modes.ts # Interactive vs non-interactive
│ │ └── persistence.ts # Session save/resume
│ │
│ ├── observability/ # 👁️ Full tracing
│ │ ├── tracer.ts # Hook-based logging
│ │ ├── trace-store.ts # Persist traces per eval
│ │ └── trace-viewer.ts # Format for UI display
│ │
│ ├── analyzer/ # LLM-based analysis
│ │ ├── spec-generator.ts # RepoSummary → EvalSpec
│ │ └── validator.ts # Validate generated specs
│ │
│ ├── renderers/ # Spec → Test code (deterministic)
│ │ ├── python/
│ │ │ ├── pytest-renderer.ts
│ │ │ └── fixtures.ts
│ │ └── typescript/
│ │ ├── vitest-renderer.ts
│ │ └── jest-renderer.ts
│ │
│ ├── runners/ # 🔒 Sandboxed test execution
│ │ ├── sandbox.ts # Isolation wrapper
│ │ ├── pytest-runner.ts
│ │ ├── vitest-runner.ts
│ │ └── result-parser.ts
│ │
│ └── graders/ # LLM grading
│ ├── llm/
│ │ ├── provider.ts # Promptfoo custom provider
│ │ └── rubric-loader.ts
│ ├── deterministic/
│ │ └── test-grader.ts
│ └── calibration/
│ └── calibrator.ts
├── prompts/ # LLM prompts (iterable)
│ ├── analyzer-system.md # Core identity + constraints
│ ├── analyzer-developer.md # Schema + formatting
│ └── analyzer-user.md # Template for RepoSummary
├── rubrics/ # Grading rubrics
│ ├── code-quality.yaml
│ ├── error-messages.yaml
│ └── documentation.yaml
├── templates/ # Generated file templates
│ ├── promptfooconfig.yaml
│ └── ...
├── tests/ # Our own tests
├── package.json
├── tsconfig.json
└── README.md
```
---
## Development Phases
### Phase 1: Foundation (Days 1-2)
- [ ] CLI scaffold with Commander.js
- [ ] 🌳 **Tree-sitter introspector** — This is foundational, do it first
- [ ] RepoSummary type definitions
- [ ] Basic Claude SDK session wrapper
### Phase 2: Analysis (Days 2-4)
- [ ] Analyzer prompt v1 (system + developer + user)
- [ ] EvalSpec schema + validation
- [ ] 💬 **AskUserQuestion flow** — Interactive mode with CLI prompts
- [ ] Non-interactive fallback mode
### Phase 3: Observability (Days 3-4)
- [ ] 👁️ **Hook-based tracing** — Capture every tool call
- [ ] Trace storage (.evaluclaude/traces/)
- [ ] Basic trace viewer formatting
### Phase 4: Renderers (Days 4-5)
- [ ] Python/pytest renderer
- [ ] TypeScript/Vitest renderer
- [ ] Spec validation before rendering
- [ ] 🔄 **Git-aware incremental** — Only regenerate for changed files
### Phase 5: Execution (Days 5-6)
- [ ] 🔒 **Sandbox mode** — Isolated test execution
- [ ] Test runners (pytest, Vitest)
- [ ] Result parsing and aggregation
- [ ] Promptfoo integration
### Phase 6: Grading (Days 6-7)
- [ ] LLM grader provider
- [ ] Rubric system
- [ ] Calibration tooling
### Phase 7: Polish (Day 7+)
- [ ] Error handling and recovery
- [ ] Trace UI improvements
- [ ] Documentation
- [ ] Example repos for testing
---
## Key Design Decisions
1. **Claude generates specs, not code**: Test code is deterministically rendered from specs. This ensures reliability and maintainability.
2. **Functional tests only**: Every test must invoke actual code. No syntax checks, no format validation, no "output looks good" assertions.
3. **Language-agnostic schema**: One analyzer prompt, multiple renderers. Adding new languages means adding renderers, not prompts.
4. **Two-mode operation**: Interactive for development (questions allowed), non-interactive for CI (best-effort, no blocking).
5. **Promptfoo as orchestrator**: We do the heavy lifting; Promptfoo handles parallelism, caching, and UI.
6. **🌳 Tree-sitter over token burn**: Never send raw code to Claude for structure extraction. Parse locally, send summaries.
7. **🔄 Incremental by default**: `generate` only re-analyzes git diff. Full analysis is opt-in via `init --full`.
8. **👁️ No black boxes**: Every eval has a trace. You can always see what Claude did and why.
9. **🔒 Sandbox execution**: Generated tests run in isolation. Assume they might do anything.
10. **💬 Conversation > commands**: Claude asks clarifying questions. This isn't a fire-and-forget generator.
---
## Success Criteria
- [ ] `npx evaluclaude-harness init` works on a fresh Python/TS repo
- [ ] Generated tests actually run and catch real bugs
- [ ] LLM graders correlate with human judgment (>80% agreement)
- [ ] Full pipeline runs in <5 minutes for medium repo
- [ ] Zero manual config required for basic usage
- [ ] 🌳 Introspector handles 1000+ file repos in <10 seconds
- [ ] 🔄 Incremental `generate` is 10x faster than full `init`
- [ ] 👁️ Every eval result is traceable to Claude's decisions
- [ ] 💬 Claude asks at least 2-3 clarifying questions on complex repos
- [ ] 🔒 No test can escape sandbox to affect host system