# Evaluclaude Harness - Development Plan > **Goal**: Zero-to-evals in one command. Install into any codebase, Claude analyzes it, generates real functional tests. --- ## The Vibe This isn't just another test generator. It's a **collaborative eval system** where Claude actually understands your codebase and asks smart questions before generating tests. The key insights: 1. **Claude generates specs, we generate code** — Claude is great at understanding intent, bad at deterministic code gen. We use its strengths. 2. **Functional tests only** — Every test must invoke actual code. No syntax checks. No "output looks good" vibes. Real assertions that catch real bugs. 3. **Conversation, not commands** — During init, Claude asks questions like "I see 3 database models. Which is the core one?" This turns a dumb generator into a thinking partner. 4. **Full observability** — Every eval (deterministic or LLM-graded) has a trace. You can click into it and see exactly what Claude was thinking. --- ## Core Principles (Non-Negotiable) These are foundational. Don't compromise on them. ### 🌳 Tree-Sitter Introspector Claude should **never see raw code** for structure extraction. Use tree-sitter to parse Python/TypeScript and extract: - Function signatures - Class hierarchies - Import graphs - Public APIs Then send Claude a **structured summary**, not the actual files. This saves tokens, is faster, and more reliable. ### 🔄 Git-Aware Incremental Generation - `init` command → Full codebase analysis - `generate` command → Only analyze git diff since last run Don't re-analyze unchanged files. Massive time/cost savings. ### 🐛 Hooks for Debugging You WILL need to debug why Claude generated bad specs. Log every tool call: ```typescript hooks: { PostToolUse: [{ hooks: [async (input) => { await trace.log({ tool: input.tool_name, input: input.tool_input }); return {}; }] }] } ``` ### 💬 AskUserQuestion is Gold During init, Claude should ask clarifying questions: - "I see 3 database models. Which is the core domain object?" - "This API has no tests. Should I generate CRUD tests or skip it?" - "Found a `config.example.py`. Should tests use these values?" - "There are 47 utility functions. Want me to prioritize the 10 most-used?" This transforms eval generation from "spray and pray" to thoughtful, targeted tests. ### 👁️ Full Observability Every eval run (deterministic AND LLM-graded) must produce a trace: - What files Claude read - What questions it asked - What specs it generated - The thinking behind each decision In the UI, you click an eval and see the entire reasoning chain. No black boxes. ### 🔒 Sandbox Mode for Test Execution Generated tests might do unexpected things. Run them in isolation: ```typescript sandbox: { enabled: true, autoAllowBashIfSandboxed: true, network: { allowLocalBinding: true } } ``` --- ## Architecture Overview ``` ┌──────────────────────────────────────────────────────────────────────────┐ │ evaluclaude-harness CLI │ ├──────────────────────────────────────────────────────────────────────────┤ │ init │ generate │ run │ view │ validate │ calibrate │ └────────────────────────────┬─────────────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────────────────┐ │ Claude Agent SDK (Local Auth) │ │ Uses locally authenticated Claude Code instance │ └────────────────────────────┬─────────────────────────────────────────────┘ │ ┌────────────────────┴────────────────────┐ ▼ ▼ ┌───────────────────┐ ┌───────────────────┐ │ Analyzer Agent │ │ Grader Agent │ │ (Spec Generator) │ │ (LLM Rubrics) │ └─────────┬─────────┘ └─────────┬─────────┘ │ │ ▼ ▼ ┌───────────────────┐ ┌───────────────────┐ │ EvalSpec JSON │ │ Rubrics JSON │ │ (Deterministic) │ │ (Subjective) │ └─────────┬─────────┘ └─────────┬─────────┘ │ │ ▼ ▼ ┌──────────────────────────────────────────────────────────────────────────┐ │ Test Renderers │ │ Python (pytest) │ TypeScript (Vitest/Jest) │ Grader Scripts │ └────────────────────────────┬─────────────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────────────────┐ │ Promptfoo │ │ Orchestration │ Execution │ Results │ Web UI │ └──────────────────────────────────────────────────────────────────────────┘ ``` --- ## Iteration Areas (Split Work) The project has **6 major iteration areas** that can be developed in parallel by different agents/sessions. Each requires significant thought and refinement. > **Note for agents**: Each area is self-contained. You can work on one without deep knowledge of the others. The interfaces between them are defined by TypeScript types (EvalSpec, RepoSummary, TraceEvent, etc.). --- ### 0. Tree-Sitter Introspector (`/src/introspector/`) **Complexity**: Medium **Iteration Required**: Moderate **Priority**: 🔴 FOUNDATIONAL — Build this first #### What It Does Parses Python and TypeScript codebases using tree-sitter to extract structured information WITHOUT using LLM tokens. This is the foundation — Claude never sees raw code for structure extraction. #### Key Outputs ```typescript interface RepoSummary { languages: ('python' | 'typescript')[]; root: string; // File inventory (no content, just structure) files: { path: string; lang: 'python' | 'typescript' | 'other'; role: 'source' | 'test' | 'config' | 'docs'; size: number; }[]; // Extracted structure (from tree-sitter, NOT from reading files) modules: { path: string; exports: { name: string; kind: 'function' | 'class' | 'constant' | 'type'; signature?: string; // e.g., "(user_id: int, include_deleted: bool = False) -> User" docstring?: string; // First line only }[]; imports: string[]; // What this module depends on }[]; // Config detection config: { python?: { entryPoints: string[]; testFramework: 'pytest' | 'unittest' | 'none'; hasTyping: boolean; }; typescript?: { entryPoints: string[]; testFramework: 'vitest' | 'jest' | 'none'; hasTypes: boolean; }; }; // Git info for incremental git?: { lastAnalyzedCommit: string; changedSince: string[]; // Files changed since last analysis }; } ``` #### Tree-Sitter Integration ```typescript import Parser from 'tree-sitter'; import Python from 'tree-sitter-python'; import TypeScript from 'tree-sitter-typescript'; const parser = new Parser(); parser.setLanguage(Python); const tree = parser.parse(sourceCode); // Extract function signatures const query = new Parser.Query(Python, ` (function_definition name: (identifier) @name parameters: (parameters) @params return_type: (type)? @return ) @func `); const matches = query.matches(tree.rootNode); ``` #### Git-Aware Incremental ```typescript // On `generate` command, only re-analyze changed files async function getChangedFiles(since: string): Promise { const result = await exec(`git diff --name-only ${since}`); return result.stdout.split('\n').filter(f => isSourceFile(f)); } // Skip unchanged modules in RepoSummary const incrementalSummary = await introspector.analyze({ onlyFiles: await getChangedFiles(lastCommit) }); ``` #### Iteration Focus - Parse accuracy across different Python/TS coding styles - Handle edge cases: decorators, async functions, generics - Performance on large codebases (>1000 files) - Incremental updates without full re-parse #### Files to Create ``` src/introspector/ ├── tree-sitter.ts # Core parser wrapper ├── python-parser.ts # Python-specific queries ├── typescript-parser.ts# TS-specific queries ├── git-diff.ts # Incremental change detection └── summarizer.ts # Combine into RepoSummary ``` --- ### 1. Codebase Analyzer Prompt (`/prompts/analyzer.md`) **Complexity**: High **Iteration Required**: Extensive #### What It Does The core LLM prompt that analyzes any Python/TypeScript codebase and outputs a structured `EvalSpec` JSON. Must be language-agnostic in design but produce language-specific output. #### Key Challenges - Must work across diverse project structures (monorepos, microservices, simple scripts) - Must identify *functional behaviors* not just code structure - Must produce deterministic, unambiguous test specifications - Must avoid proposing tests that depend on network, time, randomness, or external state #### EvalSpec Schema (Output Target) ```typescript interface EvalSpec { functional_tests: FunctionalTestSpec[]; rubric_graders: RubricGraderSpec[]; } interface FunctionalTestSpec { id: string; description: string; target: { runtime: 'python' | 'typescript'; module_path: string; // e.g., "src/utils/math.py" import_path: string; // e.g., "project.utils.math" callable: string; // function/method name kind: 'function' | 'method' | 'cli' | 'http_handler'; }; setup?: { env?: Record; files?: { path: string; content: string }[]; }; cases: { id: string; description: string; inputs: unknown; expected: { return_value?: unknown; stdout_contains?: string[]; stderr_contains?: string[]; json_schema?: unknown; raises_exception?: string; requires_mocks?: boolean; }; }[]; } interface RubricGraderSpec { id: string; description: string; target: { kind: 'cli_output' | 'http_response' | 'error_message' | 'doc_page'; location: string; }; rubric: { dimensions: { name: string; weight: number; scale: { min: number; max: number; definitions: { score: number; description: string }[]; }; }[]; overall_scoring: 'weighted_average'; }; } ``` #### Prompt Structure (Multi-Part) 1. **System Prompt**: Role definition, constraints, JSON-only output 2. **Developer Prompt**: Schema definition, formatting rules 3. **User Prompt**: `RepoSummary` JSON + specific instructions #### Iteration Focus - Test against diverse repos: CLI tools, web apps, libraries, ML projects - Refine heuristics for identifying "testable" vs "non-testable" code - Tune specificity of test case generation - Handle edge cases: no tests exist, tests exist but are bad, etc. #### Files to Create ``` prompts/ ├── analyzer-system.md # Core identity + constraints ├── analyzer-developer.md # Schema + formatting └── analyzer-user.md # Template for RepoSummary injection ``` --- ### 2. Synchronous Claude Session with Questions (`/src/session/`) **Complexity**: Medium-High **Iteration Required**: Moderate #### What It Does Runs Claude Agent SDK synchronously, handles `AskUserQuestion` tool calls, collects user input via CLI, and returns answers to continue the agent. #### Key Technical Details **Claude Agent SDK Patterns**: ```typescript // Using ClaudeSDKClient for multi-turn with questions import { ClaudeSDKClient, ClaudeAgentOptions } from 'claude-agent-sdk'; const options: ClaudeAgentOptions = { allowed_tools: ['Read', 'Glob', 'Grep', 'AskUserQuestion'], permission_mode: 'acceptEdits', can_use_tool: async (toolName, input, context) => { if (toolName === 'AskUserQuestion') { // Display questions to user const answers = await promptUserForAnswers(input.questions); return { behavior: 'allow', updatedInput: { ...input, answers } }; } return { behavior: 'allow', updatedInput: input }; } }; ``` **Two Operating Modes**: | Mode | Behavior | Use Case | |------|----------|----------| | `--interactive` | Questions allowed, waits for user | Local development | | `--non-interactive` | Questions forbidden, best-effort | CI/CD, automation | #### Iteration Focus - Clean CLI UX for question display (multi-choice, free text) - Timeout handling (60s limit per question) - Graceful fallback when questions are disabled - Session persistence for resuming interrupted generation #### Files to Create ``` src/session/ ├── client.ts # ClaudeSDKClient wrapper ├── question-handler.ts # AskUserQuestion UI ├── modes.ts # Interactive vs non-interactive └── persistence.ts # Session save/resume ``` --- ### 3. Test Renderers (Deterministic Code Gen) (`/src/renderers/`) **Complexity**: Medium **Iteration Required**: Moderate #### What It Does Transforms `EvalSpec` JSON into actual runnable test files. **Claude generates specs, we generate code** — this is critical for reliability. #### Python Renderer (pytest) **Input**: `FunctionalTestSpec` **Output**: `.py` file in `.evaluclaude/tests/` ```python # Generated: .evaluclaude/tests/test_{id}.py import importlib import pytest module = importlib.import_module("{import_path}") target = getattr(module, "{callable}") @pytest.mark.evaluclaude def test_{id}_{case_id}(): result = target(**{inputs}) assert result == {expected.return_value} ``` **Features**: - `pytest` fixtures for env vars (`monkeypatch.setenv`) - `tmp_path` for file setup - `capsys` for stdout/stderr assertions - Parameterized tests for multiple cases #### TypeScript Renderer (Vitest/Jest) **Input**: `FunctionalTestSpec` **Output**: `.test.ts` file in `.evaluclaude/tests/` ```typescript // Generated: .evaluclaude/tests/{id}.test.ts import { describe, it, expect } from 'vitest'; import { callable } from '{import_path}'; describe('{description}', () => { it('{case.description}', () => { const result = callable({inputs}); expect(result).toEqual({expected.return_value}); }); }); ``` #### Iteration Focus - Handle all `FunctionalTestSpec.target.kind` variants (function, method, CLI, HTTP) - Proper import path resolution (relative, absolute, aliased) - Mock scaffolding for `requires_mocks: true` cases - Error handling for invalid specs #### Files to Create ``` src/renderers/ ├── python/ │ ├── pytest-renderer.ts │ ├── fixtures.ts │ └── templates/ ├── typescript/ │ ├── vitest-renderer.ts │ ├── jest-renderer.ts │ └── templates/ └── common/ ├── spec-validator.ts └── path-resolver.ts ``` --- ### 4. Functional Test Execution & Grading (`/src/runners/`) **Complexity**: Medium-High **Iteration Required**: Moderate #### What It Does Executes generated tests and produces structured results for Promptfoo. **Tests must be functional, never just syntax checks.** #### Pytest Execution ```bash pytest .evaluclaude/tests/ \ --json-report \ --json-report-file=.evaluclaude/results/pytest.json ``` **Key Packages**: - `pytest-json-report`: Structured JSON output with per-test pass/fail - `pytest.main()`: Programmatic invocation from Python **JSON Report Structure**: ```json { "summary": { "passed": 5, "failed": 1, "total": 6 }, "tests": [ { "nodeid": "test_auth.py::test_login_success", "outcome": "passed", "duration": 0.023 } ] } ``` #### Vitest/Jest Execution ```bash vitest run .evaluclaude/tests/ --reporter=json --outputFile=.evaluclaude/results/vitest.json ``` **Programmatic (Vitest)**: ```typescript import { startVitest } from 'vitest/node'; const vitest = await startVitest('test', ['.evaluclaude/tests/']); const results = vitest.state.getFiles(); ``` #### Grader Interface for Promptfoo ```typescript // graders/deterministic/run-tests.ts export async function getAssert(output: string, context: AssertContext): Promise { const testResults = await runTests(context.vars.runtime); return { pass: testResults.failed === 0, score: testResults.passed / testResults.total, reason: testResults.failed > 0 ? `${testResults.failed} tests failed` : 'All tests passed', componentResults: testResults.tests.map(t => ({ pass: t.outcome === 'passed', score: t.outcome === 'passed' ? 1 : 0, reason: t.outcome, assertion: t.nodeid })) }; } ``` #### Iteration Focus - Isolated test environments (clean state per run) - Timeout handling for long-running tests - Parallel execution where safe - Rich error reporting with stack traces #### Files to Create ``` src/runners/ ├── pytest-runner.ts ├── vitest-runner.ts ├── result-parser.ts └── grader-adapter.ts # Promptfoo GradingResult format graders/ ├── deterministic/ │ ├── run-tests.py # Python grader entry │ └── run-tests.ts # TypeScript grader entry └── templates/ └── grader-wrapper.ts ``` --- ### 5. LLM Rubric Graders (`/src/graders/llm/`) **Complexity**: Medium **Iteration Required**: High (calibration) #### What It Does Uses Claude to grade subjective qualities (code clarity, error message helpfulness, documentation completeness) via structured rubrics. #### Custom Promptfoo Provider ```typescript // providers/evaluclaude-grader.ts import { query, ClaudeAgentOptions } from 'claude-agent-sdk'; export async function call( input: string, // Candidate output to grade context: { vars?: Record; config?: { rubricId: string } } ) { const rubric = loadRubric(context.config.rubricId); const systemPrompt = `You are a deterministic grader. Use the rubric to assign scores. Output JSON only. Do not use randomness. If unsure, choose the lower score.`; const messages = []; for await (const msg of query({ prompt: JSON.stringify({ rubric, candidateOutput: input }), options: { system_prompt: systemPrompt, allowed_tools: [], // No tools needed for grading } })) { messages.push(msg); } return { output: extractGradeFromMessages(messages) }; } ``` #### Rubric Structure ```yaml # rubrics/code-quality.yaml id: code_quality description: Evaluates code changes for quality dimensions: - name: correctness weight: 0.4 scale: min: 1 max: 5 definitions: - score: 5 description: "Code is completely correct, handles all edge cases" - score: 3 description: "Code works for common cases, misses some edge cases" - score: 1 description: "Code has significant bugs or doesn't work" - name: clarity weight: 0.3 scale: min: 1 max: 5 definitions: - score: 5 description: "Code is self-documenting, easy to understand" - score: 3 description: "Code is understandable with some effort" - score: 1 description: "Code is confusing or poorly organized" - name: efficiency weight: 0.3 scale: min: 1 max: 5 definitions: - score: 5 description: "Optimal algorithm and implementation" - score: 3 description: "Reasonable performance, not optimal" - score: 1 description: "Significant performance issues" overall_scoring: weighted_average ``` #### Iteration Focus - Calibration against human judgments - Consistency across runs (temperature=0, fixed seeds if available) - Rubric design for different eval types - Version control for rubrics (drift detection) #### Files to Create ``` src/graders/ ├── llm/ │ ├── provider.ts # Promptfoo custom provider │ ├── rubric-loader.ts │ └── grade-parser.ts └── calibration/ ├── benchmark-cases/ # Known good/bad examples └── calibrator.ts # Compare LLM grades to human rubrics/ ├── code-quality.yaml ├── error-messages.yaml ├── documentation.yaml └── api-design.yaml ``` --- ### 6. Observability & Tracing (`/src/observability/`) **Complexity**: Medium **Iteration Required**: Moderate **Priority**: 🟡 IMPORTANT — Debug-ability depends on this #### What It Does Captures a complete trace of every eval run so you can click into any result and see exactly what happened. No black boxes. #### What Gets Traced Every eval (deterministic OR LLM-graded) produces a trace: ```typescript interface EvalTrace { id: string; timestamp: Date; evalId: string; // Links to Promptfoo test case // What the introspector found introspection: { filesScanned: number; modulesExtracted: number; duration: number; }; // Claude's analysis session analysis: { // Every tool Claude called toolCalls: { tool: string; input: unknown; output: unknown; duration: number; }[]; // Questions asked (if interactive) questions: { question: string; options: string[]; userAnswer: string; }[]; // Claude's reasoning (from thinking blocks if available) reasoning?: string; // Final spec generated specsGenerated: string[]; // IDs of FunctionalTestSpecs }; // Test execution execution: { testsRun: number; passed: number; failed: number; sandboxed: boolean; duration: number; }; // For LLM-graded evals grading?: { rubricUsed: string; dimensionScores: Record; finalScore: number; reasoning: string; }; } ``` #### Hook-Based Collection ```typescript // Automatically collect traces via SDK hooks const traceCollector: TraceCollector = new TraceCollector(); const options: ClaudeAgentOptions = { hooks: { PreToolUse: [{ hooks: [async (input, toolUseId) => { traceCollector.startToolCall(toolUseId, input.tool_name, input.tool_input); return {}; }] }], PostToolUse: [{ hooks: [async (input, toolUseId) => { traceCollector.endToolCall(toolUseId, input.tool_response); return {}; }] }], UserPromptSubmit: [{ hooks: [async (input) => { traceCollector.recordPrompt(input.prompt); return {}; }] }] } }; ``` #### UI Integration Traces are stored as JSON and surfaced in the Promptfoo UI: ```yaml # In generated promptfooconfig.yaml defaultTest: metadata: traceFile: .evaluclaude/traces/{{evalId}}.json ``` When you click an eval in Promptfoo's web UI, you see: 1. **Overview**: Pass/fail, duration, cost 2. **Introspection**: What files were analyzed 3. **Claude's Journey**: Every tool call, every question asked 4. **Reasoning**: Why Claude made the decisions it did 5. **Execution**: Which tests ran, which failed #### Iteration Focus - Efficient storage (traces can get large) - Clean UI formatting (collapsible sections, syntax highlighting) - Linking traces to specific test failures - Diff view for comparing traces between runs #### Files to Create ``` src/observability/ ├── tracer.ts # Hook-based collection ├── trace-store.ts # Persist to .evaluclaude/traces/ ├── trace-viewer.ts # Format for display └── types.ts # EvalTrace interface templates/ └── trace-ui/ # Custom Promptfoo view components ``` --- ## Technology Reference ### Claude Agent SDK | Feature | Usage | |---------|-------| | `query()` | One-off tasks, stateless | | `ClaudeSDKClient` | Multi-turn, sessions, questions | | `AskUserQuestion` | Clarifying questions during generation | | `can_use_tool` | Permission callback for questions | | Local Auth | Uses `claude` CLI authentication | **Key Flags**: - `permission_mode: 'acceptEdits'` — Auto-approve file changes - `allowed_tools: [...]` — Restrict tool access - `setting_sources: ['project']` — Load CLAUDE.md ### Promptfoo | Feature | Usage | |---------|-------| | Python Provider | `file://providers/agent.py` | | Python Assertions | `file://graders/check.py` | | LLM Rubrics | `llm-rubric:` assertion type | | Custom Provider | For Claude Agent SDK integration | | JSON Reports | `promptfoo eval -o results.json` | **Python Grader Return Types**: ```python # Boolean return True # pass return False # fail # Score (0-1) return 0.85 # GradingResult return { 'pass': True, 'score': 0.85, 'reason': 'All checks passed', 'componentResults': [...] } ``` ### Test Runners | Runner | JSON Output | Programmatic API | |--------|-------------|------------------| | pytest | `pytest-json-report` | `pytest.main([...])` | | Vitest | `--reporter=json` | `startVitest('test', [...])` | | Jest | `jest-ctrf-json-reporter` | `runCLI({...})` | --- ## Directory Structure (Final) ``` evaluclaude-harness/ ├── src/ │ ├── cli/ # Commander.js CLI │ │ ├── index.ts │ │ ├── commands/ │ │ │ ├── init.ts # Full analysis + questions │ │ │ ├── generate.ts # Incremental (git diff only) │ │ │ ├── run.ts # Execute evals │ │ │ └── view.ts # Open Promptfoo UI │ │ └── utils/ │ │ │ ├── introspector/ # 🌳 NON-LLM codebase parsing │ │ ├── tree-sitter.ts # Multi-language AST parsing │ │ ├── python-parser.ts # Python-specific extraction │ │ ├── typescript-parser.ts# TS-specific extraction │ │ ├── git-diff.ts # 🔄 Incremental change detection │ │ └── summarizer.ts # RepoSummary generation │ │ │ ├── session/ # Claude SDK wrapper │ │ ├── client.ts # ClaudeSDKClient wrapper │ │ ├── question-handler.ts # 💬 AskUserQuestion UI │ │ ├── modes.ts # Interactive vs non-interactive │ │ └── persistence.ts # Session save/resume │ │ │ ├── observability/ # 👁️ Full tracing │ │ ├── tracer.ts # Hook-based logging │ │ ├── trace-store.ts # Persist traces per eval │ │ └── trace-viewer.ts # Format for UI display │ │ │ ├── analyzer/ # LLM-based analysis │ │ ├── spec-generator.ts # RepoSummary → EvalSpec │ │ └── validator.ts # Validate generated specs │ │ │ ├── renderers/ # Spec → Test code (deterministic) │ │ ├── python/ │ │ │ ├── pytest-renderer.ts │ │ │ └── fixtures.ts │ │ └── typescript/ │ │ ├── vitest-renderer.ts │ │ └── jest-renderer.ts │ │ │ ├── runners/ # 🔒 Sandboxed test execution │ │ ├── sandbox.ts # Isolation wrapper │ │ ├── pytest-runner.ts │ │ ├── vitest-runner.ts │ │ └── result-parser.ts │ │ │ └── graders/ # LLM grading │ ├── llm/ │ │ ├── provider.ts # Promptfoo custom provider │ │ └── rubric-loader.ts │ ├── deterministic/ │ │ └── test-grader.ts │ └── calibration/ │ └── calibrator.ts │ ├── prompts/ # LLM prompts (iterable) │ ├── analyzer-system.md # Core identity + constraints │ ├── analyzer-developer.md # Schema + formatting │ └── analyzer-user.md # Template for RepoSummary │ ├── rubrics/ # Grading rubrics │ ├── code-quality.yaml │ ├── error-messages.yaml │ └── documentation.yaml │ ├── templates/ # Generated file templates │ ├── promptfooconfig.yaml │ └── ... │ ├── tests/ # Our own tests ├── package.json ├── tsconfig.json └── README.md ``` --- ## Development Phases ### Phase 1: Foundation (Days 1-2) - [ ] CLI scaffold with Commander.js - [ ] 🌳 **Tree-sitter introspector** — This is foundational, do it first - [ ] RepoSummary type definitions - [ ] Basic Claude SDK session wrapper ### Phase 2: Analysis (Days 2-4) - [ ] Analyzer prompt v1 (system + developer + user) - [ ] EvalSpec schema + validation - [ ] 💬 **AskUserQuestion flow** — Interactive mode with CLI prompts - [ ] Non-interactive fallback mode ### Phase 3: Observability (Days 3-4) - [ ] 👁️ **Hook-based tracing** — Capture every tool call - [ ] Trace storage (.evaluclaude/traces/) - [ ] Basic trace viewer formatting ### Phase 4: Renderers (Days 4-5) - [ ] Python/pytest renderer - [ ] TypeScript/Vitest renderer - [ ] Spec validation before rendering - [ ] 🔄 **Git-aware incremental** — Only regenerate for changed files ### Phase 5: Execution (Days 5-6) - [ ] 🔒 **Sandbox mode** — Isolated test execution - [ ] Test runners (pytest, Vitest) - [ ] Result parsing and aggregation - [ ] Promptfoo integration ### Phase 6: Grading (Days 6-7) - [ ] LLM grader provider - [ ] Rubric system - [ ] Calibration tooling ### Phase 7: Polish (Day 7+) - [ ] Error handling and recovery - [ ] Trace UI improvements - [ ] Documentation - [ ] Example repos for testing --- ## Key Design Decisions 1. **Claude generates specs, not code**: Test code is deterministically rendered from specs. This ensures reliability and maintainability. 2. **Functional tests only**: Every test must invoke actual code. No syntax checks, no format validation, no "output looks good" assertions. 3. **Language-agnostic schema**: One analyzer prompt, multiple renderers. Adding new languages means adding renderers, not prompts. 4. **Two-mode operation**: Interactive for development (questions allowed), non-interactive for CI (best-effort, no blocking). 5. **Promptfoo as orchestrator**: We do the heavy lifting; Promptfoo handles parallelism, caching, and UI. 6. **🌳 Tree-sitter over token burn**: Never send raw code to Claude for structure extraction. Parse locally, send summaries. 7. **🔄 Incremental by default**: `generate` only re-analyzes git diff. Full analysis is opt-in via `init --full`. 8. **👁️ No black boxes**: Every eval has a trace. You can always see what Claude did and why. 9. **🔒 Sandbox execution**: Generated tests run in isolation. Assume they might do anything. 10. **💬 Conversation > commands**: Claude asks clarifying questions. This isn't a fire-and-forget generator. --- ## Success Criteria - [ ] `npx evaluclaude-harness init` works on a fresh Python/TS repo - [ ] Generated tests actually run and catch real bugs - [ ] LLM graders correlate with human judgment (>80% agreement) - [ ] Full pipeline runs in <5 minutes for medium repo - [ ] Zero manual config required for basic usage - [ ] 🌳 Introspector handles 1000+ file repos in <10 seconds - [ ] 🔄 Incremental `generate` is 10x faster than full `init` - [ ] 👁️ Every eval result is traceable to Claude's decisions - [ ] 💬 Claude asks at least 2-3 clarifying questions on complex repos - [ ] 🔒 No test can escape sandbox to affect host system