evaluclaude

Zero-to-evals in one command. Claude analyzes your codebase and generates functional tests.

evaluclaude is a CLI tool that generates real, runnable functional tests in three stages: tree-sitter parses your code structure, Claude generates a test spec (an EvalSpec), and deterministic renderers turn that spec into the actual test files.

Quick Start

npm install -g evaluclaude-harness
export ANTHROPIC_API_KEY=your-key

# Run the full pipeline
evaluclaude pipeline .

# Or step by step
evaluclaude intro .                    # Parse codebase
evaluclaude analyze . -o spec.json -i  # Generate spec (interactive)
evaluclaude render spec.json           # Create test files
evaluclaude run                        # Execute tests

How It Works

  INTROSPECT          ANALYZE            RENDER             RUN
  Parse code    ->    Generate     ->    Create test   ->   Execute
  with tree-sitter    EvalSpec           files (pytest,     & trace
                      with Claude        vitest, jest)

Commands

Command	Description
`pipeline [path]`	Run full pipeline: intro -> analyze -> render -> run
`intro [path]`	Introspect codebase with tree-sitter
`analyze [path]`	Generate EvalSpec with Claude
`render <spec>`	Render EvalSpec to test files
`run [test-dir]`	Execute tests and collect results
`grade <input>`	Grade output using LLM rubric
`rubrics`	List available rubrics
`calibrate`	Calibrate rubric against examples
`view [trace-id]`	View trace details
`traces`	List all traces
`ui`	Launch Promptfoo dashboard
`eval`	Run Promptfoo evaluations

Supported Languages

Language	Parser	Test Framework
Python	tree-sitter-python	pytest
TypeScript	tree-sitter-typescript	vitest, jest
JavaScript	tree-sitter-typescript	vitest, jest
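
To give a feel for what introspection might extract from a Python file, here is a minimal sketch that uses Python's standard-library `ast` module as a stand-in for tree-sitter. It is not the tool's implementation, and the `introspect` helper and the fields it collects are assumptions; the real `intro` command may record different information.

```python
import ast

SOURCE = '''
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b

class Greeter:
    def greet(self, name):
        return f"hello {name}"
'''


def introspect(source: str) -> list[dict]:
    """Collect name, parameters, line, and docstring for each function."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.append({
                "name": node.name,
                "params": [arg.arg for arg in node.args.args],
                "line": node.lineno,
                "doc": ast.get_docstring(node),
            })
    # Sort by source position so the result is stable.
    return sorted(found, key=lambda f: f["line"])


if __name__ == "__main__":
    for fn in introspect(SOURCE):
        print(fn["name"], fn["params"])
```

A structural summary like this is much smaller than the raw source, which is presumably why parsing comes before the Claude step in the pipeline.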

Output Structure

.evaluclaude/
  spec.json              # Generated EvalSpec
  traces/                # Execution traces
  results/               # Test results
  promptfooconfig.yaml   # Promptfoo config
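
A small sketch for checking which of these entries exist after a run. The entry names come from the layout above; the `check_output` helper itself is hypothetical and not part of the CLI.

```python
from pathlib import Path

# Entry names are taken from the layout above.
EXPECTED = ["spec.json", "traces", "results", "promptfooconfig.yaml"]


def check_output(root: Path) -> dict[str, bool]:
    """Report which expected entries exist under <root>/.evaluclaude."""
    out_dir = root / ".evaluclaude"
    return {name: (out_dir / name).exists() for name in EXPECTED}


if __name__ == "__main__":
    for name, present in check_output(Path(".")).items():
        print(f"{name}: {'found' if present else 'missing'}")
```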

How This Was Built

This project was built in a few hours using Amp Code.

Development

npm run build      # Build
npm run dev        # Dev mode
npm test           # Run tests
npm run typecheck  # Type check

License

MIT
