# evaluclaude
Zero-to-evals in one command. Claude analyzes your codebase and generates functional tests.
## What is this?
evaluclaude is a CLI tool that uses Claude to understand your codebase and generate real, runnable functional tests. Unlike traditional test generators that produce boilerplate, evaluclaude:
- Parses your code with tree-sitter (no LLM tokens wasted on structure)
- Asks smart questions to understand your testing priorities
- Generates specs, not code — deterministic renderers create the actual tests
- Full observability — every run produces a trace you can inspect
## Quick Start

```bash
# Install
npm install -g evaluclaude-harness

# Run the full pipeline
evaluclaude pipeline .

# Or step by step
evaluclaude intro .                    # Introspect codebase
evaluclaude analyze . -o spec.json -i  # Generate spec (interactive)
evaluclaude render spec.json           # Create test files
evaluclaude run                        # Execute tests
```
## How It Works

```text
┌─────────────────────────────────────────────────────────┐
│                  evaluclaude pipeline                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. INTROSPECT   Parse code with tree-sitter            │
│     📂 → 📋      Extract functions, classes             │
│                                                         │
│  2. ANALYZE      Claude generates EvalSpec              │
│     📋 → 🧠      Asks clarifying questions              │
│                                                         │
│  3. RENDER       Deterministic code generation          │
│     🧠 → 📄      pytest / vitest / jest                 │
│                                                         │
│  4. RUN          Execute in sandbox                     │
│     📄 → 🧪      Collect results + traces               │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
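Conceptually, each stage feeds the next. The sketch below restates that flow in TypeScript; the type and function names here are hypothetical stand-ins for illustration, not evaluclaude's actual internal API.

```typescript
// Hypothetical types and stage stubs; evaluclaude's real internals may differ.
interface CodeMap { functions: string[] }
interface EvalSpec { scenarios: { name: string; target: string }[] }
interface RunResult { passed: number; failed: number; traceId: string }

async function introspect(path: string): Promise<CodeMap> { return { functions: [] }; }   // tree-sitter, no LLM
async function analyze(map: CodeMap): Promise<EvalSpec> { return { scenarios: [] }; }     // Claude call
async function render(spec: EvalSpec): Promise<string> { return ".evaluclaude/tests"; }   // deterministic codegen
async function run(dir: string): Promise<RunResult> {                                     // sandboxed execution
  return { passed: 0, failed: 0, traceId: "trace-xxx" };
}

async function pipeline(path: string): Promise<RunResult> {
  const map = await introspect(path); // 1. INTROSPECT: extract structure, no LLM tokens
  const spec = await analyze(map);    // 2. ANALYZE: Claude emits an EvalSpec
  const dir = await render(spec);     // 3. RENDER: spec -> pytest/vitest/jest files
  return run(dir);                    // 4. RUN: execute, collect results + traces
}
```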
## Commands

### Core Pipeline

| Command | Description |
|---|---|
| `pipeline [path]` | Run the full pipeline: introspect → analyze → render → run |
| `intro [path]` | Introspect codebase with tree-sitter |
| `analyze [path]` | Generate EvalSpec with Claude |
| `render <spec>` | Render EvalSpec to test files |
| `run [test-dir]` | Execute tests and collect results |
### Grading & Rubrics

| Command | Description |
|---|---|
| `grade <input>` | Grade output using LLM rubric |
| `rubrics` | List available rubrics |
| `calibrate` | Calibrate rubric against examples |
### Observability

| Command | Description |
|---|---|
| `view [trace-id]` | View trace details |
| `traces` | List all traces |
| `ui` | Launch Promptfoo dashboard |
| `eval` | Run Promptfoo evaluations |
## Examples

### Analyze a Python project interactively

```bash
evaluclaude analyze ./my-python-project -i -o spec.json
```
Claude will ask questions like:
- "I see 3 database models. Which is the core domain object?"
- "Found 47 utility functions. Want me to prioritize the most-used ones?"
### Focus on specific modules

```bash
evaluclaude pipeline . --focus auth,payments --max-scenarios 20
```
### View test results in browser

```bash
evaluclaude run --export-promptfoo
evaluclaude ui
```
### Skip steps in the pipeline

```bash
# Use existing spec, just run tests
evaluclaude pipeline . --skip-analyze --skip-render

# Generate tests without running
evaluclaude pipeline . --skip-run
```
## Configuration

### Environment Variables

| Variable | Description |
|---|---|
| `ANTHROPIC_API_KEY` | Your Anthropic API key |
### Output Structure

```text
.evaluclaude/
├── spec.json              # Generated EvalSpec
├── traces/                # Execution traces
│   └── trace-xxx.json
├── results/               # Test results
│   └── run-xxx.json
└── promptfooconfig.yaml   # Promptfoo config (with --promptfoo)
```
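Since traces are plain JSON files in a fixed location, you can also inspect them programmatically. Here is a minimal TypeScript sketch based on the layout above; the contents of each trace file are not documented here, so check a real `trace-xxx.json` for the actual schema.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Read every trace produced under .evaluclaude/traces/ (layout shown above).
const traceDir = join(".evaluclaude", "traces");
for (const file of readdirSync(traceDir).filter((f) => f.endsWith(".json"))) {
  // NOTE: the shape of a trace record is an assumption; adapt to the real schema.
  const trace = JSON.parse(readFileSync(join(traceDir, file), "utf8"));
  console.log(file, Object.keys(trace));
}
```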
## Rubrics

Create custom grading rubrics in YAML:

```yaml
# rubrics/my-rubric.yaml
name: my-rubric
description: Custom quality checks
passingThreshold: 0.7
criteria:
  - name: correctness
    description: Code produces correct results
    weight: 0.5
  - name: clarity
    description: Code is clear and readable
    weight: 0.3
  - name: efficiency
    description: Code is reasonably efficient
    weight: 0.2
```
Use it:

```bash
evaluclaude grade output.txt -r my-rubric
```
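With weights that sum to 1.0, the natural reading is that the overall grade is a weighted sum of per-criterion scores, passing when it reaches `passingThreshold`. The TypeScript sketch below shows that scoring rule; how evaluclaude actually combines criteria is an assumption here, not confirmed from the source.

```typescript
interface Criterion { name: string; weight: number }

// ASSUMPTION: overall grade = sum(weight_i * score_i), passing if >= threshold.
function grade(
  criteria: Criterion[],
  scores: Record<string, number>, // per-criterion scores in [0, 1] from the LLM judge
  threshold: number,
): { total: number; passed: boolean } {
  const total = criteria.reduce((sum, c) => sum + c.weight * (scores[c.name] ?? 0), 0);
  return { total, passed: total >= threshold };
}

// With the rubric above: 0.5*0.9 + 0.3*0.8 + 0.2*0.5 = 0.79 >= 0.7, so it passes.
const myRubric = [
  { name: "correctness", weight: 0.5 },
  { name: "clarity", weight: 0.3 },
  { name: "efficiency", weight: 0.2 },
];
console.log(grade(myRubric, { correctness: 0.9, clarity: 0.8, efficiency: 0.5 }, 0.7));
```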
## Architecture

evaluclaude follows four key principles:
- Tree-sitter for introspection — Never send raw code to Claude for structure extraction
- Claude generates specs, not code — EvalSpec JSON is LLM output; test code is deterministic (see the sketch after this list)
- Functional tests only — Every test must invoke actual code, no syntax checks
- Full observability — Every eval run produces an inspectable trace
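To make the "specs, not code" principle concrete, here is a rough TypeScript sketch of a deterministic renderer. The `Scenario` shape and the template are hypothetical, but the point stands: the same spec always renders to the same test file, with no LLM in the loop.

```typescript
// Hypothetical scenario shape; the real schema lives in the generated spec.json.
interface Scenario {
  name: string;      // e.g. "adds two numbers"
  target: string;    // function under test, e.g. "add"
  args: unknown[];   // arguments to call it with
  expected: unknown; // expected return value
}

// Deterministic rendering: pure string templating, so regenerating tests from
// an unchanged spec is reproducible and reviewable in diffs.
function renderVitest(moduleName: string, scenarios: Scenario[]): string {
  const imports = [...new Set(scenarios.map((s) => s.target))].join(", ");
  const cases = scenarios
    .map(
      (s) => `test(${JSON.stringify(s.name)}, () => {
  expect(${s.target}(${s.args.map((a) => JSON.stringify(a)).join(", ")})).toEqual(${JSON.stringify(s.expected)});
});`,
    )
    .join("\n\n");
  return `import { test, expect } from "vitest";\nimport { ${imports} } from ${JSON.stringify(moduleName)};\n\n${cases}\n`;
}

// Each rendered test invokes actual code, in line with the third principle.
console.log(renderVitest("./math", [
  { name: "adds two numbers", target: "add", args: [2, 3], expected: 5 },
]));
```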
## Supported Languages
| Language | Parser | Test Framework |
|---|---|---|
| Python | tree-sitter-python | pytest |
| TypeScript | tree-sitter-typescript | vitest, jest |
| JavaScript | tree-sitter-typescript | vitest, jest |
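For a feel of what the introspection step does, the sketch below extracts Python function names with the node tree-sitter bindings listed in the table above. It is illustrative only: evaluclaude's own introspector presumably extracts more (classes, signatures), and `tree-sitter-python` may need a module declaration in strict TypeScript setups.

```typescript
import Parser from "tree-sitter";
import Python from "tree-sitter-python"; // may need: declare module "tree-sitter-python";

// Pull function names straight from the syntax tree: no LLM tokens spent on structure.
const parser = new Parser();
parser.setLanguage(Python);

const source = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n";
const tree = parser.parse(source);

const names = tree.rootNode
  .descendantsOfType("function_definition")
  .map((node) => node.childForFieldName("name")?.text);

console.log(names); // [ 'add', 'sub' ]
```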
## Development

```bash
# Build
npm run build

# Run in dev mode
npm run dev

# Run tests
npm test

# Type check
npm run typecheck
```
## License
MIT