From 7062b5334469044c1cb3397118a8ec83b74bd7d9 Mon Sep 17 00:00:00 2001 From: Harivansh Rathi Date: Sun, 11 Jan 2026 20:56:41 -0500 Subject: [PATCH] readme --- README.md | 216 ++++++++++++++---------------------------------------- 1 file changed, 54 insertions(+), 162 deletions(-) diff --git a/README.md b/README.md index f251ac7..7d45008 100644 --- a/README.md +++ b/README.md @@ -6,195 +6,87 @@ ![Node](https://img.shields.io/badge/node-%3E%3D18.0.0-green) ![License](https://img.shields.io/badge/license-MIT-brightgreen) -## What is this? - -**evaluclaude** is a CLI tool that uses Claude to understand your codebase and generate real, runnable functional tests. Unlike traditional test generators that produce boilerplate, evaluclaude: - -- **Parses your code** with tree-sitter (no LLM tokens wasted on structure) -- **Asks smart questions** to understand your testing priorities -- **Generates specs, not code** — deterministic renderers create the actual tests -- **Full observability** — every run produces a trace you can inspect +A CLI tool that uses Claude to understand your codebase and generate real, runnable functional tests. Tree-sitter parses your code structure, Claude generates test specs, and deterministic renderers create the actual tests. ## Quick Start ```bash -# Install npm install -g evaluclaude-harness +export ANTHROPIC_API_KEY=your-key # Run the full pipeline evaluclaude pipeline . # Or step by step -evaluclaude intro . # Introspect codebase +evaluclaude intro . # Parse codebase evaluclaude analyze . -o spec.json -i # Generate spec (interactive) -evaluclaude render spec.json # Create test files -evaluclaude run # Execute tests +evaluclaude render spec.json # Create test files +evaluclaude run # Execute tests ``` ## How It Works ``` -┌─────────────────────────────────────────────────────────┐ -│ evaluclaude pipeline │ -├─────────────────────────────────────────────────────────┤ -│ │ -│ 1. 
INTROSPECT Parse code with tree-sitter │ -│ 📂 → 📋 Extract functions, classes │ -│ │ -│ 2. ANALYZE Claude generates EvalSpec │ -│ 📋 → 🧠 Asks clarifying questions │ -│ │ -│ 3. RENDER Deterministic code generation │ -│ 🧠 → 📄 pytest / vitest / jest │ -│ │ -│ 4. RUN Execute in sandbox │ -│ 📄 → 🧪 Collect results + traces │ -│ │ -└─────────────────────────────────────────────────────────┘ + INTROSPECT ANALYZE RENDER RUN + Parse code -> Generate -> Create test -> Execute + with tree-sitter EvalSpec files (pytest, & trace + with Claude vitest, jest) ``` ## Commands -### Core Pipeline - -| Command | Description | -|---------|-------------| -| `pipeline [path]` | Run the full pipeline: introspect → analyze → render → run | -| `intro [path]` | Introspect codebase with tree-sitter | -| `analyze [path]` | Generate EvalSpec with Claude | -| `render ` | Render EvalSpec to test files | -| `run [test-dir]` | Execute tests and collect results | - -### Grading & Rubrics - -| Command | Description | -|---------|-------------| -| `grade ` | Grade output using LLM rubric | -| `rubrics` | List available rubrics | -| `calibrate` | Calibrate rubric against examples | - -### Observability - -| Command | Description | -|---------|-------------| -| `view [trace-id]` | View trace details | -| `traces` | List all traces | -| `ui` | Launch Promptfoo dashboard | -| `eval` | Run Promptfoo evaluations | - -## Examples - -### Analyze a Python project interactively - -```bash -evaluclaude analyze ./my-python-project -i -o spec.json -``` - -Claude will ask questions like: -- "I see 3 database models. Which is the core domain object?" -- "Found 47 utility functions. Want me to prioritize the most-used ones?" - -### Focus on specific modules - -```bash -evaluclaude pipeline . 
--focus auth,payments --max-scenarios 20 -``` - -### View test results in browser - -```bash -evaluclaude run --export-promptfoo -evaluclaude ui -``` - -### Skip steps in the pipeline - -```bash -# Use existing spec, just run tests -evaluclaude pipeline . --skip-analyze --skip-render - -# Generate tests without running -evaluclaude pipeline . --skip-run -``` - -## Configuration - -### Environment Variables - -| Variable | Description | -|----------|-------------| -| `ANTHROPIC_API_KEY` | Your Anthropic API key | - -### Output Structure - -``` -.evaluclaude/ -├── spec.json # Generated EvalSpec -├── traces/ # Execution traces -│ └── trace-xxx.json -├── results/ # Test results -│ └── run-xxx.json -└── promptfooconfig.yaml # Promptfoo config (with --promptfoo) -``` - -## Rubrics - -Create custom grading rubrics in YAML: - -```yaml -# rubrics/my-rubric.yaml -name: my-rubric -description: Custom quality checks -passingThreshold: 0.7 - -criteria: - - name: correctness - description: Code produces correct results - weight: 0.5 - - name: clarity - description: Code is clear and readable - weight: 0.3 - - name: efficiency - description: Code is reasonably efficient - weight: 0.2 -``` - -Use it: -```bash -evaluclaude grade output.txt -r my-rubric -``` - -## Architecture - -evaluclaude follows key principles: - -1. **Tree-sitter for introspection** — Never send raw code to Claude for structure extraction -2. **Claude generates specs, not code** — EvalSpec JSON is LLM output; test code is deterministic -3. **Functional tests only** — Every test must invoke actual code, no syntax checks -4. 
**Full observability** — Every eval run produces an inspectable trace

+| Command | Description |
+| :------------------------- | :-------------------------------------------- |
+| `pipeline [path]` | Run full pipeline: intro -> analyze -> render -> run |
+| `intro [path]` | Introspect codebase with tree-sitter |
+| `analyze [path]` | Generate EvalSpec with Claude |
+| `render <spec>` | Render EvalSpec to test files |
+| `run [test-dir]` | Execute tests and collect results |
+| `grade <file>` | Grade output using LLM rubric |
+| `rubrics` | List available rubrics |
+| `calibrate` | Calibrate rubric against examples |
+| `view [trace-id]` | View trace details |
+| `traces` | List all traces |
+| `ui` | Launch Promptfoo dashboard |
+| `eval` | Run Promptfoo evaluations |

 ## Supported Languages

-| Language | Parser | Test Framework |
-|----------|--------|----------------|
-| Python | tree-sitter-python | pytest |
-| TypeScript | tree-sitter-typescript | vitest, jest |
-| JavaScript | tree-sitter-typescript | vitest, jest |
+| Language | Parser | Test Framework |
+| :------------ | :----------------------- | :-------------- |
+| Python | tree-sitter-python | pytest |
+| TypeScript | tree-sitter-typescript | vitest, jest |
+| JavaScript | tree-sitter-typescript | vitest, jest |
+
+## Output Structure
+
+```
+.evaluclaude/
+  spec.json # Generated EvalSpec
+  traces/ # Execution traces
+  results/ # Test results
+  promptfooconfig.yaml # Promptfoo config
+```
+
+## How This Was Built
+
+This project was built in a few hours using [Amp Code](https://ampcode.com). 
You can explore the development threads: + +- [Initial setup and CLI structure](https://ampcode.com/threads/T-019bae58-69c2-74d0-a975-4be84c7d98dc) +- [Introspection and tree-sitter integration](https://ampcode.com/threads/T-019bafc5-0d57-702a-a407-44b9b884b9d0) +- [EvalSpec analysis with Claude](https://ampcode.com/threads/T-019baf57-7bc1-7368-9e3d-d649da47b68b) +- [Test rendering and framework support](https://ampcode.com/threads/T-019baeef-6079-70d6-9c4a-3cfd439190f1) +- [Grading and rubrics system](https://ampcode.com/threads/T-019baf12-e566-733d-b086-d099880c77c1) +- [Promptfoo integration](https://ampcode.com/threads/T-019baf43-abcf-7715-8a65-0f5ac5df87ce) +- [UI polish and final touches](https://ampcode.com/threads/T-019baf63-8c9e-7018-b8bc-538c5a3cada7) ## Development ```bash -# Build -npm run build - -# Run in dev mode -npm run dev - -# Run tests -npm test - -# Type check -npm run typecheck +npm run build # Build +npm run dev # Dev mode +npm test # Run tests +npm run typecheck # Type check ``` ## License