## What is this?

**evaluclaude** is a CLI tool that uses Claude to understand your codebase and generate real, runnable functional tests. Unlike traditional test generators that produce boilerplate, evaluclaude:

- **Parses your code** with tree-sitter (no LLM tokens wasted on structure)
- **Asks smart questions** to understand your testing priorities
- **Generates specs, not code** — deterministic renderers create the actual tests
- **Full observability** — every run produces a trace you can inspect

## Quick Start

```bash
# Install
npm install -g evaluclaude-harness
export ANTHROPIC_API_KEY=your-key

# Run the full pipeline
evaluclaude pipeline .

# Or step by step
evaluclaude intro .                    # Introspect codebase
evaluclaude analyze . -o spec.json -i  # Generate spec (interactive)
evaluclaude render spec.json           # Create test files
evaluclaude run                        # Execute tests
```

## How It Works

```
┌─────────────────────────────────────────────────────────┐
│                  evaluclaude pipeline                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. INTROSPECT   Parse code with tree-sitter            │
│     📂 → 📋      Extract functions, classes             │
│                                                         │
│  2. ANALYZE      Claude generates EvalSpec              │
│     📋 → 🧠      Asks clarifying questions              │
│                                                         │
│  3. RENDER       Deterministic code generation          │
│     🧠 → 📄      pytest / vitest / jest                 │
│                                                         │
│  4. RUN          Execute in sandbox                     │
│     📄 → 🧪      Collect results + traces               │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

## Commands

### Core Pipeline

| Command | Description |
|---------|-------------|
| `pipeline [path]` | Run the full pipeline: introspect → analyze → render → run |
| `intro [path]` | Introspect codebase with tree-sitter |
| `analyze [path]` | Generate EvalSpec with Claude |
| `render <spec>` | Render EvalSpec to test files |
| `run [test-dir]` | Execute tests and collect results |

### Grading & Rubrics

| Command | Description |
|---------|-------------|
| `grade <input>` | Grade output using LLM rubric |
| `rubrics` | List available rubrics |
| `calibrate` | Calibrate rubric against examples |

### Observability

| Command | Description |
|---------|-------------|
| `view [trace-id]` | View trace details |
| `traces` | List all traces |
| `ui` | Launch Promptfoo dashboard |
| `eval` | Run Promptfoo evaluations |

## Examples

### Analyze a Python project interactively

```bash
evaluclaude analyze ./my-python-project -i -o spec.json
```

Claude will ask questions like:

- "I see 3 database models. Which is the core domain object?"
- "Found 47 utility functions. Want me to prioritize the most-used ones?"

### Focus on specific modules

```bash
evaluclaude pipeline . --focus auth,payments --max-scenarios 20
```

### View test results in browser

```bash
evaluclaude run --export-promptfoo
evaluclaude ui
```

### Skip steps in the pipeline

```bash
# Use existing spec, just run tests
evaluclaude pipeline . --skip-analyze --skip-render

# Generate tests without running
evaluclaude pipeline . --skip-run
```

## Configuration

### Environment Variables

| Variable | Description |
|----------|-------------|
| `ANTHROPIC_API_KEY` | Your Anthropic API key |

### Output Structure

```
.evaluclaude/
├── spec.json              # Generated EvalSpec
├── traces/                # Execution traces
│   └── trace-xxx.json
├── results/               # Test results
│   └── run-xxx.json
└── promptfooconfig.yaml   # Promptfoo config (with --promptfoo)
```

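If you want to poke at the newest trace by hand rather than through `evaluclaude view`, a small script can do it. This is a hypothetical helper, not part of the CLI; it only assumes the directory layout above and that traces are JSON files:

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: load the lexicographically latest trace file from
// .evaluclaude/traces/. `evaluclaude view [trace-id]` is the supported path.
function latestTrace(dir = ".evaluclaude/traces"): any {
  const files = readdirSync(dir).filter((f) => f.endsWith(".json")).sort();
  if (files.length === 0) throw new Error("no traces yet");
  return JSON.parse(readFileSync(join(dir, files[files.length - 1]), "utf8"));
}
```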
## Rubrics

Create custom grading rubrics in YAML:

```yaml
# rubrics/my-rubric.yaml
name: my-rubric
description: Custom quality checks
passingThreshold: 0.7

criteria:
  - name: correctness
    description: Code produces correct results
    weight: 0.5
  - name: clarity
    description: Code is clear and readable
    weight: 0.3
  - name: efficiency
    description: Code is reasonably efficient
    weight: 0.2
```

Use it:

```bash
evaluclaude grade output.txt -r my-rubric
```

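The scoring model implied by the YAML can be sketched in a few lines of TypeScript. This is a hypothetical illustration (field names mirror the rubric above; the scoring code inside evaluclaude may differ): each criterion gets a grade in [0, 1], and the overall score is the weight-normalized sum compared against `passingThreshold`:

```typescript
// Hypothetical sketch of weighted rubric scoring; mirrors the YAML fields
// above but is not evaluclaude's actual implementation.
interface Criterion { name: string; weight: number; }
interface Rubric { name: string; passingThreshold: number; criteria: Criterion[]; }

// Overall score = sum of weight * grade, normalized by total weight.
function scoreRubric(
  rubric: Rubric,
  grades: Record<string, number>,
): { score: number; passed: boolean } {
  const totalWeight = rubric.criteria.reduce((s, c) => s + c.weight, 0);
  const score =
    rubric.criteria.reduce((s, c) => s + c.weight * (grades[c.name] ?? 0), 0) /
    totalWeight;
  return { score, passed: score >= rubric.passingThreshold };
}

const myRubric: Rubric = {
  name: "my-rubric",
  passingThreshold: 0.7,
  criteria: [
    { name: "correctness", weight: 0.5 },
    { name: "clarity", weight: 0.3 },
    { name: "efficiency", weight: 0.2 },
  ],
};

// 0.5*1 + 0.3*0.5 + 0.2*1 = 0.85 >= 0.7, so this output passes.
console.log(scoreRubric(myRubric, { correctness: 1, clarity: 0.5, efficiency: 1 }));
```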
## Architecture

evaluclaude follows four key principles:

1. **Tree-sitter for introspection** — Never send raw code to Claude for structure extraction
2. **Claude generates specs, not code** — EvalSpec JSON is LLM output; test code is deterministic
3. **Functional tests only** — Every test must invoke actual code, no syntax checks
4. **Full observability** — Every eval run produces an inspectable trace

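Principle 2 can be illustrated with a toy renderer. The scenario shape below is hypothetical (the real EvalSpec schema produced by `evaluclaude analyze` is richer), but it shows the idea: the LLM emits data, and test code falls out deterministically:

```typescript
// Hypothetical, minimal scenario shape for illustration only; the real
// EvalSpec schema is richer.
interface Scenario { name: string; call: string; expect: string; }

// Deterministic rendering: the same spec always yields the same pytest file,
// with no LLM involved in this step.
function renderPytest(scenarios: Scenario[]): string {
  return scenarios
    .map((s) => `def test_${s.name}():\n    assert ${s.call} == ${s.expect}\n`)
    .join("\n");
}

const spec: Scenario[] = [
  { name: "add_small", call: "add(1, 2)", expect: "3" },
  { name: "add_negative", call: "add(-1, 1)", expect: "0" },
];

console.log(renderPytest(spec));
```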
## Supported Languages

| Language | Parser | Test Framework |
|----------|--------|----------------|
| Python | tree-sitter-python | pytest |
| TypeScript | tree-sitter-typescript | vitest, jest |
| JavaScript | tree-sitter-typescript | vitest, jest |

## How This Was Built

This project was built in a few hours using [Amp Code](https://ampcode.com). You can explore the development threads:

- [Initial setup and CLI structure](https://ampcode.com/threads/T-019bae58-69c2-74d0-a975-4be84c7d98dc)
- [Introspection and tree-sitter integration](https://ampcode.com/threads/T-019bafc5-0d57-702a-a407-44b9b884b9d0)
- [EvalSpec analysis with Claude](https://ampcode.com/threads/T-019baf57-7bc1-7368-9e3d-d649da47b68b)
- [Test rendering and framework support](https://ampcode.com/threads/T-019baeef-6079-70d6-9c4a-3cfd439190f1)
- [Grading and rubrics system](https://ampcode.com/threads/T-019baf12-e566-733d-b086-d099880c77c1)
- [Promptfoo integration](https://ampcode.com/threads/T-019baf43-abcf-7715-8a65-0f5ac5df87ce)
- [UI polish and final touches](https://ampcode.com/threads/T-019baf63-8c9e-7018-b8bc-538c5a3cada7)

## Development

```bash
npm run build      # Build
npm run dev        # Run in dev mode
npm test           # Run tests
npm run typecheck  # Type check
```

## License