Harivansh Rathi 2026-01-14 00:11:59 -08:00
parent aca2126c88
commit 7c63331389
5 changed files with 520 additions and 664 deletions

342
README.md

@@ -1,293 +1,187 @@
# eval-skill
Give Claude a verification loop. Define acceptance criteria before implementation, let Claude check its own work.
Verification-first development for Claude Code. Define what success looks like, then let Claude build and verify.
## The Problem
## Why
> *"How will the agent know it did the right thing?"*
> — [Thorsten Ball](https://x.com/thorstenball)
Without verification, Claude implements and hopes. With verification, Claude implements and **knows**.
Without a feedback loop, Claude implements and hopes. With one, Claude implements, checks, and iterates until it's right.
## The Solution
## How It Works
```
┌─────────────────────────────────────────────────────────────┐
│ 1. SKILL: eval │
│ "Create evals for auth" │
│ → Generates .claude/evals/auth.yaml │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. AGENT: eval-verifier │
│ "/eval verify auth" │
│ → Runs checks │
│ → Collects evidence (screenshots, outputs) │
│ → Generates executable tests │
│ → Reports pass/fail │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. OUTPUT │
│ .claude/evals/.evidence/auth/ ← Screenshots, logs │
│ tests/generated/test_auth.py ← Executable tests │
└─────────────────────────────────────────────────────────────┘
You: "Build auth with email/password"
┌─────────────────────────────────────┐
│ Skill: eval │
│ Generates: │
│ • verification spec (tests) │
│ • building spec (what to build) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Agent: builder │
│ Implements from building spec │
│ Clean context, focused on code │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Agent: verifier │
│ Runs checks, collects evidence │
│ Returns pass/fail │
└─────────────────────────────────────┘
Pass? Done.
Fail? → Builder fixes → Verifier checks → Loop
```
Each agent has isolated context. Builder doesn't hold verification logic. Verifier doesn't hold implementation details. Clean, focused, efficient.
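The build/verify loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the skill's actual implementation: `build_verify_loop`, `VerifyResult`, and the `max_rounds` cap are hypothetical names introduced here, and the two agents are stood in for by plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VerifyResult:
    """Outcome of one verifier run (hypothetical shape)."""
    passed: bool
    failures: List[str] = field(default_factory=list)


def build_verify_loop(
    build: Callable[[dict, List[str]], None],
    verify: Callable[[list], VerifyResult],
    spec: dict,
    max_rounds: int = 5,  # assumption: a cap so the loop always terminates
) -> VerifyResult:
    """Alternate builder and verifier until the checks pass or rounds run out."""
    failures: List[str] = []
    result = VerifyResult(passed=False)
    for _ in range(max_rounds):
        # The builder only sees the building spec and the latest failures.
        build(spec["building_spec"], failures)
        # The verifier only sees the verification spec.
        result = verify(spec["verification_spec"])
        if result.passed:
            break
        failures = result.failures
    return result
```

Passing only the relevant half of the spec to each callable mirrors the context isolation: neither side needs the other's instructions, and the orchestrator carries nothing but the most recent failure list.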
## Install
```bash
git clone https://github.com/yourusername/eval-skill.git
cd eval-skill
# Install to current project
./install.sh
# Or install globally (all projects)
./install.sh --global
./install.sh # Current project
./install.sh --global # All projects
```
## Usage
### 1. Create Evals (Before Implementation)
### Step 1: Create Specs
```
> Create evals for user authentication
Create evals for user authentication with email/password
```
Claude generates `.claude/evals/auth.yaml`:
Creates `.claude/evals/auth.yaml`:
```yaml
name: auth
description: Email/password authentication
test_output:
framework: pytest
path: tests/generated/
building_spec:
description: Email/password auth with login/signup
requirements:
- Password hashing with bcrypt
- JWT tokens on login
- /login and /signup endpoints
verify:
# Deterministic
verification_spec:
- type: command
run: "npm test -- --grep 'auth'"
run: "npm test -- --grep auth"
expect: exit_code 0
- type: file-contains
path: src/auth/password.ts
pattern: "bcrypt|argon2"
# Agent-based (with evidence + test generation)
pattern: "bcrypt"
- type: agent
name: ui-login
name: login-flow
prompt: |
1. Go to /login
2. Enter test@example.com / password123
3. Submit
4. Verify redirect to /dashboard
evidence:
- screenshot: after-login
- url: contains "/dashboard"
1. POST /api/login with valid creds
2. Verify JWT in response
3. POST with wrong password
4. Verify 401 + helpful error
generate_test: true
```
### 2. Implement
### Step 2: Build
```
> Implement auth based on .claude/evals/auth.yaml
/eval build auth
```
### 3. Verify
Spawns builder agent → implements → spawns verifier → checks → iterates until pass.
```
> /eval verify auth
```
Output:
```
🔍 Eval: auth
═══════════════════════════════════════
Deterministic:
✅ command: npm test (exit 0)
✅ file-contains: bcrypt in password.ts
Agent:
✅ ui-login: Dashboard redirect works
📸 Evidence: 2 screenshots saved
📄 Test: tests/generated/test_auth_ui_login.py
═══════════════════════════════════════
📊 Results: 3/3 passed
```
### 4. Run Generated Tests (Forever)
### Step 3: Run Generated Tests (Forever)
```bash
pytest tests/generated/
```
The agent converted its semantic verification into deterministic tests.
## How It Works
### Non-Deterministic → Deterministic
Agent checks are semantic: "verify login works." But we need proof.
1. **Verifier runs the check** (browser automation, API calls, file inspection)
2. **Collects evidence** (screenshots, responses, DOM snapshots)
3. **Generates executable test** (pytest/vitest)
4. **Future runs use the test** (no agent needed)
```
Agent Check (expensive) → Evidence (proof) → Test (cheap, repeatable)
↓ ↓ ↓
"Login works" screenshot + url check pytest + playwright
```
### Evidence-Based Verification
The verifier can't just say "pass." It must provide evidence:
```yaml
- type: agent
name: login-flow
prompt: "Verify login redirects to dashboard"
evidence:
- screenshot: login-page
- screenshot: after-submit
- url: contains "/dashboard"
- element: '[data-testid="welcome"]'
```
Evidence is saved to `.claude/evals/.evidence/<eval>/`:
```json
{
"eval": "auth",
"checks": [{
"name": "login-flow",
"pass": true,
"evidence": [
{"type": "screenshot", "path": "login-page.png"},
{"type": "screenshot", "path": "after-submit.png"},
{"type": "url", "expected": "contains /dashboard", "actual": "http://localhost:3000/dashboard"},
{"type": "element", "selector": "[data-testid=welcome]", "found": true}
]
}]
}
```
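Because the verifier is not supposed to report a pass without proof, a report in this shape can be sanity-checked mechanically. The sketch below assumes the JSON layout shown above; the function name `unsupported_passes` is hypothetical and not part of the skill.

```python
import json


def unsupported_passes(evidence_json: str) -> list:
    """Return the names of checks marked as passing with no evidence attached."""
    report = json.loads(evidence_json)
    return [
        check["name"]
        for check in report["checks"]
        # A pass with a missing or empty evidence list is treated as suspect.
        if check["pass"] and not check.get("evidence")
    ]
```

An empty result means every passing check carries at least one evidence entry. It says nothing about whether that evidence is convincing.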
## Check Types
### Deterministic (Fast, No Agent)
```yaml
# Command + exit code
- type: command
run: "pytest tests/"
expect: exit_code 0
# Command + output
- type: command
run: "curl localhost:3000/health"
expect:
contains: '"status":"ok"'
# File exists
- type: file-exists
path: src/feature.ts
# File contains pattern
- type: file-contains
path: src/auth.ts
pattern: "bcrypt"
# File does NOT contain
- type: file-not-contains
path: .env
pattern: "sk-"
```
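To make the semantics of these check types concrete, here is a minimal sketch of how one deterministic check could be evaluated once the YAML has been parsed into a Python dict. It is an assumption-laden illustration rather than the verifier's real code: `run_check` is a hypothetical name, `pattern` is treated as a regular expression, and a missing file is treated as passing `file-not-contains`.

```python
import re
import subprocess
from pathlib import Path


def run_check(check: dict, root: Path = Path(".")) -> bool:
    """Return True if a single deterministic check passes."""
    kind = check["type"]
    if kind == "command":
        proc = subprocess.run(
            check["run"], shell=True, cwd=root, capture_output=True, text=True
        )
        expect = check.get("expect", "exit_code 0")
        if isinstance(expect, dict):
            # Mapping form: expect: {contains: "..."} is matched against stdout.
            return expect["contains"] in proc.stdout
        # String form: "exit_code N".
        return proc.returncode == int(expect.split()[-1])
    if kind == "file-exists":
        return (root / check["path"]).is_file()
    if kind in ("file-contains", "file-not-contains"):
        path = root / check["path"]
        # Assumption: the pattern is a regex; a missing file contains nothing.
        found = path.is_file() and re.search(check["pattern"], path.read_text()) is not None
        return found if kind == "file-contains" else not found
    raise ValueError(f"unknown check type: {kind}")
```

Since none of these branches needs a model call, they can run on every iteration of the loop at essentially no cost, which is why they are listed separately from agent checks.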
### Agent (Semantic, Evidence-Based)
```yaml
- type: agent
name: descriptive-name
prompt: |
Step-by-step verification instructions
evidence:
- screenshot: step-name
- url: contains "pattern"
- element: "css-selector"
- text: "expected text"
- response: status 200
generate_test: true # Write executable test
```
Agent checks become deterministic tests. First run costs tokens. Future runs are free.
## Commands
| Command | Description |
|---------|-------------|
| Command | What it does |
|---------|--------------|
| `/eval list` | List all evals |
| `/eval show <name>` | Display eval spec |
| `/eval verify <name>` | Run verification |
| `/eval verify` | Run all evals |
| `/eval evidence <name>` | Show collected evidence |
| `/eval tests` | List generated tests |
| `/eval clean` | Remove evidence + generated tests |
| `/eval show <name>` | Display spec |
| `/eval build <name>` | Build + verify loop |
| `/eval verify <name>` | Just verify, no build |
## Why Context Isolation Matters
**Without isolation:**
```
Main Claude context:
- All verification logic
- All implementation code
- All error history
- Context bloat → degraded performance
```
**With isolation:**
```
Builder context: building spec + current failure only
Verifier context: verification spec + current code only
Main Claude: just orchestration
```
Each agent gets exactly what it needs. Nothing more.
## Check Types
**Deterministic** (fast, no agent):
```yaml
- type: command
run: "npm test"
expect: exit_code 0
- type: file-contains
path: src/auth.ts
pattern: "bcrypt"
```
**Agent** (semantic, generates tests):
```yaml
- type: agent
name: ui-login
prompt: "Navigate to /login, submit form, verify redirect"
evidence:
- screenshot: after-login
- url: contains "/dashboard"
generate_test: true
```
Agent checks produce evidence (screenshots, responses) and become executable tests.
## Directory Structure
```
.claude/
├── skills/eval/SKILL.md # Eval generation skill
├── agents/eval-verifier.md # Verification agent
├── commands/eval.md # /eval command
├── skills/eval/ # Generates specs
├── agents/
│ ├── eval-builder.md
│ └── eval-verifier.md
├── commands/eval.md
└── evals/
├── auth.yaml # Your eval specs
├── checkout.yaml
└── .evidence/
├── auth/
│ ├── evidence.json
│ └── *.png
└── checkout/
└── ...
├── auth.yaml
└── .evidence/ # Screenshots, logs
tests/
└── generated/ # Tests written by verifier
├── test_auth_ui_login.py
└── test_auth_api_login.py
tests/generated/ # Tests from agent checks
```
## Requirements
- Claude Code with skills/agents/commands support
- Claude Code
- For UI testing: `npm install -g @anthropic/agent-browser`
## Philosophy
**TDD for Agents:**
| Traditional TDD | Agent TDD |
|----------------|-----------|
| Write tests | Write evals |
| Write code | Claude writes code |
| Tests pass | Claude verifies + generates tests |
**Why generate tests?**
Agent verification is expensive (tokens, time). But once verified, we encode that verification as a test. Future runs use the test — no agent needed.
**Mix deterministic and semantic:**
- Deterministic: "tests pass", "file exists", "command succeeds"
- Semantic: "UI looks right", "error is helpful", "code is readable"
Use deterministic where possible, semantic where necessary.
## License
MIT