mirror of https://github.com/harivansh-afk/eval-skill.git
synced 2026-04-15 06:04:42 +00:00

iterate

parent aca2126c88
commit 7c63331389

5 changed files with 520 additions and 664 deletions

342 README.md
@@ -1,293 +1,187 @@
# eval-skill

Give Claude a verification loop. Define acceptance criteria before implementation, let Claude check its own work.
Verification-first development for Claude Code. Define what success looks like, then let Claude build and verify.

## The Problem
## Why

> *"How will the agent know it did the right thing?"*
> — [Thorsten Ball](https://x.com/thorstenball)

Without verification, Claude implements and hopes. With verification, Claude implements and **knows**.
Without a feedback loop, Claude implements and hopes. With one, Claude implements, checks, and iterates until it's right.

## The Solution
## How It Works

```
┌─────────────────────────────────────────────────────────────┐
│ 1. SKILL: eval │
│ "Create evals for auth" │
│ → Generates .claude/evals/auth.yaml │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. AGENT: eval-verifier │
│ "/eval verify auth" │
│ → Runs checks │
│ → Collects evidence (screenshots, outputs) │
│ → Generates executable tests │
│ → Reports pass/fail │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. OUTPUT │
│ .claude/evals/.evidence/auth/ ← Screenshots, logs │
│ tests/generated/test_auth.py ← Executable tests │
└─────────────────────────────────────────────────────────────┘
You: "Build auth with email/password"
│
▼
┌─────────────────────────────────────┐
│ Skill: eval │
│ Generates: │
│ • verification spec (tests) │
│ • building spec (what to build) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Agent: builder │
│ Implements from building spec │
│ Clean context, focused on code │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Agent: verifier │
│ Runs checks, collects evidence │
│ Returns pass/fail │
└─────────────────────────────────────┘
│
▼
Pass? Done.
Fail? → Builder fixes → Verifier checks → Loop
```

Each agent has isolated context. Builder doesn't hold verification logic. Verifier doesn't hold implementation details. Clean, focused, efficient.
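The build and verify loop in the diagram can be sketched in a few lines of Python. This is a rough illustration, not the skill's actual orchestration code: `Verdict`, `build_until_verified`, and the `max_rounds` cap are hypothetical names introduced here, and the builder and verifier are stand-in callables rather than real Claude Code agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Result of one verifier run (hypothetical shape)."""
    passed: bool
    failures: List[str]  # human-readable reasons; empty on pass


def build_until_verified(
    build: Callable[[Optional[List[str]]], None],
    verify: Callable[[], Verdict],
    max_rounds: int = 5,  # assumption: some cap so the loop cannot run forever
) -> Verdict:
    """Alternate builder and verifier until the verifier passes.

    The builder only ever sees the failures from the previous round
    (None on the first round). The verifier gets nothing from the builder.
    """
    failures: Optional[List[str]] = None
    verdict = Verdict(passed=False, failures=["not run"])
    for _ in range(max_rounds):
        build(failures)          # builder implements or fixes
        verdict = verify()       # verifier checks the result independently
        if verdict.passed:
            break
        failures = verdict.failures
    return verdict
```

Passing only `failures` back to the builder is the point of the sketch: the loop carries the current failure forward and nothing else.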

## Install

```bash
git clone https://github.com/yourusername/eval-skill.git
cd eval-skill

# Install to current project
./install.sh

# Or install globally (all projects)
./install.sh --global
./install.sh  # Current project
./install.sh --global  # All projects
```

## Usage

### 1. Create Evals (Before Implementation)
### Step 1: Create Specs

```
> Create evals for user authentication
Create evals for user authentication with email/password
```

Claude generates `.claude/evals/auth.yaml`:
Creates `.claude/evals/auth.yaml`:

```yaml
name: auth
description: Email/password authentication

test_output:
  framework: pytest
  path: tests/generated/
building_spec:
  description: Email/password auth with login/signup
  requirements:
    - Password hashing with bcrypt
    - JWT tokens on login
    - /login and /signup endpoints

verify:
  # Deterministic
verification_spec:
  - type: command
    run: "npm test -- --grep 'auth'"
    run: "npm test -- --grep auth"
    expect: exit_code 0

  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt|argon2"

  # Agent-based (with evidence + test generation)
    pattern: "bcrypt"

  - type: agent
    name: ui-login
    name: login-flow
    prompt: |
      1. Go to /login
      2. Enter test@example.com / password123
      3. Submit
      4. Verify redirect to /dashboard
    evidence:
      - screenshot: after-login
      - url: contains "/dashboard"
      1. POST /api/login with valid creds
      2. Verify JWT in response
      3. POST with wrong password
      4. Verify 401 + helpful error
    generate_test: true
```
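A spec like this can be sanity-checked before any agent runs. The sketch below assumes the spec has already been parsed into a plain dict (for example by a YAML loader) and uses the `verification_spec` key. The `REQUIRED_FIELDS` table is an assumption inferred from the examples in this README, not a schema the skill defines, and `validate_spec` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Assumption: required keys per check type, inferred from the README examples.
REQUIRED_FIELDS = {
    "command": {"run", "expect"},
    "file-exists": {"path"},
    "file-contains": {"path", "pattern"},
    "file-not-contains": {"path", "pattern"},
    "agent": {"name", "prompt"},
}


def validate_spec(spec: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed eval spec (empty if none)."""
    problems: List[str] = []
    if not spec.get("name"):
        problems.append("missing top-level 'name'")
    checks = spec.get("verification_spec") or []
    if not checks:
        problems.append("'verification_spec' has no checks")
    for index, check in enumerate(checks):
        kind = check.get("type")
        if kind not in REQUIRED_FIELDS:
            problems.append(f"check {index}: unknown type {kind!r}")
            continue
        # Report each required key that the check does not define.
        for field in sorted(REQUIRED_FIELDS[kind] - set(check)):
            problems.append(f"check {index} ({kind}): missing '{field}'")
    return problems
```

An empty list means every check names a known type and carries the fields that type needs.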

### 2. Implement
### Step 2: Build

```
> Implement auth based on .claude/evals/auth.yaml
/eval build auth
```

### 3. Verify
Spawns builder agent → implements → spawns verifier → checks → iterates until pass.

```
> /eval verify auth
```

Output:

```
🔍 Eval: auth
═══════════════════════════════════════

Deterministic:
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt in password.ts

Agent:
  ✅ ui-login: Dashboard redirect works
     📸 Evidence: 2 screenshots saved
     📄 Test: tests/generated/test_auth_ui_login.py

═══════════════════════════════════════
📊 Results: 3/3 passed
```

### 4. Run Generated Tests (Forever)
### Step 3: Run Generated Tests (Forever)

```bash
pytest tests/generated/
```

The agent converted its semantic verification into deterministic tests.

## How It Works

### Non-Deterministic → Deterministic

Agent checks are semantic: "verify login works." But we need proof.

1. **Verifier runs the check** (browser automation, API calls, file inspection)
2. **Collects evidence** (screenshots, responses, DOM snapshots)
3. **Generates executable test** (pytest/vitest)
4. **Future runs use the test** (no agent needed)

```
Agent Check (expensive)  →  Evidence (proof)        →  Test (cheap, repeatable)
         ↓                         ↓                            ↓
   "Login works"          screenshot + url check        pytest + playwright
```

### Evidence-Based Verification

The verifier can't just say "pass." It must provide evidence:

```yaml
- type: agent
  name: login-flow
  prompt: "Verify login redirects to dashboard"
  evidence:
    - screenshot: login-page
    - screenshot: after-submit
    - url: contains "/dashboard"
    - element: '[data-testid="welcome"]'
```

Evidence is saved to `.claude/evals/.evidence/<eval>/`:

```json
{
  "eval": "auth",
  "checks": [{
    "name": "login-flow",
    "pass": true,
    "evidence": [
      {"type": "screenshot", "path": "login-page.png"},
      {"type": "screenshot", "path": "after-submit.png"},
      {"type": "url", "expected": "contains /dashboard", "actual": "http://localhost:3000/dashboard"},
      {"type": "element", "selector": "[data-testid=welcome]", "found": true}
    ]
  }]
}
```
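The "no evidence, no pass" rule is easy to audit after the fact. The sketch below reads a report shaped like the JSON above and lists any check that claims to pass without attaching evidence. `unsupported_passes` is a hypothetical helper, and the only field names it relies on are the ones shown in the example (`checks`, `name`, `pass`, `evidence`).

```python
import json
from typing import List


def unsupported_passes(evidence_json: str) -> List[str]:
    """Return names of checks marked as passing with no evidence attached."""
    report = json.loads(evidence_json)
    return [
        check["name"]
        for check in report.get("checks", [])
        if check.get("pass") and not check.get("evidence")
    ]
```

Running it over the contents of `.claude/evals/.evidence/auth/evidence.json` should give an empty list for a report the verifier filled in properly.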

## Check Types

### Deterministic (Fast, No Agent)

```yaml
# Command + exit code
- type: command
  run: "pytest tests/"
  expect: exit_code 0

# Command + output
- type: command
  run: "curl localhost:3000/health"
  expect:
    contains: '"status":"ok"'

# File exists
- type: file-exists
  path: src/feature.ts

# File contains pattern
- type: file-contains
  path: src/auth.ts
  pattern: "bcrypt"

# File does NOT contain
- type: file-not-contains
  path: .env
  pattern: "sk-"
```
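None of these checks needs an agent; each one reduces to a subprocess call or a file read. The following sketch shows one way a runner could evaluate them, assuming each check arrives as a dict with the keys used above. `run_check` is a hypothetical function, `pattern` is treated as a regular expression (suggested by `"bcrypt|argon2"` elsewhere in this README), and a missing file counts as "does not contain", which is a choice made here rather than documented behavior.

```python
import re
import subprocess
from pathlib import Path
from typing import Any, Dict


def run_check(check: Dict[str, Any], root: Path = Path(".")) -> bool:
    """Evaluate one deterministic check relative to the project root."""
    kind = check["type"]

    if kind == "command":
        result = subprocess.run(
            check["run"], shell=True, cwd=root, capture_output=True, text=True
        )
        expect = check["expect"]
        if isinstance(expect, dict):
            # expect: {contains: "..."} -> substring match on stdout
            return expect["contains"] in result.stdout
        # expect: "exit_code N" -> compare against the process exit status
        return result.returncode == int(expect.split()[-1])

    path = root / check["path"]
    if kind == "file-exists":
        return path.exists()

    if kind in ("file-contains", "file-not-contains"):
        found = path.is_file() and re.search(check["pattern"], path.read_text()) is not None
        return found if kind == "file-contains" else not found

    raise ValueError(f"not a deterministic check type: {kind!r}")
```

Because `shell=True` executes the command string as written, a runner like this should only be pointed at spec files you trust.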

### Agent (Semantic, Evidence-Based)

```yaml
- type: agent
  name: descriptive-name
  prompt: |
    Step-by-step verification instructions
  evidence:
    - screenshot: step-name
    - url: contains "pattern"
    - element: "css-selector"
    - text: "expected text"
    - response: status 200
  generate_test: true  # Write executable test
```
Agent checks become deterministic tests. First run costs tokens. Future runs are free.

## Commands

| Command | Description |
|---------|-------------|
| Command | What it does |
|---------|--------------|
| `/eval list` | List all evals |
| `/eval show <name>` | Display eval spec |
| `/eval verify <name>` | Run verification |
| `/eval verify` | Run all evals |
| `/eval evidence <name>` | Show collected evidence |
| `/eval tests` | List generated tests |
| `/eval clean` | Remove evidence + generated tests |
| `/eval show <name>` | Display spec |
| `/eval build <name>` | Build + verify loop |
| `/eval verify <name>` | Just verify, no build |

## Why Context Isolation Matters

**Without isolation:**
```
Main Claude context:
- All verification logic
- All implementation code
- All error history
- Context bloat → degraded performance
```

**With isolation:**
```
Builder context: building spec + current failure only
Verifier context: verification spec + current code only
Main Claude: just orchestration
```

Each agent gets exactly what it needs. Nothing more.
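One way to picture the split is as two small projections of the same spec. The sketch below assumes the spec is a parsed dict with `building_spec` and `verification_spec` keys. `builder_context` and `verifier_context` are hypothetical helpers, and the "current code" the verifier inspects is assumed to live on disk, so it is not passed around here.

```python
from typing import Any, Dict, List, Optional


def builder_context(spec: Dict[str, Any], failure: Optional[List[str]] = None) -> Dict[str, Any]:
    """What the builder sees: the building spec plus the current failure only."""
    return {
        "building_spec": spec["building_spec"],
        "current_failure": failure,  # None on the first attempt
    }


def verifier_context(spec: Dict[str, Any]) -> Dict[str, Any]:
    """What the verifier sees: the verification spec, nothing about how it was built."""
    return {"verification_spec": spec["verification_spec"]}
```

Neither dict contains the other agent's half of the spec, and error history beyond the current failure is never forwarded.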

## Check Types

**Deterministic** (fast, no agent):
```yaml
- type: command
  run: "npm test"
  expect: exit_code 0

- type: file-contains
  path: src/auth.ts
  pattern: "bcrypt"
```

**Agent** (semantic, generates tests):
```yaml
- type: agent
  name: ui-login
  prompt: "Navigate to /login, submit form, verify redirect"
  evidence:
    - screenshot: after-login
    - url: contains "/dashboard"
  generate_test: true
```

Agent checks produce evidence (screenshots, responses) and become executable tests.

## Directory Structure

```
.claude/
├── skills/eval/SKILL.md # Eval generation skill
├── agents/eval-verifier.md # Verification agent
├── commands/eval.md # /eval command
├── skills/eval/ # Generates specs
├── agents/
│   ├── eval-builder.md
│   └── eval-verifier.md
├── commands/eval.md
└── evals/
    ├── auth.yaml # Your eval specs
    ├── checkout.yaml
    └── .evidence/
        ├── auth/
        │   ├── evidence.json
        │   └── *.png
        └── checkout/
            └── ...
    ├── auth.yaml
    └── .evidence/ # Screenshots, logs

tests/
└── generated/ # Tests written by verifier
    ├── test_auth_ui_login.py
    └── test_auth_api_login.py
tests/generated/ # Tests from agent checks
```

## Requirements

- Claude Code with skills/agents/commands support
- Claude Code
- For UI testing: `npm install -g @anthropic/agent-browser`

## Philosophy

**TDD for Agents:**

| Traditional TDD | Agent TDD |
|----------------|-----------|
| Write tests | Write evals |
| Write code | Claude writes code |
| Tests pass | Claude verifies + generates tests |

**Why generate tests?**

Agent verification is expensive (tokens, time). But once verified, we encode that verification as a test. Future runs use the test — no agent needed.

**Mix deterministic and semantic:**

- Deterministic: "tests pass", "file exists", "command succeeds"
- Semantic: "UI looks right", "error is helpful", "code is readable"

Use deterministic where possible, semantic where necessary.

## License

MIT