mirror of
https://github.com/harivansh-afk/eval-skill.git
synced 2026-04-15 07:04:46 +00:00
iterate
Commit 7c63331389 (parent aca2126c88)
5 changed files with 520 additions and 664 deletions
README.md (354 lines)

@@ -1,293 +1,187 @@
# eval-skill

Verification-first development for Claude Code. Define what success looks like, then let Claude build and verify.

## Why

> *"How will the agent know it did the right thing?"*
> — [Thorsten Ball](https://x.com/thorstenball)

Without a feedback loop, Claude implements and hopes. With one, Claude implements, checks, and iterates until it's right.

## How It Works

```
You: "Build auth with email/password"
                  │
                  ▼
┌─────────────────────────────────────┐
│ Skill: eval                         │
│ Generates:                          │
│   • verification spec (tests)       │
│   • building spec (what to build)   │
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│ Agent: builder                      │
│ Implements from building spec       │
│ Clean context, focused on code      │
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│ Agent: verifier                     │
│ Runs checks, collects evidence      │
│ Returns pass/fail                   │
└─────────────────────────────────────┘
                  │
                  ▼
Pass? Done.
Fail? → Builder fixes → Verifier checks → Loop
```

Each agent has isolated context. The builder doesn't hold verification logic; the verifier doesn't hold implementation details. Clean, focused, efficient.
## Install

```bash
git clone https://github.com/yourusername/eval-skill.git
cd eval-skill

./install.sh           # Current project
./install.sh --global  # All projects
```
## Usage

### Step 1: Create Specs

```
Create evals for user authentication with email/password
```

Creates `.claude/evals/auth.yaml`:

```yaml
name: auth

building_spec:
  description: Email/password auth with login/signup
  requirements:
    - Password hashing with bcrypt
    - JWT tokens on login
    - /login and /signup endpoints

verification_spec:
  - type: command
    run: "npm test -- --grep auth"
    expect: exit_code 0

  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt"

  - type: agent
    name: login-flow
    prompt: |
      1. POST /api/login with valid creds
      2. Verify JWT in response
      3. POST with wrong password
      4. Verify 401 + helpful error
    generate_test: true
```
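The deterministic entries in a spec like this are mechanical to execute. Below is a minimal sketch of a runner for the `command` and `file-contains` check types. It is an illustration only, assuming PyYAML and the exact field names shown above; it is not the skill's actual implementation:

```python
import re
import subprocess

import yaml  # PyYAML, assumed available


def run_deterministic_checks(spec_path: str) -> list[tuple[str, bool]]:
    """Run the command and file-contains checks from a verification_spec."""
    with open(spec_path) as f:
        spec = yaml.safe_load(f)

    results = []
    for check in spec.get("verification_spec", []):
        if check["type"] == "command":
            # "expect: exit_code 0" -> compare against the process exit code
            proc = subprocess.run(check["run"], shell=True, capture_output=True)
            expected = int(str(check["expect"]).split()[-1])
            results.append((check["run"], proc.returncode == expected))
        elif check["type"] == "file-contains":
            # Treat the pattern as a regex, so values like "bcrypt|argon2" work
            try:
                text = open(check["path"]).read()
                ok = re.search(check["pattern"], text) is not None
            except FileNotFoundError:
                ok = False
            results.append((check["path"], ok))
        # type: agent checks need a live agent and are skipped here
    return results
```

Agent-type checks cannot be scripted this way; that is exactly why they are delegated to the verifier agent.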
### Step 2: Build

```
/eval build auth
```

Spawns builder agent → implements → spawns verifier → checks → iterates until pass.

### Step 3: Run Generated Tests (Forever)

```bash
pytest tests/generated/
```

Agent checks become deterministic tests. First run costs tokens. Future runs are free.
## Commands

| Command | What it does |
|---------|--------------|
| `/eval list` | List all evals |
| `/eval show <name>` | Display spec |
| `/eval build <name>` | Build + verify loop |
| `/eval verify <name>` | Just verify, no build |
## Why Context Isolation Matters

**Without isolation:**

```
Main Claude context:
- All verification logic
- All implementation code
- All error history
- Context bloat → degraded performance
```

**With isolation:**

```
Builder context:  building spec + current failure only
Verifier context: verification spec + current code only
Main Claude:      just orchestration
```

Each agent gets exactly what it needs. Nothing more.
## Check Types

**Deterministic** (fast, no agent):

```yaml
- type: command
  run: "npm test"
  expect: exit_code 0

- type: file-contains
  path: src/auth.ts
  pattern: "bcrypt"
```

**Agent** (semantic, generates tests):

```yaml
- type: agent
  name: ui-login
  prompt: "Navigate to /login, submit form, verify redirect"
  evidence:
    - screenshot: after-login
    - url: contains "/dashboard"
  generate_test: true
```

Agent checks produce evidence (screenshots, responses) and become executable tests.
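Evidence lands under `.claude/evals/.evidence/<eval>/`. A sketch of how a verifier could append a record there; the JSON shape (an `eval` name plus a `checks` list with `name`, `pass`, and `evidence` entries) is illustrative, and the helper name is an assumption:

```python
import json
from pathlib import Path


def save_evidence(eval_name: str, check_name: str, passed: bool,
                  items: list[dict], root: str = ".claude/evals/.evidence") -> Path:
    """Append one check's evidence record to <root>/<eval>/evidence.json."""
    out_dir = Path(root) / eval_name
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "evidence.json"
    # Load the existing record if present, otherwise start a fresh one
    data = json.loads(path.read_text()) if path.exists() else {"eval": eval_name, "checks": []}
    data["checks"].append({"name": check_name, "pass": passed, "evidence": items})
    path.write_text(json.dumps(data, indent=2))
    return path
```

Keeping evidence append-only means `/eval evidence <name>` can show the full history of a check across runs.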
## Directory Structure

```
.claude/
├── skills/eval/          # Generates specs
├── agents/
│   ├── eval-builder.md
│   └── eval-verifier.md
├── commands/eval.md
└── evals/
    ├── auth.yaml
    ├── checkout.yaml
    └── .evidence/        # Screenshots, logs

tests/generated/          # Tests from agent checks
```
## Requirements

- Claude Code
- For UI testing: `npm install -g @anthropic/agent-browser`

## License

MIT
agents/eval-builder.md (new file, 111 lines)

@@ -0,0 +1,111 @@
---
name: eval-builder
description: Implementation agent that builds features from building specs. Use when running /eval build.
tools: Read, Write, Edit, Bash, Grep, Glob
model: sonnet
permissionMode: acceptEdits
---

# Eval Builder Agent

I implement features based on building specs. I don't verify — that's the verifier's job.

## My Responsibilities

1. Read the building spec from the eval YAML
2. Implement the requirements
3. Write clean, working code
4. Report what I built

## What I Do NOT Do

- Run verification checks (verifier does this)
- Collect evidence (verifier does this)
- Generate tests (verifier does this)
- Decide if my work is correct (verifier does this)

## Input

I receive:

1. **Eval spec path**: `.claude/evals/<name>.yaml`
2. **Failure context** (if retrying): What failed and why

## Process

### First Run

1. Read the eval spec
2. Extract the `building_spec` section
3. Understand the requirements
4. Implement the feature
5. Report files created/modified

### Retry (After Failure)

1. Read the failure feedback from the verifier
2. Understand what went wrong
3. Fix the specific issue
4. Report what I changed

## Building Spec Format

```yaml
building_spec:
  description: What to build (high-level)
  requirements:
    - Specific requirement 1
    - Specific requirement 2
  constraints:
    - Must use library X
    - Must follow pattern Y
  files:
    - src/auth/login.ts
    - src/auth/password.ts
```

## Output Format

```
📦 Implementation Complete
═══════════════════════════════════════

Files Created:
  + src/auth/login.ts
  + src/auth/password.ts
  + src/auth/types.ts

Files Modified:
  ~ src/routes/index.ts (added auth routes)

Summary:
  Implemented email/password auth with bcrypt hashing
  and JWT token generation on login.

Ready for verification.
```

## On Retry

```
🔧 Fixing: error-handling check failed
═══════════════════════════════════════

Issue: Error messages not helpful
  Expected: "Invalid email or password"
  Actual: "Error 401"

Fix Applied:
  ~ src/auth/login.ts
    - Changed generic error to descriptive message
    - Added error codes for client handling

Ready for re-verification.
```

## Guidelines

1. **Read the spec carefully** — understand before coding
2. **Follow requirements exactly** — don't add unrequested features
3. **Write clean code** — the codebase standards apply
4. **Be minimal on retry** — fix only what failed, don't refactor
5. **Report clearly** — say what you did so the verifier knows what to check
commands/eval.md (239 lines)

@@ -1,169 +1,162 @@
---
description: Eval commands - list, show, build, verify
argument-hint: list | show <name> | build <name> | verify <name>
allowed-tools: Read, Bash, Task
---

# /eval Command

## Commands

### /eval list

List all evals:

```
Available evals:
  auth       Email/password authentication
  checkout   E-commerce checkout flow
```
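A dispatcher for this subcommand only needs to scan the specs directory. A minimal sketch, assuming each spec carries a top-level `description:` field as in the examples above (the helper name is an assumption):

```python
from pathlib import Path


def list_evals(evals_dir: str = ".claude/evals") -> list[tuple[str, str]]:
    """Return (name, description) for each eval spec YAML in evals_dir."""
    results = []
    for spec in sorted(Path(evals_dir).glob("*.yaml")):
        description = ""
        # Grab the first top-level description: line without a YAML parser
        for line in spec.read_text().splitlines():
            if line.startswith("description:"):
                description = line.split(":", 1)[1].strip()
                break
        results.append((spec.stem, description))
    return results
```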
### /eval show <name>

Display the full eval spec.

### /eval build <name>

**The main command.** Orchestrates the build → verify → fix loop.

```
/eval build auth
```

Flow:

1. Spawn **eval-builder** with the building_spec
2. Builder implements, returns
3. Spawn **eval-verifier** with the verification_spec
4. Verifier checks, returns pass/fail
5. If fail → spawn builder with failure context → goto 3
6. If pass → done

Output:

```
🔨 Building: auth
═══════════════════════════════════════

[Builder] Implementing...
  + src/auth/password.ts
  + src/auth/jwt.ts
  + src/routes/auth.ts

[Verifier] Checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ❌ api-login: Wrong status code
     Expected: 401 on bad password
     Actual: 500

[Builder] Fixing api-login...
  ~ src/routes/auth.ts

[Verifier] Re-checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ✅ api-login: Correct responses
  📄 Test: tests/generated/test_auth_api_login.py

═══════════════════════════════════════
📊 Build complete: 3/3 checks passed
Iterations: 2
Tests generated: 1
```

### /eval verify <name>

Just verify, don't build. For checking existing code.

```
/eval verify auth
```

Spawns verifier only. Reports pass/fail with evidence.

### /eval verify

Run all evals:

```
/eval verify
```

### /eval evidence <name>

Show collected evidence:

```
Evidence: auth
  - api-login-001.png
  - ui-login-001.png
  - evidence.json
```

### /eval tests

List generated tests:

```
Generated tests:
  tests/generated/test_auth_api_login.py
  tests/generated/test_auth_ui_login.py
```

### /eval clean

Remove evidence and generated tests.
## Orchestration Logic

For `/eval build`:

```python
max_iterations = 5
iteration = 0

# Initial build
builder_result = spawn_agent("eval-builder", {
    "spec": f".claude/evals/{name}.yaml",
    "task": "implement"
})

while iteration < max_iterations:
    # Verify
    verifier_result = spawn_agent("eval-verifier", {
        "spec": f".claude/evals/{name}.yaml"
    })

    if verifier_result.all_passed:
        return success(verifier_result)

    # Fix failures
    builder_result = spawn_agent("eval-builder", {
        "spec": f".claude/evals/{name}.yaml",
        "task": "fix",
        "failures": verifier_result.failures
    })

    iteration += 1

return failure("Max iterations reached")
```

## Context Flow

```
Main Claude
  │
  ├─→ Builder  (context: building_spec only)
  │     └─→ Returns: files created
  │
  ├─→ Verifier (context: verification_spec only)
  │     └─→ Returns: pass/fail + evidence
  │
  └─→ Builder  (context: building_spec + failure only)
        └─→ Returns: files fixed
```

Each agent gets minimal, focused context. No bloat.
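The build → verify → fix loop is easy to exercise end to end with stubbed agents. In this sketch, `spawn_builder` and `spawn_verifier` are placeholder callables standing in for the real agent spawns (the names and `VerifyResult` shape are assumptions, not the command's actual API):

```python
from dataclasses import dataclass, field


@dataclass
class VerifyResult:
    failures: list = field(default_factory=list)

    @property
    def all_passed(self) -> bool:
        return not self.failures


def build_loop(spawn_builder, spawn_verifier, max_iterations: int = 5):
    """Build, then verify and fix until the verifier passes or iterations run out."""
    spawn_builder(task="implement", failures=None)
    for iteration in range(1, max_iterations + 1):
        result = spawn_verifier()
        if result.all_passed:
            return True, iteration
        spawn_builder(task="fix", failures=result.failures)
    return False, max_iterations


# Stub verifier that fails once, then passes on the retry
calls = []
def fake_verifier() -> VerifyResult:
    calls.append(1)
    return VerifyResult(failures=["api-login"] if len(calls) == 1 else [])

passed, iterations = build_loop(lambda **kw: None, fake_verifier)
# passed is True, iterations is 2
```

Bounding the loop with `max_iterations` is what keeps a persistently failing check from burning tokens forever.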
install.sh (148 lines)

@@ -1,154 +1,48 @@
#!/bin/bash
|
#!/bin/bash
|
||||||
set -euo pipefail
|
set -euo pipefail
|
||||||
|
|
||||||
# Eval Skill Installer
|
echo "eval-skill installer"
|
||||||
# Installs the eval system: skill + verifier agent + command
|
echo "===================="
|
||||||
|
|
||||||
echo "╔══════════════════════════════════════╗"
|
|
||||||
echo "║ Eval Skill Installer ║"
|
|
||||||
echo "╚══════════════════════════════════════╝"
|
|
||||||
echo ""
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
|
||||||
# Parse args
|
|
||||||
INSTALL_GLOBAL=false
|
|
||||||
TARGET_DIR=".claude"
|
TARGET_DIR=".claude"
|
||||||
|
|
||||||
while [[ $# -gt 0 ]]; do
|
while [[ $# -gt 0 ]]; do
|
||||||
case $1 in
|
case $1 in
|
||||||
--global|-g)
|
--global|-g) TARGET_DIR="$HOME/.claude"; shift ;;
|
||||||
INSTALL_GLOBAL=true
|
|
||||||
TARGET_DIR="$HOME/.claude"
|
|
||||||
shift
|
|
||||||
;;
|
|
||||||
--help|-h)
|
--help|-h)
|
||||||
echo "Usage: ./install.sh [OPTIONS]"
|
echo "Usage: ./install.sh [--global]"
|
||||||
echo ""
|
|
||||||
echo "Options:"
|
|
||||||
echo " --global, -g Install to ~/.claude (all projects)"
|
echo " --global, -g Install to ~/.claude (all projects)"
|
||||||
echo " --help, -h Show this help"
|
echo " Default: ./.claude (current project)"
|
||||||
echo ""
|
exit 0 ;;
|
||||||
echo "Default: Install to ./.claude (current project)"
|
*) echo "Unknown: $1"; exit 1 ;;
|
||||||
exit 0
|
|
||||||
;;
|
|
||||||
*)
|
|
||||||
echo "Unknown option: $1"
|
|
||||||
exit 1
|
|
||||||
;;
|
|
||||||
esac
|
esac
|
||||||
done
|
done
|
||||||
|
|
||||||
if [ "$INSTALL_GLOBAL" = true ]; then
|
echo "Installing to: $TARGET_DIR"
|
||||||
echo "📍 Installing globally: $TARGET_DIR"
|
|
||||||
else
|
|
||||||
echo "📍 Installing to project: $(pwd)/$TARGET_DIR"
|
|
||||||
fi
|
|
||||||
echo ""
|
|
||||||
|
|
||||||
# Create directories
|
# Create dirs
|
||||||
echo "Creating directories..."
|
|
||||||
mkdir -p "$TARGET_DIR/skills/eval"
|
mkdir -p "$TARGET_DIR/skills/eval"
|
||||||
mkdir -p "$TARGET_DIR/commands"
|
mkdir -p "$TARGET_DIR/commands"
|
||||||
mkdir -p "$TARGET_DIR/agents"
|
mkdir -p "$TARGET_DIR/agents"
|
||||||
mkdir -p "$TARGET_DIR/evals"
|
mkdir -p "$TARGET_DIR/evals"
|
||||||
|
|
||||||
# Install skill
|
# Install files
|
||||||
echo "Installing eval skill..."
|
|
||||||
cp "$SCRIPT_DIR/skills/eval/SKILL.md" "$TARGET_DIR/skills/eval/SKILL.md"
|
cp "$SCRIPT_DIR/skills/eval/SKILL.md" "$TARGET_DIR/skills/eval/SKILL.md"
|
||||||
echo " ✅ $TARGET_DIR/skills/eval/SKILL.md"
|
cp "$SCRIPT_DIR/agents/eval-builder.md" "$TARGET_DIR/agents/eval-builder.md"
|
||||||
|
|
||||||
# Install verifier agent
|
|
||||||
echo "Installing eval-verifier agent..."
|
|
||||||
cp "$SCRIPT_DIR/agents/eval-verifier.md" "$TARGET_DIR/agents/eval-verifier.md"
|
cp "$SCRIPT_DIR/agents/eval-verifier.md" "$TARGET_DIR/agents/eval-verifier.md"
|
||||||
echo " ✅ $TARGET_DIR/agents/eval-verifier.md"
|
|
||||||
|
|
||||||
# Install command
|
|
||||||
echo "Installing /eval command..."
|
|
||||||
cp "$SCRIPT_DIR/commands/eval.md" "$TARGET_DIR/commands/eval.md"
|
cp "$SCRIPT_DIR/commands/eval.md" "$TARGET_DIR/commands/eval.md"
|
||||||
echo " ✅ $TARGET_DIR/commands/eval.md"
|
|
||||||
|
|
||||||
# Create example eval
|
echo "✓ Installed"
|
||||||
if [ ! -f "$TARGET_DIR/evals/example.yaml" ]; then
|
|
||||||
echo "Creating example eval..."
|
|
||||||
cat > "$TARGET_DIR/evals/example.yaml" << 'EOF'
|
|
||||||
name: example
|
|
||||||
description: Example eval demonstrating the format
|
|
||||||
|
|
||||||
test_output:
|
|
||||||
framework: pytest
|
|
||||||
path: tests/generated/
|
|
||||||
|
|
||||||
verify:
|
|
||||||
# === DETERMINISTIC CHECKS ===
|
|
||||||
|
|
||||||
- type: file-exists
|
|
||||||
path: README.md
|
|
||||||
|
|
||||||
- type: command
|
|
||||||
run: "echo 'hello world'"
|
|
||||||
expect: exit_code 0
|
|
||||||
|
|
||||||
# === AGENT CHECKS ===
|
|
||||||
|
|
||||||
- type: agent
|
|
||||||
name: readme-quality
|
|
||||||
prompt: |
|
|
||||||
Read README.md and verify:
|
|
||||||
1. Has a title/heading
|
|
||||||
2. Explains what the project does
|
|
||||||
3. Has installation instructions
|
|
||||||
evidence:
|
|
||||||
- text: "# "
|
|
||||||
generate_test: false # Subjective, no test
|
|
||||||
EOF
|
|
||||||
echo " ✅ $TARGET_DIR/evals/example.yaml"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Check dependencies
|
|
||||||
echo ""
|
echo ""
|
||||||
echo "Checking optional dependencies..."
|
echo "Components:"
|
||||||
if command -v agent-browser &> /dev/null; then
|
echo " Skill: $TARGET_DIR/skills/eval/"
|
||||||
echo " ✅ agent-browser installed"
|
echo " Builder: $TARGET_DIR/agents/eval-builder.md"
|
||||||
else
|
echo " Verifier: $TARGET_DIR/agents/eval-verifier.md"
|
||||||
echo " ⚠️ agent-browser not found (needed for UI testing)"
|
echo " Command: $TARGET_DIR/commands/eval.md"
|
||||||
echo " npm install -g @anthropic/agent-browser"
|
echo " Evals: $TARGET_DIR/evals/"
|
||||||
fi
|
|
||||||
|
|
||||||
# Success
echo ""
echo "╔══════════════════════════════════════╗"
echo "║        Installation Complete         ║"
echo "╚══════════════════════════════════════╝"
echo ""
echo "What was installed:"
echo ""
echo "  📋 Skill: eval"
echo "     Generates eval specs (YAML)"
echo "     Location: $TARGET_DIR/skills/eval/"
echo ""
echo "  🤖 Agent: eval-verifier"
echo "     Runs checks, collects evidence, generates tests"
echo "     Location: $TARGET_DIR/agents/"
echo ""
echo "  ⌨️  Command: /eval"
echo "     CLI: list | show | verify"
echo "     Location: $TARGET_DIR/commands/"
echo ""
echo "  📁 Evals directory: $TARGET_DIR/evals/"
echo "     Your eval specs go here"
echo ""
echo "Usage:"
echo ""
echo "  Create evals:  'Create evals for [feature]'"
echo "  Build+verify:  /eval build <name>"
echo "  Verify only:   /eval verify <name>"
echo "  Run tests:     pytest tests/generated/"
echo ""
@@ -1,280 +1,244 @@
---
name: eval
description: Generate evaluation specs with building and verification criteria. Use when setting up features, defining acceptance criteria, or before implementing anything significant. Triggers on "create evals", "set up verification", "define acceptance criteria", or "build [feature]".
allowed-tools: Read, Grep, Glob, Write, Edit
---

# Eval Skill

Generate specs that define **what to build** and **how to verify it**.

## Output

I create `.claude/evals/<name>.yaml` with two sections:

1. **building_spec** — What the builder agent implements
2. **verification_spec** — What the verifier agent checks

## Format
```yaml
name: feature-name
description: One-line summary

building_spec:
  description: What to build
  requirements:
    - Requirement 1
    - Requirement 2
  constraints:
    - Constraint 1
  files:
    - suggested/file/paths.ts

test_output:
  framework: pytest | vitest | jest
  path: tests/generated/

verification_spec:
  # Deterministic checks
  - type: command
    run: "npm test"
    expect: exit_code 0

  # Agent checks
  - type: agent
    name: check-name
    prompt: |
      What to verify
    evidence:
      - screenshot: name
      - url: contains "pattern"
    generate_test: true
```
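Once written, a spec like this is plain YAML with two top-level sections that route to different agents. A minimal sketch of that routing, using the parsed form of the spec as a Python dict (illustrative only — the actual skill and agents may consume it differently):

```python
# Sketch: the parsed form of a spec like the one above, and how its two
# sections route to the builder and verifier agents.
spec = {
    "name": "feature-name",
    "building_spec": {
        "requirements": ["Requirement 1", "Requirement 2"],
        "constraints": ["Constraint 1"],
    },
    "verification_spec": [
        {"type": "command", "run": "npm test", "expect": "exit_code 0"},
        {"type": "agent", "name": "check-name", "generate_test": True},
    ],
}

def route(spec: dict) -> dict:
    """Builder sees building_spec; verifier sees verification_spec."""
    return {
        "builder": spec.get("building_spec", {}),
        "verifier": spec.get("verification_spec", []),
    }

work = route(spec)
print(len(work["builder"]["requirements"]))  # 2
print(work["verifier"][0]["type"])           # command
```

The point of the split is that neither agent needs the other's half: the builder never sees the checks, so it cannot write to the test.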
## Workflow

### User Request

```
Create evals for user authentication
```

### My Questions

Before generating, I ask:

1. What auth method? (email/password, OAuth, magic link?)
2. UI, API, or both?
3. Specific security requirements?

### My Output

`.claude/evals/auth.yaml`:
```yaml
name: auth
description: Email/password authentication with UI and API

building_spec:
  description: |
    User authentication system with email/password.
    Secure password storage, JWT tokens, login/signup flows.
  requirements:
    - Password hashing with bcrypt (cost factor 12+)
    - JWT tokens with 24h expiry
    - POST /api/auth/login endpoint
    - POST /api/auth/signup endpoint
    - Login page at /login
    - Signup page at /signup
    - Protected route middleware
  constraints:
    - No plaintext passwords anywhere
    - Tokens must be httpOnly cookies or secure headers
  files:
    - src/auth/password.ts
    - src/auth/jwt.ts
    - src/auth/middleware.ts
    - src/routes/auth.ts
    - src/pages/login.tsx
    - src/pages/signup.tsx

test_output:
  framework: pytest
  path: tests/generated/

verification_spec:
  # --- Deterministic ---
  - type: command
    run: "npm test -- --grep auth"
    expect: exit_code 0

  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt"

  - type: file-not-contains
    path: src/
    pattern: "password.*=.*plaintext"

  # --- Agent: API ---
  - type: agent
    name: api-login
    prompt: |
      Test login API:
      1. POST /api/auth/signup with new user
      2. Verify 201 response
      3. POST /api/auth/login with same creds
      4. Verify 200 with JWT token
      5. POST /api/auth/login with wrong password
      6. Verify 401 with helpful message
    evidence:
      - response: status 201
      - response: status 200
      - response: has "token"
      - response: status 401
    generate_test: true

  # --- Agent: UI ---
  - type: agent
    name: ui-login
    prompt: |
      Test login UI:
      1. Go to /login
      2. Verify form has email + password fields
      3. Submit with valid credentials
      4. Verify redirect to /dashboard
      5. Verify welcome message visible
    evidence:
      - screenshot: login-page
      - screenshot: after-login
      - url: contains "/dashboard"
      - element: '[data-testid="welcome"]'
    generate_test: true

  # --- Agent: Security ---
  - type: agent
    name: password-security
    prompt: |
      Verify password security:
      1. Read src/auth/password.ts
      2. Confirm bcrypt with cost >= 12
      3. Confirm no password logging
      4. Check signup doesn't echo password
    evidence:
      - text: "bcrypt"
      - text: "cost" or "rounds"
    generate_test: false  # Code review, not repeatable test
```
## Check Types

### Deterministic

```yaml
- type: command
  run: "shell command"
  expect: exit_code 0

- type: command
  run: "curl localhost:3000/health"
  expect:
    contains: '"ok"'

- type: file-exists
  path: src/file.ts

- type: file-contains
  path: src/file.ts
  pattern: "regex pattern"

- type: file-not-contains
  path: src/file.ts
  pattern: "bad pattern"
```

### Agent

```yaml
- type: agent
  name: descriptive-name  # Used for evidence/test naming
  prompt: |
    Step-by-step verification
  evidence:
    - screenshot: step-name
    - url: contains "pattern"
    - element: "css-selector"
    - text: "expected text"
    - response: status 200
    - response: has "field"
  generate_test: true | false
```
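The deterministic check types map directly onto small operations. A minimal dispatcher sketch, assuming the check dicts follow the YAML shapes above (function name is illustrative, not the verifier's actual code):

```python
# Sketch: executing the deterministic check types from a parsed spec.
import re
import subprocess
from pathlib import Path

def run_check(check: dict) -> bool:
    kind = check["type"]
    if kind == "command":
        result = subprocess.run(
            check["run"], shell=True, capture_output=True, text=True
        )
        expect = check.get("expect", "exit_code 0")
        if isinstance(expect, dict):          # e.g. {contains: '"ok"'}
            return expect["contains"] in result.stdout
        return result.returncode == 0         # "exit_code 0"
    if kind == "file-exists":
        return Path(check["path"]).exists()
    # file-contains / file-not-contains: regex search over file text
    text = Path(check["path"]).read_text()
    found = re.search(check["pattern"], text) is not None
    return found if kind == "file-contains" else not found

print(run_check({"type": "command", "run": "echo ok", "expect": "exit_code 0"}))
```

Because these checks are pure pass/fail, they run first and fast; agent checks only spend tokens once the cheap signals are green.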
## Best Practices

### Building Spec

- **Be specific** — "bcrypt with cost 12" not "secure passwords"
- **List files** — helps builder know where to put code
- **State constraints** — what NOT to do matters

### Verification Spec

- **Deterministic first** — fast, reliable checks
- **Agent for semantics** — UI flows, code quality, error messages
- **Evidence always** — no claim without proof
- **generate_test for repeatables** — UI flows yes, code review no

### Naming

- `name: feature-name` — lowercase, hyphens
- `name: api-login` — for agent checks, descriptive
## What Happens Next

After I create the spec:

```
/eval build auth
```

1. Builder agent reads `building_spec`, implements
2. Verifier agent reads `verification_spec`, checks
3. If fail → builder gets feedback → fixes → verifier re-checks
4. Loop until pass
5. Agent checks become tests in `tests/generated/`

Resulting layout:

```
.claude/
├── evals/
│   ├── auth.yaml                  # Eval spec (I create this)
│   └── .evidence/
│       └── auth/
│           ├── ui-login-001.png
│           ├── ui-login-002.png
│           └── evidence.json      # Structured evidence
tests/
└── generated/
    ├── test_auth_api_login.py     # Verifier generates
    └── test_auth_ui_login.py      # Verifier generates
```
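The build → verify loop described above can be sketched as plain control flow. The `build` and `verify` functions here are stubs standing in for the two agent invocations, not real APIs:

```python
# Sketch: the build -> verify -> feedback loop with stubbed agents.
def build(building_spec: dict, feedback: list) -> None:
    """Placeholder for the builder agent implementing building_spec."""

def verify(verification_spec: list) -> list:
    """Placeholder for the verifier agent; returns failed-check names."""
    return []

def eval_loop(spec: dict, max_rounds: int = 5) -> bool:
    feedback: list = []
    for _ in range(max_rounds):
        build(spec.get("building_spec", {}), feedback)
        feedback = verify(spec.get("verification_spec", []))
        if not feedback:   # all checks passed
            return True
    return False           # gave up after max_rounds

print(eval_loop({"building_spec": {}, "verification_spec": []}))  # True
```

A bounded `max_rounds` matters in practice: without it, a spec the implementation can never satisfy would loop forever.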