This commit is contained in:
Harivansh Rathi 2026-01-14 00:11:59 -08:00
parent aca2126c88
commit 7c63331389
5 changed files with 520 additions and 664 deletions

README.md

@@ -1,293 +1,187 @@
# eval-skill
Verification-first development for Claude Code. Define what success looks like, then let Claude build and verify.
## Why
> *"How will the agent know it did the right thing?"*
> — [Thorsten Ball](https://x.com/thorstenball)
Without a feedback loop, Claude implements and hopes. With one, Claude implements, checks, and iterates until it's right.
## How It Works
```
You: "Build auth with email/password"
                  ↓
┌─────────────────────────────────────┐
│  Skill: eval                        │
│  Generates:                         │
│  • verification spec (tests)        │
│  • building spec (what to build)    │
└─────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────┐
│  Agent: builder                     │
│  Implements from building spec      │
│  Clean context, focused on code     │
└─────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────┐
│  Agent: verifier                    │
│  Runs checks, collects evidence     │
│  Returns pass/fail                  │
└─────────────────────────────────────┘
                  ↓
Pass? Done.
Fail? → Builder fixes → Verifier checks → Loop
```
Each agent has isolated context. Builder doesn't hold verification logic. Verifier doesn't hold implementation details. Clean, focused, efficient.
## Install
```bash
git clone https://github.com/yourusername/eval-skill.git
cd eval-skill
./install.sh # Current project
./install.sh --global # All projects
```
## Usage
### Step 1: Create Specs
```
Create evals for user authentication with email/password
```
Creates `.claude/evals/auth.yaml`:
```yaml
name: auth
description: Email/password authentication

building_spec:
  description: Email/password auth with login/signup
  requirements:
    - Password hashing with bcrypt
    - JWT tokens on login
    - /login and /signup endpoints

verification_spec:
  - type: command
    run: "npm test -- --grep auth"
    expect: exit_code 0
  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt"
  - type: agent
    name: login-flow
    prompt: |
      1. POST /api/login with valid creds
      2. Verify JWT in response
      3. POST with wrong password
      4. Verify 401 + helpful error
    generate_test: true
```
### Step 2: Build
```
/eval build auth
```
Spawns builder agent → implements → spawns verifier → checks → iterates until pass.
### Step 3: Run Generated Tests (Forever)
```bash
pytest tests/generated/
```
Agent checks become deterministic tests. First run costs tokens. Future runs are free.
## Commands
| Command | What it does |
|---------|--------------|
| `/eval list` | List all evals |
| `/eval show <name>` | Display spec |
| `/eval build <name>` | Build + verify loop |
| `/eval verify <name>` | Just verify, no build |
## Why Context Isolation Matters
**Without isolation:**
```
Main Claude context:
- All verification logic
- All implementation code
- All error history
- Context bloat → degraded performance
```
**With isolation:**
```
Builder context: building spec + current failure only
Verifier context: verification spec + current code only
Main Claude: just orchestration
```
Each agent gets exactly what it needs. Nothing more.
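To make the isolation concrete, here is a sketch of what each subagent might receive; the field names are hypothetical, not an actual Task tool schema:

```python
# Hypothetical subagent payloads -- this only illustrates the isolation boundary.
builder_context = {
    "spec": ".claude/evals/auth.yaml",  # builder reads building_spec from this
    "task": "implement",
}

verifier_context = {
    "spec": ".claude/evals/auth.yaml",  # verifier reads verification_spec only
}

retry_context = {
    "spec": ".claude/evals/auth.yaml",
    "task": "fix",
    # Only the current failure travels back -- no full error history
    "failures": ["api-login: expected 401, got 500"],
}
```

Neither the builder nor the verifier ever sees the other agent's transcript, and the main conversation holds none of it.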
## Check Types
**Deterministic** (fast, no agent):
```yaml
- type: command
  run: "npm test"
  expect: exit_code 0

- type: file-contains
  path: src/auth.ts
  pattern: "bcrypt"
```
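A deterministic check is just a mechanical operation. A minimal sketch of a runner for the two check types above (hypothetical helper, not the shipped implementation):

```python
import re
import subprocess
from pathlib import Path

def run_check(check: dict) -> bool:
    """Evaluate one deterministic check from an eval spec."""
    if check["type"] == "command":
        # "expect: exit_code 0" -> compare against the process's return code
        result = subprocess.run(check["run"], shell=True, capture_output=True)
        return result.returncode == int(check["expect"].split()[-1])
    if check["type"] == "file-contains":
        text = Path(check["path"]).read_text()
        return re.search(check["pattern"], text) is not None
    raise ValueError(f"unknown check type: {check['type']}")

# Example: a command check that should pass
print(run_check({"type": "command", "run": "echo ok", "expect": "exit_code 0"}))  # True
```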
**Agent** (semantic, generates tests):
```yaml
- type: agent
  name: ui-login
  prompt: "Navigate to /login, submit form, verify redirect"
  evidence:
    - screenshot: after-login
    - url: contains "/dashboard"
  generate_test: true
```
Agent checks produce evidence (screenshots, responses) and become executable tests.
## Directory Structure
```
.claude/
├── skills/eval/              # Generates specs
├── agents/
│   ├── eval-builder.md
│   └── eval-verifier.md
├── commands/eval.md
└── evals/
    ├── auth.yaml
    └── .evidence/            # Screenshots, logs

tests/generated/              # Tests from agent checks
```
## Requirements
- Claude Code
- For UI testing: `npm install -g @anthropic/agent-browser`
## Philosophy
**TDD for Agents:**
| Traditional TDD | Agent TDD |
|----------------|-----------|
| Write tests | Write evals |
| Write code | Claude writes code |
| Tests pass | Claude verifies + generates tests |
**Why generate tests?**
Agent verification is expensive (tokens, time). But once verified, we encode that verification as a test. Future runs use the test — no agent needed.
**Mix deterministic and semantic:**
- Deterministic: "tests pass", "file exists", "command succeeds"
- Semantic: "UI looks right", "error is helpful", "code is readable"
Use deterministic where possible, semantic where necessary.
## License
MIT

agents/eval-builder.md (new file)

@@ -0,0 +1,111 @@
---
name: eval-builder
description: Implementation agent that builds features from building specs. Use when running /eval build.
tools: Read, Write, Edit, Bash, Grep, Glob
model: sonnet
permissionMode: acceptEdits
---
# Eval Builder Agent
I implement features based on building specs. I don't verify — that's the verifier's job.
## My Responsibilities
1. Read the building spec from eval YAML
2. Implement the requirements
3. Write clean, working code
4. Report what I built
## What I Do NOT Do
- Run verification checks (verifier does this)
- Collect evidence (verifier does this)
- Generate tests (verifier does this)
- Decide if my work is correct (verifier does this)
## Input
I receive:
1. **Eval spec path**: `.claude/evals/<name>.yaml`
2. **Failure context** (if retrying): What failed and why
## Process
### First Run
1. Read the eval spec
2. Extract `building_spec` section
3. Understand requirements
4. Implement the feature
5. Report files created/modified
### Retry (After Failure)
1. Read failure feedback from verifier
2. Understand what went wrong
3. Fix the specific issue
4. Report what I changed
## Building Spec Format
```yaml
building_spec:
description: What to build (high-level)
requirements:
- Specific requirement 1
- Specific requirement 2
constraints:
- Must use library X
- Must follow pattern Y
files:
- src/auth/login.ts
- src/auth/password.ts
```
## Output Format
```
📦 Implementation Complete
═══════════════════════════════════════
Files Created:
+ src/auth/login.ts
+ src/auth/password.ts
+ src/auth/types.ts
Files Modified:
~ src/routes/index.ts (added auth routes)
Summary:
Implemented email/password auth with bcrypt hashing
and JWT token generation on login.
Ready for verification.
```
## On Retry
```
🔧 Fixing: error-handling check failed
═══════════════════════════════════════
Issue: Error messages not helpful
Expected: "Invalid email or password"
Actual: "Error 401"
Fix Applied:
~ src/auth/login.ts
- Changed generic error to descriptive message
- Added error codes for client handling
Ready for re-verification.
```
## Guidelines
1. **Read the spec carefully** — understand before coding
2. **Follow requirements exactly** — don't add unrequested features
3. **Write clean code** — the codebase standards apply
4. **Be minimal on retry** — fix only what failed, don't refactor
5. **Report clearly** — say what you did so verifier knows what to check

commands/eval.md

@@ -1,169 +1,162 @@
---
description: Eval commands - list, show, build, verify
argument-hint: list | show <name> | build <name> | verify <name>
allowed-tools: Read, Bash, Task
---
# /eval Command
Interface for the eval system. I dispatch to the right action.
## Commands
### /eval list
List all evals:
```
Available evals:
  auth       Email/password authentication
  checkout   E-commerce checkout flow
```
### /eval show <name>
Display the full eval spec.
### /eval build <name>
**The main command.** Orchestrates build → verify → fix loop.
```
/eval build auth
```
Flow:
1. Spawn **eval-builder** with building_spec
2. Builder implements, returns
3. Spawn **eval-verifier** with verification_spec
4. Verifier checks, returns pass/fail
5. If fail → spawn builder with failure context → goto 3
6. If pass → done
Output:
```
🔨 Building: auth
═══════════════════════════════════════
[Builder] Implementing...
+ src/auth/password.ts
+ src/auth/jwt.ts
+ src/routes/auth.ts
[Verifier] Checking...
✅ command: npm test (exit 0)
✅ file-contains: bcrypt
❌ api-login: Wrong status code
Expected: 401 on bad password
Actual: 500
[Builder] Fixing api-login...
~ src/routes/auth.ts
[Verifier] Re-checking...
✅ command: npm test (exit 0)
✅ file-contains: bcrypt
✅ api-login: Correct responses
📄 Test: tests/generated/test_auth_api_login.py
═══════════════════════════════════════
📊 Build complete: 3/3 checks passed
Iterations: 2
Tests generated: 1
```
### /eval verify <name>
Just verify, don't build. For checking existing code.
```
/eval verify auth
```
Spawns verifier only. Reports pass/fail with evidence.
### /eval verify
Run all evals:
```
/eval verify
```
### /eval evidence <name>
Show collected evidence:
```
Evidence: auth
- api-login-001.png
- ui-login-001.png
- evidence.json
```
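The `evidence.json` record might have a shape like this (a sketch; the exact schema is the verifier's choice):

```python
import json

# Hypothetical record at .claude/evals/.evidence/auth/evidence.json
evidence = {
    "eval": "auth",
    "checks": [
        {
            "name": "api-login",
            "pass": True,
            "evidence": [
                {"type": "response", "expected": "status 401", "actual": "401"},
                {"type": "screenshot", "path": "api-login-001.png"},
            ],
        }
    ],
}

serialized = json.dumps(evidence, indent=2)
```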
### /eval tests
List generated tests:
```
Generated tests:
tests/generated/test_auth_api_login.py
tests/generated/test_auth_ui_login.py
```
### /eval clean
Remove evidence and generated tests.
```bash
rm -rf .claude/evals/.evidence/
rm -rf tests/generated/
```
## Orchestration Logic
For `/eval build`:
```python
max_iterations = 5
iteration = 0

# Initial build
builder_result = spawn_agent("eval-builder", {
    "spec": f".claude/evals/{name}.yaml",
    "task": "implement"
})

while iteration < max_iterations:
    # Verify
    verifier_result = spawn_agent("eval-verifier", {
        "spec": f".claude/evals/{name}.yaml"
    })

    if verifier_result.all_passed:
        return success(verifier_result)

    # Fix failures
    builder_result = spawn_agent("eval-builder", {
        "spec": f".claude/evals/{name}.yaml",
        "task": "fix",
        "failures": verifier_result.failures
    })
    iteration += 1

return failure("Max iterations reached")
```
## Context Flow
```
Main Claude
├─→ Builder  (context: building_spec only)
│     └─→ Returns: files created
├─→ Verifier (context: verification_spec only)
│     └─→ Returns: pass/fail + evidence
└─→ Builder  (context: building_spec + failure only)
      └─→ Returns: files fixed
```
Each agent gets minimal, focused context. No bloat.

install.sh

@@ -1,154 +1,48 @@
#!/bin/bash
set -euo pipefail
echo "eval-skill installer"
echo "===================="

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Parse args
TARGET_DIR=".claude"
while [[ $# -gt 0 ]]; do
  case $1 in
    --global|-g) TARGET_DIR="$HOME/.claude"; shift ;;
    --help|-h)
      echo "Usage: ./install.sh [--global]"
      echo "  --global, -g  Install to ~/.claude (all projects)"
      echo "  Default: ./.claude (current project)"
      exit 0 ;;
    *) echo "Unknown: $1"; exit 1 ;;
  esac
done

echo "Installing to: $TARGET_DIR"

# Create dirs
mkdir -p "$TARGET_DIR/skills/eval"
mkdir -p "$TARGET_DIR/commands"
mkdir -p "$TARGET_DIR/agents"
mkdir -p "$TARGET_DIR/evals"

# Install files
cp "$SCRIPT_DIR/skills/eval/SKILL.md" "$TARGET_DIR/skills/eval/SKILL.md"
cp "$SCRIPT_DIR/agents/eval-builder.md" "$TARGET_DIR/agents/eval-builder.md"
cp "$SCRIPT_DIR/agents/eval-verifier.md" "$TARGET_DIR/agents/eval-verifier.md"
cp "$SCRIPT_DIR/commands/eval.md" "$TARGET_DIR/commands/eval.md"

echo "✓ Installed"
echo ""
echo "Checking optional dependencies..."
if command -v agent-browser &> /dev/null; then
  echo "  ✅ agent-browser installed"
else
  echo "  ⚠️  agent-browser not found (needed for UI testing)"
  echo "      npm install -g @anthropic/agent-browser"
fi

echo ""
echo "Components:"
echo "  Skill:    $TARGET_DIR/skills/eval/"
echo "  Builder:  $TARGET_DIR/agents/eval-builder.md"
echo "  Verifier: $TARGET_DIR/agents/eval-verifier.md"
echo "  Command:  $TARGET_DIR/commands/eval.md"
echo "  Evals:    $TARGET_DIR/evals/"
echo ""
echo "  Create evals:  'Create evals for [feature]'"
echo "  Build+verify:  /eval build <name>"
echo "  Verify only:   /eval verify <name>"

skills/eval/SKILL.md

@@ -1,280 +1,244 @@
---
name: eval
description: Generate evaluation specs with building and verification criteria. Use when setting up features, defining acceptance criteria, or before implementing anything significant. Triggers on "create evals", "set up verification", "define acceptance criteria", or "build [feature]".
allowed-tools: Read, Grep, Glob, Write, Edit
---
# Eval Skill
Generate specs that define **what to build** and **how to verify it**.
## Output

I create `.claude/evals/<name>.yaml` with two sections:

1. **building_spec** — What the builder agent implements
2. **verification_spec** — What the verifier agent checks
## Format
```yaml
name: feature-name
description: One-line summary

building_spec:
  description: What to build
  requirements:
    - Requirement 1
    - Requirement 2
  constraints:
    - Constraint 1
  files:
    - suggested/file/paths.ts

test_output:
  framework: pytest | vitest | jest
  path: tests/generated/

verification_spec:
  # Deterministic checks
  - type: command
    run: "npm test"
    expect: exit_code 0
  - type: file-exists
    path: src/auth/login.ts
  - type: file-contains
    path: src/auth/login.ts
    pattern: "export function login"
  - type: file-not-contains
    path: src/config.ts
    pattern: "API_KEY=sk-"

  # Agent checks
  - type: agent
    name: check-name
    prompt: |
      What to verify
    evidence:
      - screenshot: after-login
      - url: contains "/dashboard"
      - element: '[data-testid="welcome"]'
    generate_test: true
```
## Workflow
### User Request

```
Create evals for user authentication
```

### My Questions

Before generating, I ask:

1. What auth method? (email/password, OAuth, magic link?)
2. UI, API, or both?
3. Specific security requirements?

### My Output
`.claude/evals/auth.yaml`:
```yaml
name: auth
description: Email/password authentication with UI and API

building_spec:
  description: |
    User authentication system with email/password.
    Secure password storage, JWT tokens, login/signup flows.
  requirements:
    - Password hashing with bcrypt (cost factor 12+)
    - JWT tokens with 24h expiry
    - POST /api/auth/login endpoint
    - POST /api/auth/signup endpoint
    - Login page at /login
    - Signup page at /signup
    - Protected route middleware
  constraints:
    - No plaintext passwords anywhere
    - Tokens must be httpOnly cookies or secure headers
  files:
    - src/auth/password.ts
    - src/auth/jwt.ts
    - src/auth/middleware.ts
    - src/routes/auth.ts
    - src/pages/login.tsx
    - src/pages/signup.tsx

test_output:
  framework: pytest
  path: tests/generated/

verification_spec:
  # --- Deterministic ---
  - type: command
    run: "npm test -- --grep auth"
    expect: exit_code 0
  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt"
  - type: file-not-contains
    path: src/
    pattern: "password.*=.*plaintext"

  # --- Agent: API ---
  - type: agent
    name: api-login
    prompt: |
      1. POST /api/auth/signup with new user
      2. Verify 201 response
      3. POST /api/auth/login with same creds
      4. Verify 200 with JWT token
      5. POST /api/auth/login with wrong password
      6. Verify 401 with helpful message
    evidence:
      - response: status 201
      - response: status 200
      - response: has "token"
      - response: status 401
    generate_test: true

  # --- Agent: UI ---
  - type: agent
    name: ui-login
    prompt: |
      1. Go to /login
      2. Verify form has email + password fields
      3. Submit with valid credentials
      4. Verify redirect to /dashboard
      5. Verify welcome message visible
    evidence:
      - screenshot: after-login
      - url: contains "/dashboard"
      - element: '[data-testid="welcome"]'
    generate_test: true

  # --- Agent: Security ---
  - type: agent
    name: password-security
    prompt: |
      Verify password security:
      1. Read src/auth/password.ts
      2. Confirm bcrypt with cost >= 12
      3. Confirm no password logging
      4. Check signup doesn't echo password
    evidence:
      - text: "bcrypt"
      - text: "cost" or "rounds"
    generate_test: false  # Code review, not repeatable test
```
## Check Types
### Deterministic
```yaml
- type: command
  run: "shell command"
  expect: exit_code 0

- type: command
  run: "curl localhost:3000/health"
  expect:
    contains: '"ok"'

- type: file-exists
  path: src/file.ts

- type: file-contains
  path: src/file.ts
  pattern: "regex pattern"

- type: file-not-contains
  path: src/file.ts
  pattern: "bad pattern"
```
### Agent
```yaml
- type: agent
  name: descriptive-name    # Used for evidence/test naming
  prompt: |
    Step-by-step verification
  evidence:
    - screenshot: step-name
    - url: contains "pattern"
    - element: "css-selector"
    - text: "expected text"
    - response: status 200
    - response: has "field"
  generate_test: true | false
```
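A spec written this way can also be sanity-checked mechanically before any agent runs. As a sketch (hypothetical helper, not part of the skill):

```python
def lint_agent_check(check: dict) -> list:
    """Flag common gaps in an agent check before spending verifier tokens on it."""
    problems = []
    if "prompt" not in check:
        problems.append("missing prompt")
    if not check.get("evidence"):
        problems.append("no evidence declared -- the verifier cannot prove a pass")
    if "generate_test" not in check:
        problems.append("generate_test unset -- decide whether this check is repeatable")
    return problems

# An under-specified check gets flagged before it wastes a verification run
print(lint_agent_check({"type": "agent", "name": "ui-login", "prompt": "Verify login"}))
```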
## Best Practices
### Building Spec

- **Be specific** — "bcrypt with cost 12" not "secure passwords"
- **List files** — helps builder know where to put code
- **State constraints** — what NOT to do matters

### Verification Spec

- **Deterministic first** — fast, reliable checks
- **Agent for semantics** — UI flows, code quality, error messages
- **Evidence always** — no claim without proof
- **generate_test for repeatables** — UI flows yes, code review no

### Naming
- `name: feature-name` — lowercase, hyphens
- `name: api-login` — for agent checks, descriptive
## What Happens Next

After I create the spec:
```
/eval build auth
```
1. Builder agent reads `building_spec`, implements
2. Verifier agent reads `verification_spec`, checks
3. If fail → builder gets feedback → fixes → verifier re-checks
4. Loop until pass
5. Agent checks become tests in `tests/generated/`