From aca2126c88cf52b8e32aee595046c36cf92ee961 Mon Sep 17 00:00:00 2001
From: Harivansh Rathi
Date: Wed, 14 Jan 2026 00:07:28 -0800
Subject: [PATCH] init

---
 .gitignore              |   3 +
 README.md               | 293 +++++++++++++++++++++++++++++++++++
 agents/eval-verifier.md | 334 ++++++++++++++++++++++++++++++++++++++++
 commands/eval.md        | 169 ++++++++++++++++++++
 install.sh              | 154 ++++++++++++++++++
 skills/eval/SKILL.md    | 280 +++++++++++++++++++++++++++++++++
 6 files changed, 1233 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 README.md
 create mode 100644 agents/eval-verifier.md
 create mode 100644 commands/eval.md
 create mode 100755 install.sh
 create mode 100644 skills/eval/SKILL.md

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..aafcb34
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,3 @@
+node_modules/
+.DS_Store
+*.log
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..3ec2ce0
--- /dev/null
+++ b/README.md
@@ -0,0 +1,293 @@
+# eval-skill
+
+Give Claude a verification loop. Define acceptance criteria before implementation, then let Claude check its own work.
+
+## The Problem
+
+> *"How will the agent know it did the right thing?"*
+> — [Thorsten Ball](https://x.com/thorstenball)
+
+Without verification, Claude implements and hopes. With verification, Claude implements and **knows**.
+
+## The Solution
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ 1. SKILL: eval                                              │
+│    "Create evals for auth"                                  │
+│    → Generates .claude/evals/auth.yaml                      │
+└─────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 2. AGENT: eval-verifier                                     │
+│    "/eval verify auth"                                      │
+│    → Runs checks                                            │
+│    → Collects evidence (screenshots, outputs)               │
+│    → Generates executable tests                             │
+│    → Reports pass/fail                                      │
+└─────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────┐
+│ 3. OUTPUT                                                   │
+│    .claude/evals/.evidence/auth/   ← Screenshots, logs      │
+│    tests/generated/test_auth.py    ← Executable tests       │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Install
+
+```bash
+git clone https://github.com/yourusername/eval-skill.git
+cd eval-skill
+
+# Install to current project
+./install.sh
+
+# Or install globally (all projects)
+./install.sh --global
+```
+
+## Usage
+
+### 1. Create Evals (Before Implementation)
+
+```
+> Create evals for user authentication
+```
+
+Claude generates `.claude/evals/auth.yaml`:
+
+```yaml
+name: auth
+description: Email/password authentication
+
+test_output:
+  framework: pytest
+  path: tests/generated/
+
+verify:
+  # Deterministic
+  - type: command
+    run: "npm test -- --grep 'auth'"
+    expect: exit_code 0
+
+  - type: file-contains
+    path: src/auth/password.ts
+    pattern: "bcrypt|argon2"
+
+  # Agent-based (with evidence + test generation)
+  - type: agent
+    name: ui-login
+    prompt: |
+      1. Go to /login
+      2. Enter test@example.com / password123
+      3. Submit
+      4. Verify redirect to /dashboard
+    evidence:
+      - screenshot: after-login
+      - url: contains "/dashboard"
+    generate_test: true
+```
+
+### 2. Implement
+
+```
+> Implement auth based on .claude/evals/auth.yaml
+```
+
+### 3. Verify
+
+```
+> /eval verify auth
+```
+
+Output:
+
+```
+🔍 Eval: auth
+═══════════════════════════════════════
+
+Deterministic:
+  ✅ command: npm test (exit 0)
+  ✅ file-contains: bcrypt in password.ts
+
+Agent:
+  ✅ ui-login: Dashboard redirect works
+     📸 Evidence: 2 screenshots saved
+     📄 Test: tests/generated/test_auth_ui_login.py
+
+═══════════════════════════════════════
+📊 Results: 3/3 passed
+```
+
+### 4. Run Generated Tests (Forever)
+
+```bash
+pytest tests/generated/
+```
+
+The agent converted its semantic verification into deterministic tests.
+
+## How It Works
+
+### Non-Deterministic → Deterministic
+
+Agent checks are semantic: "verify login works." But we need proof.
+
+1. **Verifier runs the check** (browser automation, API calls, file inspection)
+2. **Collects evidence** (screenshots, responses, DOM snapshots)
+3. **Generates executable test** (pytest/vitest)
+4. **Future runs use the test** (no agent needed)
+
+```
+Agent Check (expensive) → Evidence (proof) → Test (cheap, repeatable)
+        ↓                       ↓                     ↓
+  "Login works"       screenshot + url check   pytest + playwright
+```
+
+### Evidence-Based Verification
+
+The verifier can't just say "pass." It must provide evidence:
+
+```yaml
+- type: agent
+  name: login-flow
+  prompt: "Verify login redirects to dashboard"
+  evidence:
+    - screenshot: login-page
+    - screenshot: after-submit
+    - url: contains "/dashboard"
+    - element: '[data-testid="welcome"]'
+```
+
+Evidence is saved to `.claude/evals/.evidence/<name>/`:
+
+```json
+{
+  "eval": "auth",
+  "checks": [{
+    "name": "login-flow",
+    "pass": true,
+    "evidence": [
+      {"type": "screenshot", "path": "login-page.png"},
+      {"type": "screenshot", "path": "after-submit.png"},
+      {"type": "url", "expected": "contains /dashboard", "actual": "http://localhost:3000/dashboard"},
+      {"type": "element", "selector": "[data-testid=welcome]", "found": true}
+    ]
+  }]
+}
+```
+
+## Check Types
+
+### Deterministic (Fast, No Agent)
+
+```yaml
+# Command + exit code
+- type: command
+  run: "pytest tests/"
+  expect: exit_code 0
+
+# Command + output
+- type: command
+  run: "curl localhost:3000/health"
+  expect:
+    contains: '"status":"ok"'
+
+# File exists
+- type: file-exists
+  path: src/feature.ts
+
+# File contains pattern
+- type: file-contains
+  path: src/auth.ts
+  pattern: "bcrypt"
+
+# File does NOT contain
+- type: file-not-contains
+  path: .env
+  pattern: "sk-"
+```
+
+### Agent (Semantic, Evidence-Based)
+
+```yaml
+- type: agent
+  name: descriptive-name
+  prompt: |
+    Step-by-step verification instructions
+  evidence:
+    - screenshot: step-name
+    - url: contains "pattern"
+    - element: "css-selector"
+    - text: "expected text"
+    - response: status 200
+  generate_test: true  # Write executable test
+```
+
+## Commands
+
+| Command | Description |
+|---------|-------------|
+| `/eval list` | List all evals |
+| `/eval show <name>` | Display eval spec |
+| `/eval verify <name>` | Run verification |
+| `/eval verify` | Run all evals |
+| `/eval evidence <name>` | Show collected evidence |
+| `/eval tests` | List generated tests |
+| `/eval clean` | Remove evidence + generated tests |
+
+## Directory Structure
+
+```
+.claude/
+├── skills/eval/SKILL.md      # Eval generation skill
+├── agents/eval-verifier.md   # Verification agent
+├── commands/eval.md          # /eval command
+└── evals/
+    ├── auth.yaml             # Your eval specs
+    ├── checkout.yaml
+    └── .evidence/
+        ├── auth/
+        │   ├── evidence.json
+        │   └── *.png
+        └── checkout/
+            └── ...
+
+tests/
+└── generated/                # Tests written by verifier
+    ├── test_auth_ui_login.py
+    └── test_auth_api_login.py
+```
+
+## Requirements
+
+- Claude Code with skills/agents/commands support
+- For UI testing: `npm install -g @anthropic/agent-browser`
+
+## Philosophy
+
+**TDD for Agents:**
+
+| Traditional TDD | Agent TDD |
+|----------------|-----------|
+| Write tests | Write evals |
+| Write code | Claude writes code |
+| Tests pass | Claude verifies + generates tests |
+
+**Why generate tests?**
+
+Agent verification is expensive (tokens, time). But once verified, we encode that verification as a test. Future runs use the test — no agent needed.
+
+**Mix deterministic and semantic:**
+
+- Deterministic: "tests pass", "file exists", "command succeeds"
+- Semantic: "UI looks right", "error is helpful", "code is readable"
+
+Use deterministic where possible, semantic where necessary.
+
+## License
+
+MIT
diff --git a/agents/eval-verifier.md b/agents/eval-verifier.md
new file mode 100644
index 0000000..5a1ee56
--- /dev/null
+++ b/agents/eval-verifier.md
@@ -0,0 +1,334 @@
+---
+name: eval-verifier
+description: Verification agent that runs eval checks, collects evidence, and generates tests. Use when running /eval verify.
+tools: Read, Grep, Glob, Bash, Write, Edit
+model: sonnet
+permissionMode: acceptEdits
+---
+
+# Eval Verifier Agent
+
+I run verification checks from eval specs, collect evidence, and generate executable tests.
+
+## My Responsibilities
+
+1. Read eval spec YAML
+2. Run each check in order
+3. Collect evidence for agent checks
+4. Generate test files when `generate_test: true`
+5. Report pass/fail with evidence
+
+## What I Do NOT Do
+
+- Create or modify eval specs (that's the eval skill)
+- Skip checks or take shortcuts
+- Claim pass without evidence
+
+## Verification Process
+
+```
+Read spec → Run checks → Collect evidence → Generate tests → Report
+```
+
+### Step 1: Parse Eval Spec
+
+Read `.claude/evals/<name>.yaml` and extract:
+- `name`: Eval name
+- `test_output`: Where to write generated tests
+- `verify`: List of checks
+
+### Step 2: Run Deterministic Checks
+
+For `type: command`, `file-exists`, `file-contains`, `file-not-contains`:
+
+```bash
+# command
+result=$(eval "$run_command")
+exit_code=$?
+# Compare against expect
+
+# file-exists
+test -f "$path"
+
+# file-contains
+grep -q "$pattern" "$path"
+
+# file-not-contains
+! grep -q "$pattern" "$path"
+```
+
+### Step 3: Run Agent Checks
+
+For `type: agent`:
+
+1. **Read the prompt** carefully
+2. **Execute steps** using available tools
+3. **Collect evidence** as specified
+4. **Determine pass/fail** based on evidence
+5. **Generate test** if `generate_test: true`
+
+## Evidence Collection
+
+Evidence goes in `.claude/evals/.evidence/<name>/`
+
+### Screenshots
+
+```bash
+agent-browser screenshot --name "step-name"
+# Saved to .claude/evals/.evidence/<name>/step-name.png
+```
+
+### URL Checks
+
+```bash
+url=$(agent-browser url)
+# Verify: contains "/dashboard"
+```
+
+### Element Checks
+
+```bash
+agent-browser snapshot
+# Parse snapshot for selector
+```
+
+### HTTP Response
+
+```bash
+response=$(curl -s -w "\n%{http_code}" "http://localhost:3000/api/endpoint")
+body=$(echo "$response" | sed '$d')   # all but the last line (portable; GNU head -n -1 also works)
+status=$(echo "$response" | tail -1)
+```
+
+### Evidence Manifest
+
+Write `.claude/evals/.evidence/<name>/evidence.json`:
+
+```json
+{
+  "eval": "auth",
+  "timestamp": "2024-01-15T10:30:00Z",
+  "checks": [
+    {
+      "name": "ui-login",
+      "type": "agent",
+      "pass": true,
+      "evidence": [
+        {"type": "screenshot", "path": "ui-login-001.png", "step": "login-page"},
+        {"type": "screenshot", "path": "ui-login-002.png", "step": "after-submit"},
+        {"type": "url", "expected": "contains /dashboard", "actual": "http://localhost:3000/dashboard"},
+        {"type": "element", "selector": "[data-testid=welcome]", "found": true}
+      ]
+    }
+  ]
+}
+```
+
+## Test Generation
+
+When `generate_test: true`, I write an executable test based on my verification steps.
+
+### Determine Framework
+
+From `test_output.framework` in eval spec:
+- `pytest` → Python with playwright
+- `vitest` → TypeScript with playwright
+- `jest` → JavaScript with puppeteer
+
+### Python/Pytest Example
+
+```python
+# tests/generated/test_auth_ui_login.py
+# Generated from: .claude/evals/auth.yaml
+# Check: ui-login
+# Generated: 2024-01-15T10:30:00Z
+
+import pytest
+from playwright.sync_api import sync_playwright, expect
+
+@pytest.fixture
+def browser():
+    with sync_playwright() as p:
+        browser = p.chromium.launch()
+        yield browser
+        browser.close()
+
+def test_ui_login(browser):
+    """
+    Verify login with valid credentials:
+    1. Navigate to /login
+    2. Enter test@example.com / password123
+    3. Submit form
+    4. Verify redirect to /dashboard
+    5. Verify welcome message visible
+    """
+    page = browser.new_page()
+
+    # Step 1: Navigate to /login
+    page.goto("http://localhost:3000/login")
+
+    # Step 2: Enter credentials
+    page.fill('input[type="email"]', "test@example.com")
+    page.fill('input[type="password"]', "password123")
+
+    # Step 3: Submit form
+    page.click('button[type="submit"]')
+
+    # Step 4: Verify redirect to /dashboard
+    page.wait_for_url("**/dashboard")
+    assert "/dashboard" in page.url
+
+    # Step 5: Verify welcome message visible
+    expect(page.locator('[data-testid="welcome"]')).to_be_visible()
+```
+
+### TypeScript/Vitest Example
+
+```typescript
+// tests/generated/auth-ui-login.test.ts
+// Generated from: .claude/evals/auth.yaml
+
+import { test, expect } from '@playwright/test';
+
+test('ui-login: valid credentials redirect to dashboard', async ({ page }) => {
+  await page.goto('http://localhost:3000/login');
+
+  await page.fill('input[type="email"]', 'test@example.com');
+  await page.fill('input[type="password"]', 'password123');
+  await page.click('button[type="submit"]');
+
+  await page.waitForURL('**/dashboard');
+  expect(page.url()).toContain('/dashboard');
+
+  await expect(page.locator('[data-testid="welcome"]')).toBeVisible();
+});
+```
+
+### API Test Example
+
+```python
+# tests/generated/test_auth_api_login.py
+
+import pytest
+import requests
+
+def test_api_login_success():
+    """POST /api/auth/login with valid credentials returns JWT"""
+    response = requests.post(
+        "http://localhost:3000/api/auth/login",
+        json={"email": "test@example.com", "password": "password123"}
+    )
+
+    assert response.status_code == 200
+    data = response.json()
+    assert "token" in data
+
+def test_api_login_wrong_password():
+    """POST /api/auth/login with wrong password returns 401"""
+    response = requests.post(
+        "http://localhost:3000/api/auth/login",
+        json={"email": "test@example.com", "password": "wrongpassword"}
+    )
+
+    assert response.status_code == 401
+    data = response.json()
+    assert "error" in data
+```
+
+## Output Format
+
+### Per-Check Output
+
+```
+✅ [type] name: description
+   Evidence: screenshot saved, url matched, element found
+
+❌ [type] name: description
+   Expected: /dashboard in URL
+   Actual: /login (still on login page)
+   Evidence: screenshot at .claude/evals/.evidence/auth/ui-login-fail.png
+```
+
+### Summary
+
+```
+🔍 Eval: auth
+═══════════════════════════════════════
+
+Deterministic Checks:
+  ✅ command: npm test -- --grep 'auth' (exit 0)
+  ✅ file-contains: src/auth/password.ts has bcrypt
+  ✅ file-not-contains: no plaintext passwords
+
+Agent Checks:
+  ✅ api-login: JWT returned on valid credentials
+     📄 Test generated: tests/generated/test_auth_api_login.py
+  ✅ ui-login: Redirect to dashboard with welcome message
+     📸 Evidence: 2 screenshots saved
+     📄 Test generated: tests/generated/test_auth_ui_login.py
+  ❌ login-errors: Error message not helpful
+     Expected: "Invalid email or password. Please try again."
+     Actual: "Error 401"
+     📸 Evidence: .claude/evals/.evidence/auth/login-errors-001.png
+
+═══════════════════════════════════════
+📊 Results: 5/6 passed
+
+Tests Generated:
+  - tests/generated/test_auth_api_login.py
+  - tests/generated/test_auth_ui_login.py
+
+Evidence:
+  - .claude/evals/.evidence/auth/evidence.json
+  - .claude/evals/.evidence/auth/*.png (4 files)
+
+Next Steps:
+  - Fix error message handling (login-errors check failed)
+  - Run generated tests: pytest tests/generated/
+```
+
+## Browser Commands
+
+Using `agent-browser` CLI:
+
+```bash
+# Navigate
+agent-browser goto "http://localhost:3000/login"
+
+# Fill form
+agent-browser fill "email" "test@example.com"
+agent-browser fill "password" "password123"
+
+# Click
+agent-browser click "Login"
+agent-browser click "button[type=submit]"
+
+# Get current URL
+agent-browser url
+
+# Get page snapshot (accessibility tree)
+agent-browser snapshot
+
+# Screenshot
+agent-browser screenshot
+agent-browser screenshot --name "after-login"
+
+# Check element exists
+agent-browser text "[data-testid=welcome]"
+```
+
+## Error Handling
+
+- **Command fails**: Report failure with stderr, continue other checks
+- **File not found**: Fail the check, note in evidence
+- **Browser not available**: Suggest installation, skip browser checks
+- **Timeout**: Fail with timeout evidence, continue
+- **Always**: Complete all checks, never stop early
+
+## Important Rules
+
+1. **Evidence for every claim** — No "pass" without proof
+2. **Generate tests when asked** — If `generate_test: true`, write the test
+3. **Be thorough** — Run every check in the spec
+4. **Be honest** — If it fails, say so with evidence
+5. **Don't modify source code** — Only verify, never fix
diff --git a/commands/eval.md b/commands/eval.md
new file mode 100644
index 0000000..21b05dd
--- /dev/null
+++ b/commands/eval.md
@@ -0,0 +1,169 @@
+---
+description: Run eval commands - list, show, or verify evals
+argument-hint: list | show | verify [name]
+allowed-tools: Read, Bash, Task
+---
+
+# /eval Command
+
+Interface for the eval system. I dispatch to the right action.
+
+## Commands
+
+### /eval list
+
+List all eval specs:
+
+```bash
+echo "Available evals:"
+echo ""
+for f in .claude/evals/*.yaml; do
+  if [ -f "$f" ]; then
+    name=$(basename "$f" .yaml)
+    desc=$(grep "^description:" "$f" | head -1 | sed 's/description: *//')
+    printf "  %-20s %s\n" "$name" "$desc"
+  fi
+done
+```
+
+If no evals exist:
+```
+No evals found in .claude/evals/
+
+Create evals by asking: "Create evals for [feature]"
+```
+
+### /eval show <name>
+
+Display an eval spec:
+
+```bash
+cat ".claude/evals/$1.yaml"
+```
+
+### /eval verify [name]
+
+Run verification. This spawns the `eval-verifier` subagent.
+
+**With name specified** (`/eval verify auth`):
+
+Delegate to eval-verifier agent:
+```
+Run the eval-verifier agent to verify .claude/evals/auth.yaml
+
+The agent should:
+1. Read the eval spec
+2. Run all checks in the verify list
+3. Collect evidence for agent checks
+4. Generate tests where generate_test: true
+5. Report results with evidence
+```
+
+**Without name** (`/eval verify`):
+
+Run all evals:
+```
+Run the eval-verifier agent to verify all evals in .claude/evals/
+
+For each .yaml file:
+1. Read the eval spec
+2. Run all checks
+3. Collect evidence
+4. Generate tests
+5. Report results
+
+Summarize overall results at the end.
+```
+
+### /eval evidence <name>
+
+Show collected evidence for an eval:
+
+```bash
+echo "Evidence for: $1"
+echo ""
+if [ -f ".claude/evals/.evidence/$1/evidence.json" ]; then
+  cat ".claude/evals/.evidence/$1/evidence.json"
+else
+  echo "No evidence collected yet. Run: /eval verify $1"
+fi
+```
+
+### /eval tests
+
+List generated tests:
+
+```bash
+echo "Generated tests:"
+echo ""
+if [ -d "tests/generated" ]; then
+  ls -la tests/generated/
+else
+  echo "No tests generated yet."
+fi
+```
+
+### /eval clean
+
+Clean evidence and generated tests:
+
+```bash
+rm -rf .claude/evals/.evidence/
+rm -rf tests/generated/
+echo "Cleaned evidence and generated tests."
+```
+
+## Workflow
+
+```
+1. Create eval spec
+   > Create evals for user authentication
+
+2. List evals
+   > /eval list
+
+3. Show specific eval
+   > /eval show auth
+
+4. Run verification
+   > /eval verify auth
+
+5. Check evidence
+   > /eval evidence auth
+
+6. Run generated tests
+   > pytest tests/generated/
+```
+
+## Output Examples
+
+### /eval list
+
+```
+Available evals:
+
+  auth                 Email/password authentication with UI and API
+  todo-api             REST API for todo management
+  checkout             E-commerce checkout flow
+```
+
+### /eval verify auth
+
+```
+🔍 Eval: auth
+═══════════════════════════════════════
+
+Deterministic Checks:
+  ✅ command: npm test -- --grep 'auth' (exit 0)
+  ✅ file-contains: bcrypt in password.ts
+
+Agent Checks:
+  ✅ api-login: JWT returned correctly
+     📄 Test: tests/generated/test_auth_api_login.py
+  ✅ ui-login: Dashboard redirect works
+     📸 Evidence: 2 screenshots
+     📄 Test: tests/generated/test_auth_ui_login.py
+
+═══════════════════════════════════════
+📊 Results: 4/4 passed
+```
diff --git a/install.sh b/install.sh
new file mode 100755
index 0000000..1ddb33a
--- /dev/null
+++ b/install.sh
@@ -0,0 +1,154 @@
+#!/bin/bash
+set -euo pipefail
+
+# Eval Skill Installer
+# Installs the eval system: skill + verifier agent + command
+
+echo "╔══════════════════════════════════════╗"
+echo "║        Eval Skill Installer          ║"
+echo "╚══════════════════════════════════════╝"
+echo ""
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# Parse args
+INSTALL_GLOBAL=false
+TARGET_DIR=".claude"
+
+while [[ $# -gt 0 ]]; do
+  case $1 in
+    --global|-g)
+      INSTALL_GLOBAL=true
+      TARGET_DIR="$HOME/.claude"
+      shift
+      ;;
+    --help|-h)
+      echo "Usage: ./install.sh [OPTIONS]"
+      echo ""
+      echo "Options:"
+      echo "  --global, -g    Install to ~/.claude (all projects)"
+      echo "  --help, -h      Show this help"
+      echo ""
+      echo "Default: Install to ./.claude (current project)"
+      exit 0
+      ;;
+    *)
+      echo "Unknown option: $1"
+      exit 1
+      ;;
+  esac
+done
+
+if [ "$INSTALL_GLOBAL" = true ]; then
+  echo "📍 Installing globally: $TARGET_DIR"
+else
+  echo "📍 Installing to project: $(pwd)/$TARGET_DIR"
+fi
+echo ""
+
+# Create directories
+echo "Creating directories..."
+mkdir -p "$TARGET_DIR/skills/eval"
+mkdir -p "$TARGET_DIR/commands"
+mkdir -p "$TARGET_DIR/agents"
+mkdir -p "$TARGET_DIR/evals"
+
+# Install skill
+echo "Installing eval skill..."
+cp "$SCRIPT_DIR/skills/eval/SKILL.md" "$TARGET_DIR/skills/eval/SKILL.md"
+echo "  ✅ $TARGET_DIR/skills/eval/SKILL.md"
+
+# Install verifier agent
+echo "Installing eval-verifier agent..."
+cp "$SCRIPT_DIR/agents/eval-verifier.md" "$TARGET_DIR/agents/eval-verifier.md"
+echo "  ✅ $TARGET_DIR/agents/eval-verifier.md"
+
+# Install command
+echo "Installing /eval command..."
+cp "$SCRIPT_DIR/commands/eval.md" "$TARGET_DIR/commands/eval.md"
+echo "  ✅ $TARGET_DIR/commands/eval.md"
+
+# Create example eval
+if [ ! -f "$TARGET_DIR/evals/example.yaml" ]; then
+  echo "Creating example eval..."
+  cat > "$TARGET_DIR/evals/example.yaml" << 'EOF'
+name: example
+description: Example eval demonstrating the format
+
+test_output:
+  framework: pytest
+  path: tests/generated/
+
+verify:
+  # === DETERMINISTIC CHECKS ===
+
+  - type: file-exists
+    path: README.md
+
+  - type: command
+    run: "echo 'hello world'"
+    expect: exit_code 0
+
+  # === AGENT CHECKS ===
+
+  - type: agent
+    name: readme-quality
+    prompt: |
+      Read README.md and verify:
+      1. Has a title/heading
+      2. Explains what the project does
+      3. Has installation instructions
+    evidence:
+      - text: "# "
+    generate_test: false  # Subjective, no test
+EOF
+  echo "  ✅ $TARGET_DIR/evals/example.yaml"
+fi
+
+# Check dependencies
+echo ""
+echo "Checking optional dependencies..."
+if command -v agent-browser &> /dev/null; then
+  echo "  ✅ agent-browser installed"
+else
+  echo "  ⚠️  agent-browser not found (needed for UI testing)"
+  echo "     npm install -g @anthropic/agent-browser"
+fi
+
+# Success
+echo ""
+echo "╔══════════════════════════════════════╗"
+echo "║       Installation Complete          ║"
+echo "╚══════════════════════════════════════╝"
+echo ""
+echo "What was installed:"
+echo ""
+echo "  📋 Skill: eval"
+echo "     Generates eval specs (YAML)"
+echo "     Location: $TARGET_DIR/skills/eval/"
+echo ""
+echo "  🤖 Agent: eval-verifier"
+echo "     Runs checks, collects evidence, generates tests"
+echo "     Location: $TARGET_DIR/agents/"
+echo ""
+echo "  ⌨️  Command: /eval"
+echo "     CLI: list | show | verify"
+echo "     Location: $TARGET_DIR/commands/"
+echo ""
+echo "  📁 Evals Directory: $TARGET_DIR/evals/"
+echo "     Your eval specs go here"
+echo ""
+echo "Usage:"
+echo ""
+echo "  1. Create evals:"
+echo "     > Create evals for user authentication"
+echo ""
+echo "  2. List evals:"
+echo "     > /eval list"
+echo ""
+echo "  3. Run verification:"
+echo "     > /eval verify auth"
+echo ""
+echo "  4. Run generated tests:"
+echo "     > pytest tests/generated/"
+echo ""
diff --git a/skills/eval/SKILL.md b/skills/eval/SKILL.md
new file mode 100644
index 0000000..792bf45
--- /dev/null
+++ b/skills/eval/SKILL.md
@@ -0,0 +1,280 @@
+---
+name: eval
+description: Generate evaluation specs for code verification. Use when setting up tests, defining acceptance criteria, or creating verification checkpoints before implementing features. Triggers on "create evals", "define acceptance criteria", "set up verification", or "how will we know this works".
+allowed-tools: Read, Grep, Glob, Write, Edit
+---
+
+# Eval Skill
+
+Generate evaluation specs (YAML) that define what to verify. I do NOT run verification — that's the verifier agent's job.
+
+## My Responsibilities
+
+1. Understand what needs verification
+2. Ask clarifying questions
+3. Generate `.claude/evals/<name>.yaml` specs
+4. Define checks with clear success criteria
+
+## What I Do NOT Do
+
+- Run tests or commands
+- Collect evidence
+- Generate test code
+- Make pass/fail judgments
+
+## Eval Spec Format
+
+```yaml
+name: feature-name
+description: What this eval verifies
+
+# Where generated tests should go
+test_output:
+  framework: pytest  # or vitest, jest
+  path: tests/generated/
+
+verify:
+  # === DETERMINISTIC CHECKS ===
+  # These run as-is, fast and reliable
+
+  - type: command
+    run: "npm test -- --grep 'auth'"
+    expect: exit_code 0
+
+  - type: file-exists
+    path: src/auth/login.ts
+
+  - type: file-contains
+    path: src/auth/login.ts
+    pattern: "export function login"
+
+  - type: file-not-contains
+    path: src/config.ts
+    pattern: "API_KEY=sk-"
+
+  # === AGENT CHECKS ===
+  # Verifier agent runs these, collects evidence, generates tests
+
+  - type: agent
+    name: login-flow  # Used for evidence/test naming
+    prompt: |
+      Verify login with valid credentials:
+      1. Navigate to /login
+      2. Enter test@example.com / password123
+      3. Submit form
+      4. Verify redirect to /dashboard
+      5. Verify welcome message visible
+    evidence:
+      - screenshot: after-login
+      - url: contains "/dashboard"
+      - element: '[data-testid="welcome"]'
+    generate_test: true  # Verifier should write a test for this
+```
+
+## Check Types
+
+### Deterministic (Verifier runs directly)
+
+```yaml
+# Command with exit code
+- type: command
+  run: "pytest tests/auth/"
+  expect: exit_code 0
+
+# Command with output check
+- type: command
+  run: "curl -s localhost:3000/health"
+  expect:
+    contains: '"status":"ok"'
+
+# File existence
+- type: file-exists
+  path: src/feature.ts
+
+# File content
+- type: file-contains
+  path: src/feature.ts
+  pattern: "export function"
+
+# Negative file content
+- type: file-not-contains
+  path: .env.example
+  pattern: "real-api-key"
+```
+
+### Agent (Verifier interprets, collects evidence, may generate test)
+
+```yaml
+- type: agent
+  name: descriptive-name
+  prompt: |
+    Clear instructions for what to verify.
+    Be specific about:
+    - What to do
+    - What to check
+    - What success looks like
+  evidence:
+    - screenshot: step-name    # Capture screenshot
+    - url: contains "pattern"  # Check URL
+    - element: "selector"      # Check element exists
+    - text: "expected text"    # Check text visible
+    - response: status 200     # Check HTTP response
+  generate_test: true|false    # Should verifier write a test?
+```
+
+## Workflow
+
+### User asks to create evals
+
+**User**: Create evals for user authentication
+
+**Me**: I'll define verification criteria. Questions:
+1. Auth method? (email/password, OAuth, magic link?)
+2. Should I verify UI, API, or both?
+3. Security requirements? (password hashing, rate limiting?)
+
+**User**: Email/password, both UI and API, must hash passwords
+
+**Me**: Creating `.claude/evals/auth.yaml`:
+
+```yaml
+name: auth
+description: Email/password authentication with UI and API
+
+test_output:
+  framework: pytest
+  path: tests/generated/
+
+verify:
+  # --- Deterministic ---
+  - type: command
+    run: "npm test -- --grep 'auth'"
+    expect: exit_code 0
+
+  - type: file-contains
+    path: src/auth/password.ts
+    pattern: "bcrypt|argon2|scrypt"
+
+  - type: file-not-contains
+    path: src/auth/
+    pattern: "password.*=.*plaintext"
+
+  # --- Agent: API Login ---
+  - type: agent
+    name: api-login
+    prompt: |
+      Test login API:
+      1. POST /api/auth/login with {"email": "test@example.com", "password": "password123"}
+      2. Verify 200 response with JWT token
+      3. POST with wrong password
+      4. Verify 401 with helpful error message
+    evidence:
+      - response: status 200
+      - response: has "token"
+      - response: status 401
+    generate_test: true
+
+  # --- Agent: UI Login ---
+  - type: agent
+    name: ui-login
+    prompt: |
+      Test login UI:
+      1. Go to /login
+      2. Enter valid credentials
+      3. Submit
+      4. Verify redirect to /dashboard
+      5. Verify welcome message
+    evidence:
+      - screenshot: login-page
+      - screenshot: after-submit
+      - url: contains "/dashboard"
+      - element: '[data-testid="welcome"]'
+    generate_test: true
+
+  # --- Agent: Error Handling ---
+  - type: agent
+    name: login-errors
+    prompt: |
+      Test error handling:
+      1. Submit empty form - verify field validation
+      2. Submit wrong password - verify error message is helpful
+      3. Verify password field clears but email persists
+    evidence:
+      - screenshot: validation-errors
+      - screenshot: wrong-password
+      - text: "Invalid email or password"
+    generate_test: true
+```
+
+Now run `/eval verify auth` to have the verifier agent:
+1. Run deterministic checks
+2. Execute agent checks with evidence collection
+3. Generate tests in `tests/generated/`
+4. Report results
+
+## Best Practices
+
+### Be Specific in Prompts
+```yaml
+# ❌ Vague
+prompt: "Make sure login works"
+
+# ✅ Specific
+prompt: |
+  1. Navigate to /login
+  2. Enter test@example.com in email field
+  3. Enter password123 in password field
+  4. Click submit button
+  5. Verify URL is /dashboard
+  6. Verify text "Welcome" is visible
+```
+
+### Specify Evidence
+```yaml
+# ❌ No evidence
+- type: agent
+  prompt: "Check the UI looks right"
+
+# ✅ Evidence defined
+- type: agent
+  prompt: "Check login form has email and password fields"
+  evidence:
+    - screenshot: login-form
+    - element: 'input[type="email"]'
+    - element: 'input[type="password"]'
+```
+
+### Enable Test Generation for Repeatables
+```yaml
+# UI flows → generate tests (repeatable)
+- type: agent
+  name: checkout-flow
+  generate_test: true
+
+# Subjective review → no test (human judgment)
+- type: agent
+  name: code-quality
+  generate_test: false
+  prompt: "Review error messages for helpfulness"
+```
+
+## Directory Structure
+
+After running evals:
+
+```
+.claude/
+├── evals/
+│   ├── auth.yaml                  # Eval spec (I create this)
+│   └── .evidence/
+│       ├── auth/
+│       │   ├── ui-login-001.png
+│       │   ├── ui-login-002.png
+│       │   └── evidence.json      # Structured evidence
+│       └── ...
+tests/
+└── generated/
+    ├── test_auth_api_login.py     # Verifier generates
+    ├── test_auth_ui_login.py      # Verifier generates
+    └── ...
+```
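The deterministic check types defined in the spec format above (`command`, `file-exists`, `file-contains`, `file-not-contains`) are mechanical enough to sketch in code. The following is a hypothetical, stdlib-only illustration of how a verifier could dispatch on them — the actual verifier in this patch is an agent prompt, not this code, and the dict-based input stands in for a parsed YAML spec:

```python
# Hypothetical sketch of running the deterministic check types.
# The "type"/"run"/"path"/"pattern"/"expect" keys mirror the eval spec
# format; everything else here is illustrative, not part of the patch.
import re
import subprocess
from pathlib import Path

def run_check(check: dict) -> bool:
    t = check["type"]
    if t == "command":
        # "expect: exit_code N" → compare against the process return code
        proc = subprocess.run(check["run"], shell=True,
                              capture_output=True, text=True)
        expected = int(check.get("expect", "exit_code 0").split()[-1])
        return proc.returncode == expected
    if t == "file-exists":
        return Path(check["path"]).is_file()
    # file-contains / file-not-contains: regex search over file body
    p = Path(check["path"])
    body = p.read_text() if p.is_file() else ""
    found = re.search(check["pattern"], body) is not None
    return found if t == "file-contains" else not found

checks = [
    {"type": "command", "run": "echo ok", "expect": "exit_code 0"},
    {"type": "file-not-contains", "path": ".env", "pattern": "sk-"},
]
results = [run_check(c) for c in checks]
print(results)  # e.g. [True, True] when .env is absent or clean
```

A missing file passes `file-not-contains` by design here (no file, no leaked secret), which matches how the agent-prompt version treats a clean tree; a stricter verifier might instead fail the check and note "file not found" in the evidence.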