eval-skill/README.md

eval-skill

Give Claude a verification loop: define acceptance criteria before implementation, then let Claude check its own work.

The Problem

"How will the agent know it did the right thing?" (Thorsten Ball)

Without verification, Claude implements and hopes. With verification, Claude implements and knows.

The Solution

┌─────────────────────────────────────────────────────────────┐
│  1. SKILL: eval                                             │
│     "Create evals for auth"                                 │
│     → Generates .claude/evals/auth.yaml                     │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  2. AGENT: eval-verifier                                    │
│     "/eval verify auth"                                     │
│     → Runs checks                                           │
│     → Collects evidence (screenshots, outputs)              │
│     → Generates executable tests                            │
│     → Reports pass/fail                                     │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  3. OUTPUT                                                  │
│     .claude/evals/.evidence/auth/  ← Screenshots, logs      │
│     tests/generated/test_auth.py   ← Executable tests       │
└─────────────────────────────────────────────────────────────┘

Install

git clone https://github.com/yourusername/eval-skill.git
cd eval-skill

# Install to current project
./install.sh

# Or install globally (all projects)
./install.sh --global

Usage

1. Create Evals (Before Implementation)

> Create evals for user authentication

Claude generates .claude/evals/auth.yaml:

name: auth
description: Email/password authentication

test_output:
  framework: pytest
  path: tests/generated/

verify:
  # Deterministic
  - type: command
    run: "npm test -- --grep 'auth'"
    expect: exit_code 0
    
  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt|argon2"

  # Agent-based (with evidence + test generation)
  - type: agent
    name: ui-login
    prompt: |
      1. Go to /login
      2. Enter test@example.com / password123
      3. Submit
      4. Verify redirect to /dashboard
    evidence:
      - screenshot: after-login
      - url: contains "/dashboard"
    generate_test: true

2. Implement

> Implement auth based on .claude/evals/auth.yaml

3. Verify

> /eval verify auth

Output:

🔍 Eval: auth
═══════════════════════════════════════

Deterministic:
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt in password.ts

Agent:
  ✅ ui-login: Dashboard redirect works
     📸 Evidence: 2 screenshots saved
     📄 Test: tests/generated/test_auth_ui_login.py

═══════════════════════════════════════
📊 Results: 3/3 passed

4. Run Generated Tests (Forever)

pytest tests/generated/

The agent converted its semantic verification into deterministic tests.

How It Works

Non-Deterministic → Deterministic

Agent checks are semantic: "verify login works." But we need proof.

  1. Verifier runs the check (browser automation, API calls, file inspection)
  2. Collects evidence (screenshots, responses, DOM snapshots)
  3. Generates executable test (pytest/vitest)
  4. Future runs use the test (no agent needed)
Agent Check (expensive)    →    Evidence (proof)    →    Test (cheap, repeatable)
     ↓                              ↓                          ↓
"Login works"              screenshot + url check      pytest + playwright

Evidence-Based Verification

The verifier can't just say "pass." It must provide evidence:

- type: agent
  name: login-flow
  prompt: "Verify login redirects to dashboard"
  evidence:
    - screenshot: login-page
    - screenshot: after-submit
    - url: contains "/dashboard"
    - element: '[data-testid="welcome"]'

Evidence is saved to .claude/evals/.evidence/<eval>/:

{
  "eval": "auth",
  "checks": [{
    "name": "login-flow",
    "pass": true,
    "evidence": [
      {"type": "screenshot", "path": "login-page.png"},
      {"type": "screenshot", "path": "after-submit.png"},
      {"type": "url", "expected": "contains /dashboard", "actual": "http://localhost:3000/dashboard"},
      {"type": "element", "selector": "[data-testid=welcome]", "found": true}
    ]
  }]
}
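Because the evidence file is plain JSON, it is easy to consume programmatically, for example as a CI gate. A minimal sketch (field names follow the example above; `eval_passed` is a hypothetical helper, not part of eval-skill):

```python
# Load an evidence.json report and decide whether the eval passed:
# it passes only if every recorded check did.
import json

def eval_passed(evidence_path: str) -> bool:
    with open(evidence_path) as f:
        report = json.load(f)
    return all(check["pass"] for check in report["checks"])
```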

Check Types

Deterministic (Fast, No Agent)

# Command + exit code
- type: command
  run: "pytest tests/"
  expect: exit_code 0

# Command + output
- type: command
  run: "curl localhost:3000/health"
  expect:
    contains: '"status":"ok"'

# File exists
- type: file-exists
  path: src/feature.ts

# File contains pattern
- type: file-contains
  path: src/auth.ts
  pattern: "bcrypt"

# File does NOT contain
- type: file-not-contains
  path: .env
  pattern: "sk-"
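Each deterministic check type maps to a simple operation, so a runner is a small dispatch. The sketch below is illustrative, assuming the YAML field names shown above; the actual verifier's implementation may differ.

```python
# Minimal sketch of a deterministic-check runner. Field names follow
# the YAML examples; this is not the real verifier implementation.
import re
import subprocess
from pathlib import Path

def run_check(check: dict) -> bool:
    kind = check["type"]
    if kind == "command":
        result = subprocess.run(check["run"], shell=True,
                                capture_output=True, text=True)
        expect = check.get("expect", "exit_code 0")
        if isinstance(expect, dict) and "contains" in expect:
            return expect["contains"] in result.stdout
        # e.g. "exit_code 0" -> compare against returncode
        return result.returncode == int(str(expect).split()[-1])
    if kind == "file-exists":
        return Path(check["path"]).exists()
    if kind == "file-contains":
        return re.search(check["pattern"], Path(check["path"]).read_text()) is not None
    if kind == "file-not-contains":
        return re.search(check["pattern"], Path(check["path"]).read_text()) is None
    raise ValueError(f"unknown check type: {kind}")
```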

Agent (Semantic, Evidence-Based)

- type: agent
  name: descriptive-name
  prompt: |
    Step-by-step verification instructions
  evidence:
    - screenshot: step-name
    - url: contains "pattern"
    - element: "css-selector"
    - text: "expected text"
    - response: status 200
  generate_test: true  # Write executable test

Commands

Command                  Description
/eval list               List all evals
/eval show <name>        Display eval spec
/eval verify <name>      Run verification
/eval verify             Run all evals
/eval evidence <name>    Show collected evidence
/eval tests              List generated tests
/eval clean              Remove evidence + generated tests

Directory Structure

.claude/
├── skills/eval/SKILL.md       # Eval generation skill
├── agents/eval-verifier.md    # Verification agent
├── commands/eval.md           # /eval command
└── evals/
    ├── auth.yaml              # Your eval specs
    ├── checkout.yaml
    └── .evidence/
        ├── auth/
        │   ├── evidence.json
        │   └── *.png
        └── checkout/
            └── ...

tests/
└── generated/                  # Tests written by verifier
    ├── test_auth_ui_login.py
    └── test_auth_api_login.py

Requirements

  • Claude Code with skills/agents/commands support
  • For UI testing: npm install -g @anthropic/agent-browser

Philosophy

TDD for Agents:

Traditional TDD    Agent TDD
Write tests        Write evals
Write code         Claude writes code
Tests pass         Claude verifies + generates tests

Why generate tests?

Agent verification is expensive (tokens, time). But once verified, we encode that verification as a test. Future runs use the test — no agent needed.

Mix deterministic and semantic:

  • Deterministic: "tests pass", "file exists", "command succeeds"
  • Semantic: "UI looks right", "error is helpful", "code is readable"

Use deterministic where possible, semantic where necessary.

License

MIT