# eval-skill

Give Claude a verification loop. Define acceptance criteria before implementation, then let Claude check its own work.

## The Problem

> *"How will the agent know it did the right thing?"*
> — [Thorsten Ball](https://x.com/thorstenball)

Without verification, Claude implements and hopes. With verification, Claude implements and **knows**.

## The Solution

```
┌─────────────────────────────────────────────────────────────┐
│  1. SKILL: eval                                             │
│     "Create evals for auth"                                 │
│     → Generates .claude/evals/auth.yaml                     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  2. AGENT: eval-verifier                                    │
│     "/eval verify auth"                                     │
│     → Runs checks                                           │
│     → Collects evidence (screenshots, outputs)              │
│     → Generates executable tests                            │
│     → Reports pass/fail                                     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  3. OUTPUT                                                  │
│     .claude/evals/.evidence/auth/   ← Screenshots, logs     │
│     tests/generated/test_auth.py    ← Executable tests      │
└─────────────────────────────────────────────────────────────┘
```

## Install

```bash
git clone https://github.com/yourusername/eval-skill.git
cd eval-skill

# Install to current project
./install.sh

# Or install globally (all projects)
./install.sh --global
```

## Usage

### 1. Create Evals (Before Implementation)

```
> Create evals for user authentication
```

Claude generates `.claude/evals/auth.yaml`:

```yaml
name: auth
description: Email/password authentication

test_output:
  framework: pytest
  path: tests/generated/

verify:
  # Deterministic
  - type: command
    run: "npm test -- --grep 'auth'"
    expect: exit_code 0

  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt|argon2"

  # Agent-based (with evidence + test generation)
  - type: agent
    name: ui-login
    prompt: |
      1. Go to /login
      2. Enter test@example.com / password123
      3. Submit
      4. Verify redirect to /dashboard
    evidence:
      - screenshot: after-login
      - url: contains "/dashboard"
    generate_test: true
```

### 2. Implement

```
> Implement auth based on .claude/evals/auth.yaml
```

### 3. Verify

```
> /eval verify auth
```

Output:

```
🔍 Eval: auth
═══════════════════════════════════════

Deterministic:
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt in password.ts

Agent:
  ✅ ui-login: Dashboard redirect works
     📸 Evidence: 2 screenshots saved
     📄 Test: tests/generated/test_auth_ui_login.py

═══════════════════════════════════════
📊 Results: 3/3 passed
```

### 4. Run Generated Tests (Forever)

```bash
pytest tests/generated/
```

The agent converted its semantic verification into deterministic tests.

## How It Works

### Non-Deterministic → Deterministic

Agent checks are semantic: "verify login works." But we need proof.

1. **Verifier runs the check** (browser automation, API calls, file inspection)
2. **Collects evidence** (screenshots, responses, DOM snapshots)
3. **Generates executable test** (pytest/vitest)
4. **Future runs use the test** (no agent needed)

```
Agent Check (expensive) → Evidence (proof) → Test (cheap, repeatable)
          ↓                      ↓                      ↓
    "Login works"      screenshot + url check  pytest + playwright
```

### Evidence-Based Verification

The verifier can't just say "pass." It must provide evidence:

```yaml
- type: agent
  name: login-flow
  prompt: "Verify login redirects to dashboard"
  evidence:
    - screenshot: login-page
    - screenshot: after-submit
    - url: contains "/dashboard"
    - element: '[data-testid="welcome"]'
```

Evidence is saved to `.claude/evals/.evidence/<eval>/`:
```json
{
  "eval": "auth",
  "checks": [{
    "name": "login-flow",
    "pass": true,
    "evidence": [
      {"type": "screenshot", "path": "login-page.png"},
      {"type": "screenshot", "path": "after-submit.png"},
      {"type": "url", "expected": "contains /dashboard", "actual": "http://localhost:3000/dashboard"},
      {"type": "element", "selector": "[data-testid=welcome]", "found": true}
    ]
  }]
}
```

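Downstream tooling can gate on this record. A minimal sketch, assuming the `evidence.json` shape shown above (the `eval_passed` helper is hypothetical):

```python
import json
from pathlib import Path

def eval_passed(evidence_file: Path) -> bool:
    """True only if every check passed AND carries at least one piece of evidence."""
    record = json.loads(evidence_file.read_text())
    # A passing check with an empty evidence list still counts as a failure:
    # the verifier is not allowed to just say "pass".
    return all(check["pass"] and check["evidence"] for check in record["checks"])
```
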
## Check Types

### Deterministic (Fast, No Agent)

```yaml
# Command + exit code
- type: command
  run: "pytest tests/"
  expect: exit_code 0

# Command + output
- type: command
  run: "curl localhost:3000/health"
  expect:
    contains: '"status":"ok"'

# File exists
- type: file-exists
  path: src/feature.ts

# File contains pattern
- type: file-contains
  path: src/auth.ts
  pattern: "bcrypt"

# File does NOT contain
- type: file-not-contains
  path: .env
  pattern: "sk-"
```

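These deterministic types are simple enough to evaluate without an agent at all. A minimal Python runner sketch; the `run_check` helper and the dict shapes mirror the YAML above but are assumptions, not the skill's actual implementation:

```python
import re
import subprocess
from pathlib import Path

def run_check(check: dict) -> bool:
    """Evaluate one deterministic check dict (shapes mirror the YAML above)."""
    kind = check["type"]
    if kind == "command":
        result = subprocess.run(check["run"], shell=True, capture_output=True, text=True)
        expect = check.get("expect", "exit_code 0")
        if isinstance(expect, dict):          # e.g. {"contains": '"status":"ok"'}
            return expect["contains"] in result.stdout
        return result.returncode == int(expect.split()[-1])  # "exit_code 0"
    if kind == "file-exists":
        return Path(check["path"]).exists()
    if kind == "file-contains":
        return bool(re.search(check["pattern"], Path(check["path"]).read_text()))
    if kind == "file-not-contains":
        return not re.search(check["pattern"], Path(check["path"]).read_text())
    raise ValueError(f"unknown check type: {kind}")
```

Note that `pattern` is treated as a regex here, which is what makes alternations like `"bcrypt|argon2"` work.
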
### Agent (Semantic, Evidence-Based)
```yaml
- type: agent
  name: descriptive-name
  prompt: |
    Step-by-step verification instructions
  evidence:
    - screenshot: step-name
    - url: contains "pattern"
    - element: "css-selector"
    - text: "expected text"
    - response: status 200
  generate_test: true  # Write executable test
```

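The evidence entries use small expression strings such as `contains "pattern"` and `status 200`. One way a verifier might interpret them (a hypothetical parse, not the skill's defined grammar):

```python
import re

def evidence_satisfied(spec: str, actual: str) -> bool:
    """Check one evidence expression string against an observed value."""
    # contains "X"  -> substring match against the observed value
    if m := re.fullmatch(r'contains\s+"(.+)"', spec):
        return m.group(1) in actual
    # status NNN    -> compare against the observed status code (as a string)
    if m := re.fullmatch(r"status\s+(\d+)", spec):
        return actual == m.group(1)
    raise ValueError(f"unrecognized evidence expression: {spec}")
```
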
## Commands

| Command | Description |
|---------|-------------|
| `/eval list` | List all evals |
| `/eval show <name>` | Display eval spec |
| `/eval verify <name>` | Run verification |
| `/eval verify` | Run all evals |
| `/eval evidence <name>` | Show collected evidence |
| `/eval tests` | List generated tests |
| `/eval clean` | Remove evidence + generated tests |

## Directory Structure
```
.claude/
├── skills/eval/SKILL.md        # Eval generation skill
├── agents/eval-verifier.md     # Verification agent
├── commands/eval.md            # /eval command
└── evals/
    ├── auth.yaml               # Your eval specs
    ├── checkout.yaml
    └── .evidence/
        ├── auth/
        │   ├── evidence.json
        │   └── *.png
        └── checkout/
            └── ...

tests/
└── generated/                  # Tests written by verifier
    ├── test_auth_ui_login.py
    └── test_auth_api_login.py
```

## Requirements

- Claude Code with skills/agents/commands support
- For UI testing: `npm install -g @anthropic/agent-browser`

## Philosophy

**TDD for Agents:**

| Traditional TDD | Agent TDD |
|-----------------|-----------|
| Write tests | Write evals |
| Write code | Claude writes code |
| Tests pass | Claude verifies + generates tests |

**Why generate tests?**

Agent verification is expensive (tokens, time). But once verified, we encode that verification as a test. Future runs use the test — no agent needed.

**Mix deterministic and semantic:**

- Deterministic: "tests pass", "file exists", "command succeeds"
- Semantic: "UI looks right", "error is helpful", "code is readable"

Use deterministic where possible, semantic where necessary.

## License
MIT