mirror of
https://github.com/harivansh-afk/eval-skill.git
synced 2026-04-15 08:03:44 +00:00
# eval-skill

Verification-first development for Claude Code. Define what success looks like, then let Claude build and verify.

## Why

> *"How will the agent know it did the right thing?"*

Without a feedback loop, Claude implements and hopes. With one, Claude implements, checks, and iterates until it's right.

## How It Works

```
You: "Build auth with email/password"
                   │
                   ▼
┌─────────────────────────────────────┐
│  Skill: eval                        │
│  Generates:                         │
│    • verification spec (tests)      │
│    • building spec (what to build)  │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Agent: builder                     │
│  Implements from building spec      │
│  Clean context, focused on code     │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Agent: verifier                    │
│  Runs checks, collects evidence     │
│  Returns pass/fail                  │
└─────────────────────────────────────┘
                   │
                   ▼
Pass? Done.
Fail? → Builder fixes → Verifier checks → Loop
```

Each agent has isolated context. Builder doesn't hold verification logic. Verifier doesn't hold implementation details. Clean, focused, efficient.

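The loop in the diagram can be sketched in a few lines of Python. This is only an illustration of the control flow, not the skill's actual implementation: `Verdict`, `build_until_verified`, the `max_rounds` cap, and the toy builder and verifier are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    failure: Optional[str] = None  # summary of the failing check, if any


def build_until_verified(
    build: Callable[[Optional[str]], None],
    verify: Callable[[], Verdict],
    max_rounds: int = 5,  # assumed cap; the README does not state one
) -> bool:
    """Alternate builder and verifier until the checks pass or rounds run out."""
    failure: Optional[str] = None
    for _ in range(max_rounds):
        build(failure)      # builder sees only the latest failure
        verdict = verify()  # verifier runs the checks on the current code
        if verdict.passed:
            return True
        failure = verdict.failure
    return False


# Toy stand-ins: the "code" is fixed on the second build attempt.
attempts = []


def fake_build(failure):
    attempts.append(failure)


def fake_verify():
    if len(attempts) >= 2:
        return Verdict(passed=True)
    return Verdict(passed=False, failure="login returned 500")


result = build_until_verified(fake_build, fake_verify)
```

Passing only the most recent failure back to the builder is what keeps its context small from one round to the next.
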
## Install

```bash
git clone https://github.com/yourusername/eval-skill.git
cd eval-skill
./install.sh           # Current project
./install.sh --global  # All projects
```

## Usage

### Step 1: Create Specs

```
Create evals for user authentication with email/password
```

Creates `.claude/evals/auth.yaml`:

```yaml
name: auth

building_spec:
  description: Email/password auth with login/signup
  requirements:
    - Password hashing with bcrypt
    - JWT tokens on login
    - /login and /signup endpoints

verification_spec:
  - type: command
    run: "npm test -- --grep auth"
    expect: exit_code 0

  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt"

  - type: agent
    name: login-flow
    prompt: |
      1. POST /api/login with valid creds
      2. Verify JWT in response
      3. POST with wrong password
      4. Verify 401 + helpful error
    generate_test: true
```

### Step 2: Build

```
/eval build auth
```

Spawns builder agent → implements → spawns verifier → checks → iterates until pass.

### Step 3: Run Generated Tests (Forever)

```bash
pytest tests/generated/
```

Agent checks become deterministic tests. First run costs tokens. Future runs need no agent and cost no tokens.

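As a rough idea of what a test generated from the `login-flow` check might look like, here is a hypothetical sketch. The file layout, the test names, and the `login` helper are assumptions, not output of the tool. `login` is an in-memory stand-in for `POST /api/login` so that the example runs on its own.

```python
# Hypothetical shape of a file under tests/generated/.
# `login` is an in-memory stand-in for POST /api/login so the sketch runs
# anywhere; a real generated test would call the running app instead.

USERS = {"ada@example.com": "correct-horse"}


def login(email: str, password: str) -> dict:
    """Stand-in for the login endpoint: returns a status and a JSON-like body."""
    if USERS.get(email) == password:
        return {"status": 200, "body": {"token": "header.payload.signature"}}
    return {"status": 401, "body": {"error": "Invalid email or password"}}


def test_login_with_valid_credentials_returns_jwt():
    response = login("ada@example.com", "correct-horse")
    assert response["status"] == 200
    # A JWT has three dot-separated segments.
    assert len(response["body"]["token"].split(".")) == 3


def test_login_with_wrong_password_returns_401():
    response = login("ada@example.com", "wrong")
    assert response["status"] == 401
    assert response["body"]["error"]  # non-empty, human-readable message
```

Each numbered step in the agent prompt maps to a plain assertion, which is why later runs can skip the agent entirely.
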
## Commands

| Command | What it does |
|---------|--------------|
| `/eval list` | List all evals |
| `/eval show <name>` | Display spec |
| `/eval build <name>` | Build + verify loop |
| `/eval verify <name>` | Just verify, no build |

## Why Context Isolation Matters

**Without isolation:**
```
Main Claude context:
  - All verification logic
  - All implementation code
  - All error history
  - Context bloat → degraded performance
```

**With isolation:**
```
Builder context:  building spec + current failure only
Verifier context: verification spec + current code only
Main Claude:      just orchestration
```

Each agent gets exactly what it needs. Nothing more.

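One way to picture the split is as two small functions that each select part of the spec. This is a sketch under assumptions: the key names mirror the spec file from Step 1, while `builder_context` and `verifier_context` are hypothetical helpers, not part of the skill.

```python
from typing import Optional

# Key names mirror the spec file from Step 1; the functions are hypothetical.
spec = {
    "building_spec": {"description": "Email/password auth with login/signup"},
    "verification_spec": [{"type": "command", "run": "npm test -- --grep auth"}],
}


def builder_context(spec: dict, current_failure: Optional[str]) -> dict:
    """Builder gets the building spec and the latest failure, nothing else."""
    return {"building_spec": spec["building_spec"], "failure": current_failure}


def verifier_context(spec: dict, code_paths: list) -> dict:
    """Verifier gets the checks and pointers to the current code, nothing else."""
    return {"verification_spec": spec["verification_spec"], "code": code_paths}


b = builder_context(spec, "POST /api/login returned 500")
v = verifier_context(spec, ["src/auth/password.ts"])
```

Neither dictionary carries the other agent's half of the spec or any earlier error history.
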
## Check Types

**Deterministic** (fast, no agent):
```yaml
- type: command
  run: "npm test"
  expect: exit_code 0

- type: file-contains
  path: src/auth.ts
  pattern: "bcrypt"
```

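Deterministic checks like these need nothing beyond a subprocess call and a file read. The sketch below shows one possible evaluator. `run_check` is a hypothetical function, it handles only the `exit_code N` form of `expect`, and treating `pattern` as a regular expression is an assumption.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def run_check(check: dict) -> bool:
    """Evaluate one deterministic check; returns True on pass."""
    if check["type"] == "command":
        # Only the `exit_code N` form of `expect` is handled in this sketch.
        expected = int(check["expect"].split()[1])
        done = subprocess.run(check["run"], shell=True, capture_output=True)
        return done.returncode == expected
    if check["type"] == "file-contains":
        path = Path(check["path"])
        # Treating `pattern` as a regular expression is an assumption.
        return path.is_file() and re.search(check["pattern"], path.read_text()) is not None
    raise ValueError(f"not a deterministic check: {check['type']}")


# Usage with throwaway inputs instead of a real project.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "password.ts"
    source.write_text('import bcrypt from "bcrypt";\n')
    results = [
        run_check({"type": "command", "run": f'"{sys.executable}" -c "pass"', "expect": "exit_code 0"}),
        run_check({"type": "file-contains", "path": str(source), "pattern": "bcrypt"}),
        run_check({"type": "file-contains", "path": str(source), "pattern": "argon2"}),
    ]
```
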
**Agent** (semantic, generates tests):
```yaml
- type: agent
  name: ui-login
  prompt: "Navigate to /login, submit form, verify redirect"
  evidence:
    - screenshot: after-login
    - url: contains "/dashboard"
  generate_test: true
```

Agent checks produce evidence (screenshots, responses) and become executable tests.

## Directory Structure

```
.claude/
├── skills/eval/           # Generates specs
├── agents/
│   ├── eval-builder.md
│   └── eval-verifier.md
├── commands/eval.md
└── evals/
    ├── auth.yaml
    └── .evidence/         # Screenshots, logs

tests/generated/           # Tests from agent checks
```

## Requirements

- Claude Code
- For UI testing: `npm install -g @anthropic/agent-browser`

## License

MIT