mirror of https://github.com/harivansh-afk/eval-skill.git
synced 2026-04-18 02:03:06 +00:00

iterate

parent aca2126c88
commit 7c63331389
5 changed files with 520 additions and 664 deletions
@@ -1,280 +1,244 @@
 ---
 name: eval
-description: Generate evaluation specs for code verification. Use when setting up tests, defining acceptance criteria, or creating verification checkpoints before implementing features. Triggers on "create evals", "define acceptance criteria", "set up verification", or "how will we know this works".
+description: Generate evaluation specs with building and verification criteria. Use when setting up features, defining acceptance criteria, or before implementing anything significant. Triggers on "create evals", "set up verification", "define acceptance criteria", or "build [feature]".
 allowed-tools: Read, Grep, Glob, Write, Edit
 ---

 # Eval Skill

-Generate evaluation specs (YAML) that define what to verify. I do NOT run verification — that's the verifier agent's job.
+Generate specs that define **what to build** and **how to verify it**.

-## My Responsibilities
+## Output

-1. Understand what needs verification
-2. Ask clarifying questions
-3. Generate `.claude/evals/<name>.yaml` specs
-4. Define checks with clear success criteria
+I create `.claude/evals/<name>.yaml` with two sections:

-## What I Do NOT Do
+1. **building_spec** — What the builder agent implements
+2. **verification_spec** — What the verifier agent checks

-- Run tests or commands
-- Collect evidence
-- Generate test code
-- Make pass/fail judgments
-
-## Eval Spec Format
+## Format

 ```yaml
 name: feature-name
-description: What this eval verifies
+description: One-line summary
+
+building_spec:
+  description: What to build
+  requirements:
+    - Requirement 1
+    - Requirement 2
+  constraints:
+    - Constraint 1
+  files:
+    - suggested/file/paths.ts

-# Where generated tests should go
 test_output:
-  framework: pytest # or vitest, jest
+  framework: pytest | vitest | jest
   path: tests/generated/

-verify:
-  # === DETERMINISTIC CHECKS ===
-  # These run as-is, fast and reliable
-
+verification_spec:
+  # Deterministic checks
   - type: command
-    run: "npm test -- --grep 'auth'"
+    run: "npm test"
     expect: exit_code 0

-  - type: file-exists
-    path: src/auth/login.ts
-
-  - type: file-contains
-    path: src/auth/login.ts
-    pattern: "export function login"
-
-  - type: file-not-contains
-    path: src/config.ts
-    pattern: "API_KEY=sk-"
-
-  # === AGENT CHECKS ===
-  # Verifier agent runs these, collects evidence, generates tests
-
+  # Agent checks
   - type: agent
-    name: login-flow # Used for evidence/test naming
+    name: check-name
     prompt: |
-      Verify login with valid credentials:
-      1. Navigate to /login
-      2. Enter test@example.com / password123
-      3. Submit form
-      4. Verify redirect to /dashboard
-      5. Verify welcome message visible
+      What to verify
     evidence:
-      - screenshot: after-login
-      - url: contains "/dashboard"
-      - element: '[data-testid="welcome"]'
-    generate_test: true # Verifier should write a test for this
-```
-
-## Check Types
-
-### Deterministic (Verifier runs directly)
-
-```yaml
-# Command with exit code
-- type: command
-  run: "pytest tests/auth/"
-  expect: exit_code 0
-
-# Command with output check
-- type: command
-  run: "curl -s localhost:3000/health"
-  expect:
-    contains: '"status":"ok"'
-
-# File existence
-- type: file-exists
-  path: src/feature.ts
-
-# File content
-- type: file-contains
-  path: src/feature.ts
-  pattern: "export function"
-
-# Negative file content
-- type: file-not-contains
-  path: .env.example
-  pattern: "real-api-key"
-```
-
-### Agent (Verifier interprets, collects evidence, may generate test)
-
-```yaml
-- type: agent
-  name: descriptive-name
-  prompt: |
-    Clear instructions for what to verify.
-    Be specific about:
-    - What to do
-    - What to check
-    - What success looks like
-  evidence:
-    - screenshot: step-name # Capture screenshot
-    - url: contains "pattern" # Check URL
-    - element: "selector" # Check element exists
-    - text: "expected text" # Check text visible
-    - response: status 200 # Check HTTP response
-  generate_test: true|false # Should verifier write a test?
+      - screenshot: name
+      - url: contains "pattern"
+    generate_test: true
 ```

 ## Workflow

-### User asks to create evals
+### User Request

-**User**: Create evals for user authentication
+```
+Create evals for user authentication
+```

-**Me**: I'll define verification criteria. Questions:
-1. Auth method? (email/password, OAuth, magic link?)
-2. Should I verify UI, API, or both?
-3. Security requirements? (password hashing, rate limiting?)
+### My Questions

-**User**: Email/password, both UI and API, must hash passwords
+Before generating, I ask:
+1. What auth method? (email/password, OAuth, magic link?)
+2. UI, API, or both?
+3. Specific security requirements?

-**Me**: Creating `.claude/evals/auth.yaml`:
+### My Output
+
+`.claude/evals/auth.yaml`:

 ```yaml
 name: auth
 description: Email/password authentication with UI and API

+building_spec:
+  description: |
+    User authentication system with email/password.
+    Secure password storage, JWT tokens, login/signup flows.
+  requirements:
+    - Password hashing with bcrypt (cost factor 12+)
+    - JWT tokens with 24h expiry
+    - POST /api/auth/login endpoint
+    - POST /api/auth/signup endpoint
+    - Login page at /login
+    - Signup page at /signup
+    - Protected route middleware
+  constraints:
+    - No plaintext passwords anywhere
+    - Tokens must be httpOnly cookies or secure headers
+  files:
+    - src/auth/password.ts
+    - src/auth/jwt.ts
+    - src/auth/middleware.ts
+    - src/routes/auth.ts
+    - src/pages/login.tsx
+    - src/pages/signup.tsx
+
 test_output:
   framework: pytest
   path: tests/generated/

-verify:
+verification_spec:
   # --- Deterministic ---
   - type: command
-    run: "npm test -- --grep 'auth'"
+    run: "npm test -- --grep auth"
     expect: exit_code 0

   - type: file-contains
     path: src/auth/password.ts
-    pattern: "bcrypt|argon2|scrypt"
+    pattern: "bcrypt"

   - type: file-not-contains
-    path: src/auth/
+    path: src/
     pattern: "password.*=.*plaintext"

-  # --- Agent: API Login ---
+  # --- Agent: API ---
   - type: agent
     name: api-login
     prompt: |
       Test login API:
-      1. POST /api/auth/login with {"email": "test@example.com", "password": "password123"}
-      2. Verify 200 response with JWT token
-      3. POST with wrong password
-      4. Verify 401 with helpful error message
+      1. POST /api/auth/signup with new user
+      2. Verify 201 response
+      3. POST /api/auth/login with same creds
+      4. Verify 200 with JWT token
+      5. POST /api/auth/login with wrong password
+      6. Verify 401 with helpful message
     evidence:
+      - response: status 201
       - response: status 200
       - response: has "token"
       - response: status 401
     generate_test: true

-  # --- Agent: UI Login ---
+  # --- Agent: UI ---
   - type: agent
     name: ui-login
     prompt: |
       Test login UI:
       1. Go to /login
-      2. Enter valid credentials
-      3. Submit
+      2. Verify form has email + password fields
+      3. Submit with valid credentials
       4. Verify redirect to /dashboard
-      5. Verify welcome message
+      5. Verify welcome message visible
     evidence:
       - screenshot: login-page
-      - screenshot: after-submit
+      - screenshot: after-login
       - url: contains "/dashboard"
       - element: '[data-testid="welcome"]'
     generate_test: true

-  # --- Agent: Error Handling ---
+  # --- Agent: Security ---
   - type: agent
-    name: login-errors
+    name: password-security
     prompt: |
-      Test error handling:
-      1. Submit empty form - verify field validation
-      2. Submit wrong password - verify error message is helpful
-      3. Verify password field clears but email persists
+      Verify password security:
+      1. Read src/auth/password.ts
+      2. Confirm bcrypt with cost >= 12
+      3. Confirm no password logging
+      4. Check signup doesn't echo password
     evidence:
-      - screenshot: validation-errors
-      - screenshot: wrong-password
-      - text: "Invalid email or password"
-    generate_test: true
+      - text: "bcrypt"
+      - text: "cost" or "rounds"
+    generate_test: false # Code review, not repeatable test
 ```

-Now run `/eval verify auth` to have the verifier agent:
-1. Run deterministic checks
-2. Execute agent checks with evidence collection
-3. Generate tests in `tests/generated/`
-4. Report results
+## Check Types
+
+### Deterministic
+
+```yaml
+- type: command
+  run: "shell command"
+  expect: exit_code 0
+
+- type: command
+  run: "curl localhost:3000/health"
+  expect:
+    contains: '"ok"'
+
+- type: file-exists
+  path: src/file.ts
+
+- type: file-contains
+  path: src/file.ts
+  pattern: "regex pattern"
+
+- type: file-not-contains
+  path: src/file.ts
+  pattern: "bad pattern"
+```
+
+### Agent
+
+```yaml
+- type: agent
+  name: descriptive-name # Used for evidence/test naming
+  prompt: |
+    Step-by-step verification
+  evidence:
+    - screenshot: step-name
+    - url: contains "pattern"
+    - element: "css-selector"
+    - text: "expected text"
+    - response: status 200
+    - response: has "field"
+  generate_test: true | false
+```

 ## Best Practices

-### Be Specific in Prompts
-```yaml
-# ❌ Vague
-prompt: "Make sure login works"
+### Building Spec

-# ✅ Specific
-prompt: |
-  1. Navigate to /login
-  2. Enter test@example.com in email field
-  3. Enter password123 in password field
-  4. Click submit button
-  5. Verify URL is /dashboard
-  6. Verify text "Welcome" is visible
-```
+- **Be specific** — "bcrypt with cost 12" not "secure passwords"
+- **List files** — helps builder know where to put code
+- **State constraints** — what NOT to do matters

-### Specify Evidence
-```yaml
-# ❌ No evidence
-- type: agent
-  prompt: "Check the UI looks right"
+### Verification Spec

-# ✅ Evidence defined
-- type: agent
-  prompt: "Check login form has email and password fields"
-  evidence:
-    - screenshot: login-form
-    - element: 'input[type="email"]'
-    - element: 'input[type="password"]'
-```
+- **Deterministic first** — fast, reliable checks
+- **Agent for semantics** — UI flows, code quality, error messages
+- **Evidence always** — no claim without proof
+- **generate_test for repeatables** — UI flows yes, code review no

-### Enable Test Generation for Repeatables
-```yaml
-# UI flows → generate tests (repeatable)
-- type: agent
-  name: checkout-flow
-  generate_test: true
+### Naming

-# Subjective review → no test (human judgment)
-- type: agent
-  name: code-quality
-  generate_test: false
-  prompt: "Review error messages for helpfulness"
-```
+- `name: feature-name` — lowercase, hyphens
+- `name: api-login` — for agent checks, descriptive

-## Directory Structure
+## What Happens Next

-After running evals:
+After I create the spec:

 ```
-.claude/
-├── evals/
-│   ├── auth.yaml # Eval spec (I create this)
-│   └── .evidence/
-│       ├── auth/
-│       │   ├── ui-login-001.png
-│       │   ├── ui-login-002.png
-│       │   └── evidence.json # Structured evidence
-│       └── ...
-tests/
-└── generated/
-    ├── test_auth_api_login.py # Verifier generates
-    ├── test_auth_ui_login.py # Verifier generates
-    └── ...
+/eval build auth
 ```
+
+1. Builder agent reads `building_spec`, implements
+2. Verifier agent reads `verification_spec`, checks
+3. If fail → builder gets feedback → fixes → verifier re-checks
+4. Loop until pass
+5. Agent checks become tests in `tests/generated/`
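
The build/verify cycle listed under "What Happens Next" can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the skill's actual implementation: `VerifyResult`, `build_verify_loop`, and the `max_rounds` cap are hypothetical names introduced here, and the cap is an added safeguard so the loop always terminates (the spec itself only says "loop until pass").

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class VerifyResult:
    """Outcome of one verifier pass (hypothetical shape)."""
    passed: bool
    feedback: str = ""


def build_verify_loop(
    build: Callable[[Optional[str]], None],
    verify: Callable[[], VerifyResult],
    max_rounds: int = 5,  # assumption: cap the loop so it cannot run forever
) -> List[VerifyResult]:
    """Alternate builder and verifier until the verifier passes."""
    history: List[VerifyResult] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        build(feedback)            # builder implements, or fixes using feedback
        result = verify()          # verifier checks against verification_spec
        history.append(result)
        if result.passed:
            break
        feedback = result.feedback  # a failure feeds back into the next build
    return history


# Stand-in agents: the "builder" only succeeds on its third attempt.
attempts: List[Optional[str]] = []


def fake_build(feedback: Optional[str]) -> None:
    attempts.append(feedback)


def fake_verify() -> VerifyResult:
    return VerifyResult(passed=len(attempts) >= 3, feedback="bcrypt cost too low")


history = build_verify_loop(fake_build, fake_verify)
print(len(history), history[-1].passed)  # prints: 3 True
```

The first build receives no feedback; every later build receives the feedback from the preceding failed verification, which mirrors step 3 of the list.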
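
For the deterministic check types in the new spec (`command`, `file-exists`, `file-contains`, `file-not-contains`), a verifier could evaluate each entry along these lines. This is a minimal sketch, not the verifier agent's real code: `run_check` is a hypothetical helper, each check is assumed to be already parsed into a Python dict, `expect.contains` is assumed to mean "substring of stdout", and `pattern` is treated as a regular expression (the spec's placeholder reads "regex pattern").

```python
import re
import subprocess
from pathlib import Path


def run_check(check: dict, root: Path = Path(".")) -> bool:
    """Evaluate one deterministic check; return True if it passes."""
    kind = check["type"]

    if kind == "command":
        proc = subprocess.run(
            check["run"], shell=True, cwd=root,
            capture_output=True, text=True,
        )
        expect = check.get("expect", "exit_code 0")
        if isinstance(expect, dict):
            # Assumed meaning of `expect: {contains: ...}`: substring of stdout.
            return expect["contains"] in proc.stdout
        # Assumed string form: "exit_code <n>".
        return proc.returncode == int(expect.split()[-1])

    path = root / check["path"]

    if kind == "file-exists":
        return path.exists()

    if kind in ("file-contains", "file-not-contains"):
        # The example spec points `path` at a directory (src/), so search recursively.
        candidates = path.rglob("*") if path.is_dir() else [path]
        found = any(
            re.search(check["pattern"], f.read_text(errors="ignore"))
            for f in candidates
            if f.is_file()
        )
        return found if kind == "file-contains" else not found

    raise ValueError(f"not a deterministic check type: {kind}")
```

Called with the `file-contains` entry from the auth example, `run_check({"type": "file-contains", "path": "src/auth/password.ts", "pattern": "bcrypt"})` returns `True` only when that file exists and mentions `bcrypt`. Checks of type `agent` are deliberately rejected, since the spec leaves those to the verifier agent's judgment.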