mirror of
https://github.com/harivansh-afk/eval-skill.git
synced 2026-04-15 08:03:44 +00:00
244 lines
5.5 KiB
Markdown
---
name: eval
description: Generate evaluation specs with building and verification criteria. Use when setting up features, defining acceptance criteria, or before implementing anything significant. Triggers on "create evals", "set up verification", "define acceptance criteria", or "build [feature]".
allowed-tools: Read, Grep, Glob, Write, Edit
---

# Eval Skill

Generate specs that define **what to build** and **how to verify it**.

## Output

I create `.claude/evals/<name>.yaml` with two main sections:

1. **building_spec** — What the builder agent implements
2. **verification_spec** — What the verifier agent checks

The file also carries a top-level `name` and `description`, plus a `test_output` block that says where generated tests go (see Format).

## Format

```yaml
name: feature-name
description: One-line summary

building_spec:
  description: What to build
  requirements:
    - Requirement 1
    - Requirement 2
  constraints:
    - Constraint 1
  files:
    - suggested/file/paths.ts

test_output:
  framework: pytest | vitest | jest
  path: tests/generated/

verification_spec:
  # Deterministic checks
  - type: command
    run: "npm test"
    expect: exit_code 0

  # Agent checks
  - type: agent
    name: check-name
    prompt: |
      What to verify
    evidence:
      - screenshot: name
      - url: contains "pattern"
    generate_test: true
```

## Workflow

### User Request

```
Create evals for user authentication
```

### My Questions

Before generating, I ask:

1. What auth method? (email/password, OAuth, magic link?)
2. UI, API, or both?
3. Specific security requirements?

### My Output

`.claude/evals/auth.yaml`:

```yaml
name: auth
description: Email/password authentication with UI and API

building_spec:
  description: |
    User authentication system with email/password.
    Secure password storage, JWT tokens, login/signup flows.
  requirements:
    - Password hashing with bcrypt (cost factor 12+)
    - JWT tokens with 24h expiry
    - POST /api/auth/login endpoint
    - POST /api/auth/signup endpoint
    - Login page at /login
    - Signup page at /signup
    - Protected route middleware
  constraints:
    - No plaintext passwords anywhere
    - Tokens must be httpOnly cookies or secure headers
  files:
    - src/auth/password.ts
    - src/auth/jwt.ts
    - src/auth/middleware.ts
    - src/routes/auth.ts
    - src/pages/login.tsx
    - src/pages/signup.tsx

test_output:
  framework: pytest
  path: tests/generated/

verification_spec:
  # --- Deterministic ---
  - type: command
    run: "npm test -- --grep auth"
    expect: exit_code 0

  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt"

  - type: file-not-contains
    path: src/
    pattern: "password.*=.*plaintext"

  # --- Agent: API ---
  - type: agent
    name: api-login
    prompt: |
      Test login API:
      1. POST /api/auth/signup with new user
      2. Verify 201 response
      3. POST /api/auth/login with same creds
      4. Verify 200 with JWT token
      5. POST /api/auth/login with wrong password
      6. Verify 401 with helpful message
    evidence:
      - response: status 201
      - response: status 200
      - response: has "token"
      - response: status 401
    generate_test: true

  # --- Agent: UI ---
  - type: agent
    name: ui-login
    prompt: |
      Test login UI:
      1. Go to /login
      2. Verify form has email + password fields
      3. Submit with valid credentials
      4. Verify redirect to /dashboard
      5. Verify welcome message visible
    evidence:
      - screenshot: login-page
      - screenshot: after-login
      - url: contains "/dashboard"
      - element: '[data-testid="welcome"]'
    generate_test: true

  # --- Agent: Security ---
  - type: agent
    name: password-security
    prompt: |
      Verify password security:
      1. Read src/auth/password.ts
      2. Confirm bcrypt with cost >= 12
      3. Confirm no password logging
      4. Check signup doesn't echo password
    evidence:
      - text: "bcrypt"
      - text: '"cost" or "rounds"'
    generate_test: false  # Code review, not repeatable test
```

## Check Types

### Deterministic

```yaml
- type: command
  run: "shell command"
  expect: exit_code 0

- type: command
  run: "curl localhost:3000/health"
  expect:
    contains: '"ok"'

- type: file-exists
  path: src/file.ts

- type: file-contains
  path: src/file.ts
  pattern: "regex pattern"

- type: file-not-contains
  path: src/file.ts
  pattern: "bad pattern"
```
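
To make the semantics of these checks concrete, here is a minimal Python sketch of how a verifier might evaluate one deterministic check after the YAML has been parsed into a dict. The function name `run_deterministic_check` is hypothetical, and several behaviors are assumptions rather than anything this skill guarantees: `expect: exit_code N` is read as "the process exit status equals N", `expect.contains` is matched against stdout, and a directory `path` is searched recursively.

```python
import re
import subprocess
from pathlib import Path


def run_deterministic_check(check: dict, root: Path = Path(".")) -> bool:
    """Evaluate one deterministic check; return True when it passes."""
    kind = check["type"]
    if kind == "command":
        result = subprocess.run(
            check["run"], shell=True, capture_output=True, text=True, cwd=root
        )
        expect = check.get("expect", "exit_code 0")
        if isinstance(expect, dict):
            # Assumption: "contains" is a substring match against stdout.
            return expect["contains"] in result.stdout
        # Assumption: "exit_code N" means the exit status must equal N.
        return result.returncode == int(expect.split()[-1])

    target = root / check["path"]
    if kind == "file-exists":
        return target.exists()
    if kind not in ("file-contains", "file-not-contains"):
        raise ValueError(f"unknown check type: {kind}")

    # Assumption: a directory path such as "src/" is searched recursively.
    candidates = target.rglob("*") if target.is_dir() else [target]
    pattern = re.compile(check["pattern"])
    found = any(
        pattern.search(path.read_text(errors="ignore"))
        for path in candidates
        if path.is_file()
    )
    return found if kind == "file-contains" else not found
```

Under these assumptions a missing file fails `file-contains` and passes `file-not-contains`; a real verifier may prefer to report that case as an error instead.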

### Agent

```yaml
- type: agent
  name: descriptive-name  # Used for evidence/test naming
  prompt: |
    Step-by-step verification
  evidence:
    - screenshot: step-name
    - url: contains "pattern"
    - element: "css-selector"
    - text: "expected text"
    - response: status 200
    - response: has "field"
  generate_test: true | false
```
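
A spec that uses these check types can be sanity-checked before it is handed to the builder. The sketch below is illustrative only: it assumes the YAML has already been parsed into a dict, the name `validate_spec` is hypothetical, and the "required field" rules are inferred from the examples in this document rather than defined by the skill.

```python
# Key names and check types come from this document; the rules are assumptions.
REQUIRED_KEYS = ("name", "building_spec", "verification_spec")
DETERMINISTIC_FIELDS = {
    "command": ("run", "expect"),
    "file-exists": ("path",),
    "file-contains": ("path", "pattern"),
    "file-not-contains": ("path", "pattern"),
}
AGENT_FIELDS = ("name", "prompt", "evidence")


def validate_spec(spec: dict) -> list[str]:
    """Return human-readable problems; an empty list means the spec looks well-formed."""
    problems = [
        f"missing top-level key: {key}" for key in REQUIRED_KEYS if key not in spec
    ]
    for index, check in enumerate(spec.get("verification_spec", [])):
        kind = check.get("type")
        if kind == "agent":
            required = AGENT_FIELDS
        elif kind in DETERMINISTIC_FIELDS:
            required = DETERMINISTIC_FIELDS[kind]
        else:
            problems.append(f"check {index}: unknown type {kind!r}")
            continue
        for field in required:
            if field not in check:
                problems.append(f"check {index} ({kind}): missing {field}")
    return problems
```

Requiring `evidence` on every agent check mirrors the "Evidence always" practice; whether a missing field should be a hard error or a warning is a design choice left open here.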

## Best Practices

### Building Spec

- **Be specific** — "bcrypt with cost 12" not "secure passwords"
- **List files** — helps builder know where to put code
- **State constraints** — what NOT to do matters

### Verification Spec

- **Deterministic first** — fast, reliable checks
- **Agent for semantics** — UI flows, code quality, error messages
- **Evidence always** — no claim without proof
- **generate_test for repeatables** — UI flows yes, code review no
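
The "deterministic first" and "generate_test for repeatables" practices can be pictured as a small planning step over the parsed `verification_spec` list. This is a sketch under assumptions: `plan_checks` is a hypothetical helper, and it treats every check whose `type` is not `agent` as deterministic.

```python
def plan_checks(checks: list[dict]) -> tuple[list[dict], list[str]]:
    """Order checks deterministic-first and name the agent checks that become tests."""
    # sorted() is stable, so checks keep their written order within each group.
    ordered = sorted(checks, key=lambda check: check["type"] == "agent")
    to_generate = [
        check["name"]
        for check in ordered
        if check["type"] == "agent" and check.get("generate_test")
    ]
    return ordered, to_generate
```

Running the cheap checks first means an agent check is only attempted once the fast, reliable ones have passed.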

### Naming

- `name: feature-name` — lowercase, hyphens
- `name: api-login` — for agent checks, descriptive
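
The naming convention is easy to check mechanically. The pattern below is one possible reading of "lowercase, hyphens" (lowercase letters and digits in groups separated by single hyphens); allowing digits and the helper name `is_valid_name` are assumptions.

```python
import re

# Assumed rule: lowercase letters/digits in groups separated by single hyphens.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """Return True when the name follows the lowercase-with-hyphens convention."""
    return NAME_RE.fullmatch(name) is not None
```

With this rule, `auth` and `api-login` are accepted, while `API_Login` and `auth-` are rejected.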

## What Happens Next

After I create the spec:

```
/eval build auth
```

1. Builder agent reads `building_spec` and implements it
2. Verifier agent reads `verification_spec` and runs the checks
3. If fail → builder gets feedback → fixes → verifier re-checks
4. Loop until pass
5. Agent checks with `generate_test: true` become tests in `tests/generated/`
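
The build/verify loop in the steps above can be sketched as follows. This is not the actual `/eval build` implementation: `build_until_pass`, the two callables, and the `max_rounds` cap are all hypothetical, and the cap is an added safety assumption since the steps only say "loop until pass".

```python
from typing import Callable


def build_until_pass(
    build: Callable[[list[str]], None],
    verify: Callable[[], list[str]],
    max_rounds: int = 5,
) -> int:
    """Alternate build and verify until no failures remain; return rounds used."""
    feedback: list[str] = []
    for round_number in range(1, max_rounds + 1):
        build(feedback)      # first round gets no feedback, later rounds get failures
        feedback = verify()  # failure messages; an empty list means every check passed
        if not feedback:
            return round_number
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")
```

Passing the verifier's failure messages straight back to the builder is what turns a failed check into actionable feedback for the next round.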