mirror of
https://github.com/harivansh-afk/eval-skill.git
synced 2026-04-15 06:04:42 +00:00
| name | description | allowed-tools |
|---|---|---|
| eval | Generate evaluation specs with building and verification criteria. Use when setting up features, defining acceptance criteria, or before implementing anything significant. Triggers on "create evals", "set up verification", "define acceptance criteria", or "build [feature]". | Read, Grep, Glob, Write, Edit |
Eval Skill
Generate specs that define what to build and how to verify it.
Output
I create .claude/evals/<name>.yaml with two sections:
- building_spec — What the builder agent implements
- verification_spec — What the verifier agent checks
Format
name: feature-name
description: One-line summary

building_spec:
  description: What to build
  requirements:
    - Requirement 1
    - Requirement 2
  constraints:
    - Constraint 1
  files:
    - suggested/file/paths.ts

test_output:
  framework: pytest | vitest | jest
  path: tests/generated/

verification_spec:
  # Deterministic checks
  - type: command
    run: "npm test"
    expect: exit_code 0

  # Agent checks
  - type: agent
    name: check-name
    prompt: |
      What to verify
    evidence:
      - screenshot: name
      - url: contains "pattern"
    generate_test: true
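A structural check can catch a malformed spec before any agent runs. The sketch below assumes the YAML file has already been parsed into a plain dict; the function name `validate_spec`, the set of required keys, and the rule that agent checks need a name are illustrative assumptions drawn from the format above, not part of the skill itself.

```python
from typing import Any

# Assumed required keys, taken from the format shown above.
REQUIRED_TOP_LEVEL = ("name", "description", "building_spec", "verification_spec")
# Check types listed in this document's "Check Types" section.
KNOWN_CHECK_TYPES = {"command", "agent", "file-exists", "file-contains", "file-not-contains"}


def validate_spec(spec: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the spec looks well-formed."""
    problems: list[str] = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in spec:
            problems.append(f"missing top-level key: {key}")

    checks = spec.get("verification_spec", [])
    if not isinstance(checks, list):
        problems.append("verification_spec must be a list of checks")
        checks = []

    for i, check in enumerate(checks):
        ctype = check.get("type") if isinstance(check, dict) else None
        if ctype not in KNOWN_CHECK_TYPES:
            problems.append(f"check {i}: unknown type {ctype!r}")
        elif ctype == "agent" and not check.get("name"):
            # Hypothetical rule: the name is used for evidence and test naming.
            problems.append(f"check {i}: agent checks need a name")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a single pass report everything that needs fixing.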
Workflow
User Request
Create evals for user authentication
My Questions
Before generating, I ask:
- What auth method? (email/password, OAuth, magic link?)
- UI, API, or both?
- Specific security requirements?
My Output
.claude/evals/auth.yaml:
name: auth
description: Email/password authentication with UI and API

building_spec:
  description: |
    User authentication system with email/password.
    Secure password storage, JWT tokens, login/signup flows.
  requirements:
    - Password hashing with bcrypt (cost factor 12+)
    - JWT tokens with 24h expiry
    - POST /api/auth/login endpoint
    - POST /api/auth/signup endpoint
    - Login page at /login
    - Signup page at /signup
    - Protected route middleware
  constraints:
    - No plaintext passwords anywhere
    - Tokens must be httpOnly cookies or secure headers
  files:
    - src/auth/password.ts
    - src/auth/jwt.ts
    - src/auth/middleware.ts
    - src/routes/auth.ts
    - src/pages/login.tsx
    - src/pages/signup.tsx

test_output:
  framework: pytest
  path: tests/generated/

verification_spec:
  # --- Deterministic ---
  - type: command
    run: "npm test -- --grep auth"
    expect: exit_code 0

  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt"

  - type: file-not-contains
    path: src/
    pattern: "password.*=.*plaintext"

  # --- Agent: API ---
  - type: agent
    name: api-login
    prompt: |
      Test login API:
      1. POST /api/auth/signup with new user
      2. Verify 201 response
      3. POST /api/auth/login with same creds
      4. Verify 200 with JWT token
      5. POST /api/auth/login with wrong password
      6. Verify 401 with helpful message
    evidence:
      - response: status 201
      - response: status 200
      - response: has "token"
      - response: status 401
    generate_test: true

  # --- Agent: UI ---
  - type: agent
    name: ui-login
    prompt: |
      Test login UI:
      1. Go to /login
      2. Verify form has email + password fields
      3. Submit with valid credentials
      4. Verify redirect to /dashboard
      5. Verify welcome message visible
    evidence:
      - screenshot: login-page
      - screenshot: after-login
      - url: contains "/dashboard"
      - element: '[data-testid="welcome"]'
    generate_test: true

  # --- Agent: Security ---
  - type: agent
    name: password-security
    prompt: |
      Verify password security:
      1. Read src/auth/password.ts
      2. Confirm bcrypt with cost >= 12
      3. Confirm no password logging
      4. Check signup doesn't echo password
    evidence:
      - text: "bcrypt"
      - text: '"cost" or "rounds"'
    generate_test: false  # Code review, not repeatable test
Check Types
Deterministic
- type: command
  run: "shell command"
  expect: exit_code 0

- type: command
  run: "curl localhost:3000/health"
  expect:
    contains: '"ok"'

- type: file-exists
  path: src/file.ts

- type: file-contains
  path: src/file.ts
  pattern: "regex pattern"

- type: file-not-contains
  path: src/file.ts
  pattern: "bad pattern"
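To make the semantics of the deterministic checks concrete, here is a minimal sketch of how a verifier might evaluate them. It is not the skill's actual implementation: the function name `run_deterministic_check`, matching `contains` against stdout only, and searching every file when `path` is a directory (as in the `src/` example earlier) are all assumptions.

```python
import re
import subprocess
from pathlib import Path


def run_deterministic_check(check: dict, root: Path = Path(".")) -> bool:
    """Evaluate one deterministic check (already parsed from YAML); True means pass."""
    ctype = check["type"]

    if ctype == "command":
        result = subprocess.run(
            check["run"], shell=True, capture_output=True, text=True, cwd=root
        )
        expect = check.get("expect", "exit_code 0")
        if isinstance(expect, dict):
            # Form `expect: {contains: "..."}`: substring match on stdout (assumed).
            return expect["contains"] in result.stdout
        # Form `expect: exit_code N`: compare against the process exit code.
        return result.returncode == int(str(expect).split()[-1])

    if ctype not in {"file-exists", "file-contains", "file-not-contains"}:
        raise ValueError(f"unknown check type: {ctype}")

    target = root / check["path"]
    if ctype == "file-exists":
        return target.exists()

    # file-contains / file-not-contains: one file, or every file under a directory.
    files = list(target.rglob("*")) if target.is_dir() else [target]
    pattern = re.compile(check["pattern"])
    found = any(
        pattern.search(f.read_text(errors="ignore")) for f in files if f.is_file()
    )
    return found if ctype == "file-contains" else not found
```

Because these checks need no model call, they are cheap enough to run on every iteration before any agent check starts.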
Agent
- type: agent
  name: descriptive-name  # Used for evidence/test naming
  prompt: |
    Step-by-step verification
  evidence:
    - screenshot: step-name
    - url: contains "pattern"
    - element: "css-selector"
    - text: "expected text"
    - response: status 200
    - response: has "field"
  generate_test: true | false
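Several evidence entries follow a small grammar (`contains "..."`, `status N`, `has "..."`) that can be matched mechanically once the agent has recorded what it saw. The sketch below illustrates one possible reading of that grammar; the `Observation` record and the `evidence_satisfied` function are hypothetical, and `element` entries are left out because they need a live browser.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """Hypothetical record of what an agent saw while running one check."""
    url: str = ""
    page_text: str = ""
    status: Optional[int] = None
    body: dict = field(default_factory=dict)
    screenshots: set = field(default_factory=set)


def evidence_satisfied(item: dict, seen: Observation) -> bool:
    """Check one evidence entry, e.g. {"url": 'contains "/dashboard"'}."""
    kind, value = next(iter(item.items()))  # each entry has a single key
    value = str(value)
    quoted = re.search(r'"([^"]*)"', value)  # text inside double quotes, if any

    if kind == "screenshot":
        return value in seen.screenshots
    if kind == "url" and value.startswith("contains") and quoted:
        return quoted.group(1) in seen.url
    if kind == "text":
        return value in seen.page_text
    if kind == "response" and value.startswith("status"):
        return seen.status == int(value.split()[1])
    if kind == "response" and value.startswith("has") and quoted:
        return quoted.group(1) in seen.body
    raise ValueError(f"unsupported evidence entry: {item}")
```

For example, `evidence_satisfied({"response": 'has "token"'}, Observation(body={"token": "abc"}))` passes because the recorded response body has a `token` field.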
Best Practices
Building Spec
- Be specific — "bcrypt with cost 12" not "secure passwords"
- List files — helps builder know where to put code
- State constraints — what NOT to do matters
Verification Spec
- Deterministic first — fast, reliable checks
- Agent for semantics — UI flows, code quality, error messages
- Evidence always — no claim without proof
- generate_test for repeatables — UI flows yes, code review no
Naming
- name: feature-name (lowercase, hyphens)
- name: api-login (for agent checks, descriptive)
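The naming rule is easy to enforce with a regular expression. This is a minimal sketch, assuming that digits are allowed and that hyphens may not lead, trail, or repeat; the source only states "lowercase, hyphens", so those details are guesses.

```python
import re

# Assumed convention: lowercase alphanumeric words joined by single hyphens.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """True when the whole string follows the lowercase-hyphen convention."""
    return NAME_RE.fullmatch(name) is not None
```

Under these assumptions `auth` and `api-login` are accepted, while `API-Login` and `api_login` are rejected.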
What Happens Next
After I create the spec:
/eval build auth
- Builder agent reads building_spec, implements
- Verifier agent reads verification_spec, checks
- If fail → builder gets feedback → fixes → verifier re-checks
- Loop until pass
- Agent checks become tests in tests/generated/
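The build and verify loop described above can be sketched as a small control function. The `build` and `verify` callables stand in for the builder and verifier agents and are hypothetical, as is the `max_rounds` cap: the source says to loop until pass, and the cap is added here only so the sketch always terminates.

```python
from typing import Callable


def build_until_pass(
    build: Callable[[list[str]], None],
    verify: Callable[[], list[str]],
    max_rounds: int = 5,
) -> bool:
    """Alternate builder and verifier until the verifier reports no failures."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        build(feedback)      # first round: implement; later rounds: fix failures
        feedback = verify()  # verifier returns one message per failed check
        if not feedback:
            return True      # every check passed
    return False             # gave up after max_rounds (assumed safety cap)
```

Passing the previous round's failure messages into `build` is what carries the verifier's feedback back to the builder.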