Harivansh Rathi 2026-01-14 00:11:59 -08:00
parent aca2126c88
commit 7c63331389
5 changed files with 520 additions and 664 deletions


---
description: Eval commands - list, show, build, verify
argument-hint: list | show <name> | build <name> | verify <name>
allowed-tools: Read, Bash, Task
---
# /eval Command
Interface for the eval system. I dispatch to the right action.
## Commands
### /eval list
List all evals:
```
Available evals:
  auth       Email/password authentication
  checkout   E-commerce checkout flow
```
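Under the hood this is a scan of `.claude/evals/*.yaml`, where each spec carries a one-line `description:`. A minimal Python sketch of the listing logic:
```python
import re
from pathlib import Path

# List every eval spec with its one-line description.
for spec in sorted(Path(".claude/evals").glob("*.yaml")):
    match = re.search(r"^description: *(.*)", spec.read_text(), re.M)
    print(f"  {spec.stem:<20} {match.group(1) if match else ''}")
```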
### /eval show <name>
Display the full eval spec.
### /eval build <name>
**The main command.** Orchestrates the build → verify → fix loop.
```
/eval build auth
```
Flow:
1. Spawn **eval-builder** with building_spec
2. Builder implements, returns
3. Spawn **eval-verifier** with verification_spec
4. Verifier checks, returns pass/fail
5. If fail → spawn builder with failure context → goto 3
6. If pass → done
Output:
```
🔨 Building: auth
═══════════════════════════════════════
[Builder] Implementing...
  + src/auth/password.ts
  + src/auth/jwt.ts
  + src/routes/auth.ts
[Verifier] Checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ❌ api-login: Wrong status code
     Expected: 401 on bad password
     Actual: 500
[Builder] Fixing api-login...
  ~ src/routes/auth.ts
[Verifier] Re-checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ✅ api-login: Correct responses
     📄 Test: tests/generated/test_auth_api_login.py
═══════════════════════════════════════
📊 Build complete: 3/3 checks passed
   Iterations: 2
   Tests generated: 1
```
### /eval verify <name>
Just verify, don't build. For checking existing code.
```
/eval verify auth
```
Spawns verifier only. Reports pass/fail with evidence.
### /eval verify
Run all evals:
```
/eval verify
```
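A sketch of the fan-out, reusing the `spawn_agent` primitive from the orchestration logic below (a placeholder for spawning a subagent, not a real API):
```python
from pathlib import Path

# Verify every spec; spawn_agent is the same placeholder primitive
# used in the orchestration sketch below.
for spec in sorted(Path(".claude/evals").glob("*.yaml")):
    result = spawn_agent("eval-verifier", {"spec": str(spec)})
    print(f"{spec.stem}: {'PASS' if result.all_passed else 'FAIL'}")
```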
### /eval evidence <name>
Show collected evidence:
```
Evidence: auth
  - api-login-001.png
  - ui-login-001.png
  - evidence.json
```
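Evidence lives under `.claude/evals/.evidence/<name>/`, with an `evidence.json` manifest alongside the screenshots. A sketch of reading it (the manifest schema shown here is an assumption, not a documented format):
```python
import json
from pathlib import Path

# Load the evidence manifest for one eval; the "checks" key and
# per-check fields are assumed, not a documented schema.
manifest = json.loads(
    Path(".claude/evals/.evidence/auth/evidence.json").read_text()
)
for check in manifest.get("checks", []):
    print(check["name"], "->", check.get("screenshot", "no screenshot"))
```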
### /eval tests
List generated tests:
```
Generated tests:
  tests/generated/test_auth_api_login.py
  tests/generated/test_auth_ui_login.py
```
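A hypothetical shape for one of these, based on the api-login check above (the endpoint, port, and payload are illustrative assumptions):
```python
import requests

# Hypothetical content of tests/generated/test_auth_api_login.py.
def test_login_rejects_bad_password():
    resp = requests.post(
        "http://localhost:3000/auth/login",  # assumed endpoint
        json={"email": "user@example.com", "password": "wrong-password"},
    )
    # The api-login check expects 401 on a bad password, not 500.
    assert resp.status_code == 401
```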
### /eval clean
Remove evidence and generated tests.
## Orchestration Logic
For `/eval build`:
```python
def build_eval(name, max_iterations=5):
    # Initial build
    builder_result = spawn_agent("eval-builder", {
        "spec": f".claude/evals/{name}.yaml",
        "task": "implement"
    })
    for iteration in range(max_iterations):
        # Verify
        verifier_result = spawn_agent("eval-verifier", {
            "spec": f".claude/evals/{name}.yaml"
        })
        if verifier_result.all_passed:
            return success(verifier_result)
        # Fix failures: re-spawn the builder with only the failing checks
        builder_result = spawn_agent("eval-builder", {
            "spec": f".claude/evals/{name}.yaml",
            "task": "fix",
            "failures": verifier_result.failures
        })
    return failure("Max iterations reached")
```
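Capping the loop at five iterations keeps a persistently failing eval from spinning forever; each failed pass loops back through verification, matching the "goto 3" in the flow above.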
## Context Flow
```
Main Claude
  ├─→ Builder (context: building_spec only)
  │     └─→ Returns: files created
  ├─→ Verifier (context: verification_spec only)
  │     └─→ Returns: pass/fail + evidence
  └─→ Builder (context: building_spec + failure only)
        └─→ Returns: files fixed
```
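Continuing the orchestration sketch, the payload each spawn receives can be this narrow (the `building_spec`/`verification_spec` keys come from the flow above; the exact spec shape is an assumption):
```python
# Minimal per-agent contexts, assuming the spec splits into
# building_spec and verification_spec sections.
builder_context = {"building_spec": spec["building_spec"]}
verifier_context = {"verification_spec": spec["verification_spec"]}
fix_context = {
    "building_spec": spec["building_spec"],
    "failures": verifier_result.failures,  # only the failing checks
}
```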
Each agent gets minimal, focused context. No bloat.