Mirror of https://github.com/harivansh-afk/eval-skill.git, synced 2026-04-15 06:04:42 +00:00

Commit 7c63331389 (parent aca2126c88): iterate
5 changed files with 520 additions and 664 deletions

commands/eval.md (239 changed lines)

@@ -1,169 +1,162 @@
---
description: Eval commands - list, show, build, verify
argument-hint: list | show <name> | build <name> | verify <name>
allowed-tools: Read, Bash, Task
---

# /eval Command

Interface for the eval system. I dispatch to the right action.

## Commands

### /eval list

List all evals:

```bash
echo "Available evals:"
echo ""
for f in .claude/evals/*.yaml; do
  if [ -f "$f" ]; then
    name=$(basename "$f" .yaml)
    desc=$(grep "^description:" "$f" | head -1 | sed 's/description: *//')
    printf "  %-20s %s\n" "$name" "$desc"
  fi
done
```

Example output:

```
Available evals:
  auth        Email/password authentication
  checkout    E-commerce checkout flow
```

If no evals exist:

```
No evals found in .claude/evals/

Create evals by asking: "Create evals for [feature]"
```

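The same listing can be sketched in Python. `list_evals` and its naive line scan for a top-level `description:` key are illustrative helpers, not part of the eval system:

```python
from pathlib import Path

def list_evals(evals_dir: str = ".claude/evals") -> list[tuple[str, str]]:
    """Pair each eval spec's name with its description line, like the bash above."""
    results = []
    for path in sorted(Path(evals_dir).glob("*.yaml")):
        desc = ""
        for line in path.read_text().splitlines():
            if line.startswith("description:"):
                desc = line.split(":", 1)[1].strip()
                break
        results.append((path.stem, desc))
    return results
```

An empty or missing directory simply yields an empty list, which covers the "no evals found" case.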
### /eval show <name>

Display the full eval spec:

```bash
cat ".claude/evals/$1.yaml"
```

### /eval build <name>

**The main command.** Orchestrates the build → verify → fix loop.

```
/eval build auth
```

Flow:

1. Spawn **eval-builder** with the building_spec
2. Builder implements, returns
3. Spawn **eval-verifier** with the verification_spec
4. Verifier checks, returns pass/fail
5. If fail → spawn builder with failure context → go to step 3
6. If pass → done

Output:

```
🔨 Building: auth
═══════════════════════════════════════

[Builder] Implementing...
  + src/auth/password.ts
  + src/auth/jwt.ts
  + src/routes/auth.ts

[Verifier] Checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ❌ api-login: Wrong status code
     Expected: 401 on bad password
     Actual: 500

[Builder] Fixing api-login...
  ~ src/routes/auth.ts

[Verifier] Re-checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ✅ api-login: Correct responses
  📄 Test: tests/generated/test_auth_api_login.py

═══════════════════════════════════════
📊 Build complete: 3/3 checks passed
   Iterations: 2
   Tests generated: 1
```

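The deterministic checks in the transcript (`command`, `file-contains`) can be sketched as follows. The `check` dict shape here is a hypothetical stand-in, not the spec's actual schema:

```python
import subprocess
from pathlib import Path

def run_check(check: dict) -> bool:
    """Evaluate one deterministic check; True means the check passed."""
    if check["type"] == "command":
        # Pass iff the command exits 0, as in "npm test (exit 0)"
        return subprocess.run(check["run"], shell=True).returncode == 0
    if check["type"] == "file-contains":
        # Pass iff the target file mentions the expected text, e.g. "bcrypt"
        return check["text"] in Path(check["path"]).read_text()
    raise ValueError(f"unknown check type: {check['type']}")
```

Agent checks (like `api-login`) need a verifier agent and are not reducible to a few lines like this.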
### /eval verify <name>

Just verify, don't build. For checking existing code.

```
/eval verify auth
```

Spawns the verifier only. Reports pass/fail with evidence.

### /eval verify

Run all evals:

```
/eval verify
```

For each spec in `.claude/evals/`, the verifier:

1. Reads the eval spec
2. Runs all checks
3. Collects evidence
4. Generates tests where `generate_test: true`
5. Reports results

Summarize overall results at the end.

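The run-all loop above can be sketched minimally; `verify_all` is a hypothetical helper, and the injected `verify` callable stands in for spawning the eval-verifier on one spec:

```python
from pathlib import Path

def verify_all(evals_dir: str, verify) -> dict[str, bool]:
    """Run `verify` on every spec in the directory and summarize pass/fail."""
    results = {spec.stem: verify(spec)
               for spec in sorted(Path(evals_dir).glob("*.yaml"))}
    passed = sum(results.values())
    print(f"📊 Results: {passed}/{len(results)} evals passed")
    return results
```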
### /eval evidence <name>

Show collected evidence for an eval:

```bash
echo "Evidence for: $1"
echo ""
if [ -f ".claude/evals/.evidence/$1/evidence.json" ]; then
  cat ".claude/evals/.evidence/$1/evidence.json"
else
  echo "No evidence collected yet. Run: /eval verify $1"
fi
```

Example output:

```
Evidence: auth
  - api-login-001.png
  - ui-login-001.png
  - evidence.json
```

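A Python equivalent of the evidence lookup above. `show_evidence` is a hypothetical helper, and the layout of the `evidence.json` manifest is assumed, not specified here:

```python
import json
from pathlib import Path

def show_evidence(name: str, root: str = ".claude/evals/.evidence"):
    """List evidence files for one eval and load its manifest, if present."""
    evidence_dir = Path(root) / name
    manifest = evidence_dir / "evidence.json"
    if not manifest.exists():
        print(f"No evidence collected yet. Run: /eval verify {name}")
        return None
    print(f"Evidence: {name}")
    for f in sorted(evidence_dir.iterdir()):
        print(f"  - {f.name}")
    return json.loads(manifest.read_text())
```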
### /eval tests

List generated tests:

```bash
echo "Generated tests:"
echo ""
if [ -d "tests/generated" ]; then
  ls -la tests/generated/
else
  echo "No tests generated yet."
fi
```

Example output:

```
Generated tests:
  tests/generated/test_auth_api_login.py
  tests/generated/test_auth_ui_login.py
```

### /eval clean

Remove evidence and generated tests:

```bash
rm -rf .claude/evals/.evidence/
rm -rf tests/generated/
echo "Cleaned evidence and generated tests."
```

## Orchestration Logic

For `/eval build`:

```python
def run_build(name: str, max_iterations: int = 5):
    iteration = 0

    # Initial build
    builder_result = spawn_agent("eval-builder", {
        "spec": f".claude/evals/{name}.yaml",
        "task": "implement",
    })

    while iteration < max_iterations:
        # Verify
        verifier_result = spawn_agent("eval-verifier", {
            "spec": f".claude/evals/{name}.yaml",
        })

        if verifier_result.all_passed:
            return success(verifier_result)

        # Fix failures, passing only the failure context back to the builder
        builder_result = spawn_agent("eval-builder", {
            "spec": f".claude/evals/{name}.yaml",
            "task": "fix",
            "failures": verifier_result.failures,
        })

        iteration += 1

    return failure("Max iterations reached")
```

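The loop can be exercised end-to-end with stub agents. `run_build_loop`, `VerifierResult`, and `stub_spawn` below are test scaffolding of my own, not part of the eval system; the stub verifier fails once on `api-login` and then passes, mirroring the build transcript:

```python
from dataclasses import dataclass, field

@dataclass
class VerifierResult:
    failures: list = field(default_factory=list)

    @property
    def all_passed(self) -> bool:
        return not self.failures

def run_build_loop(spawn_agent, name: str, max_iterations: int = 5):
    """Same build → verify → fix loop, with spawn_agent injected so it can be stubbed."""
    spec = f".claude/evals/{name}.yaml"
    spawn_agent("eval-builder", {"spec": spec, "task": "implement"})
    for _ in range(max_iterations):
        result = spawn_agent("eval-verifier", {"spec": spec})
        if result.all_passed:
            return "pass", result
        spawn_agent("eval-builder", {"spec": spec, "task": "fix",
                                     "failures": result.failures})
    return "fail", "Max iterations reached"

# Stub: first verification fails on api-login, the re-check passes.
calls = []
def stub_spawn(agent, ctx):
    calls.append((agent, ctx.get("task", "verify")))
    if agent == "eval-verifier":
        checks = sum(1 for a, _ in calls if a == "eval-verifier")
        return VerifierResult(failures=[] if checks > 1 else ["api-login"])

status, _ = run_build_loop(stub_spawn, "auth")
print(status)                       # → pass
print([task for _, task in calls])  # → ['implement', 'verify', 'fix', 'verify']
```

Two iterations to green, exactly as in the "Iterations: 2" example output.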
## Workflow

```
1. Create eval spec
   > Create evals for user authentication

2. List evals
   > /eval list

3. Show specific eval
   > /eval show auth

4. Run verification
   > /eval verify auth

5. Check evidence
   > /eval evidence auth

6. Run generated tests
   > pytest tests/generated/
```

## Context Flow

```
Main Claude
  │
  ├─→ Builder (context: building_spec only)
  │     └─→ Returns: files created
  │
  ├─→ Verifier (context: verification_spec only)
  │     └─→ Returns: pass/fail + evidence
  │
  └─→ Builder (context: building_spec + failure only)
        └─→ Returns: files fixed
```

## Output Examples

### /eval list

```
Available evals:

  auth        Email/password authentication with UI and API
  todo-api    REST API for todo management
  checkout    E-commerce checkout flow
```

### /eval verify auth

```
🔍 Eval: auth
═══════════════════════════════════════

Deterministic Checks:
  ✅ command: npm test -- --grep 'auth' (exit 0)
  ✅ file-contains: bcrypt in password.ts

Agent Checks:
  ✅ api-login: JWT returned correctly
     📄 Test: tests/generated/test_auth_api_login.py
  ✅ ui-login: Dashboard redirect works
     📸 Evidence: 2 screenshots
     📄 Test: tests/generated/test_auth_ui_login.py

═══════════════════════════════════════
📊 Results: 4/4 passed
```

Each agent gets minimal, focused context. No bloat.