---
description: Eval commands - list, show, build, verify
argument-hint: list | show <name> | build <name> | verify <name>
allowed-tools: Read, Bash, Task
---
# /eval Command

## Commands
### /eval list

List all evals:

```
Available evals:
  auth       Email/password authentication
  checkout   E-commerce checkout flow
```
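Under the hood, `list` presumably just scans the spec directory used by `/eval build` below; a minimal sketch, assuming each spec carries a top-level `description` field (that field name is an assumption):

```python
from pathlib import Path
import yaml

# Illustrative: scan the spec directory referenced by the orchestration logic.
for spec_path in sorted(Path(".claude/evals").glob("*.yaml")):
    spec = yaml.safe_load(spec_path.read_text())
    print(f"{spec_path.stem:<10} {spec.get('description', '')}")
```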
### /eval show

Display the full eval spec.
### /eval build

The main command. Orchestrates the build → verify → fix loop.

```
/eval build auth
```

Flow:

1. Spawn eval-builder with building_spec
2. Builder implements, returns
3. Spawn eval-verifier with verification_spec
4. Verifier checks, returns pass/fail
5. If fail → spawn builder with failure context → go to step 3
6. If pass → done
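Both agents read the same spec file from `.claude/evals/<name>.yaml`. A minimal sketch of what `auth.yaml` might contain; the `building_spec`/`verification_spec` split and the check types (`command`, `file-contains`, `api-login`) come from this document, but the exact field layout is an assumption:

```yaml
# Hypothetical layout -- field names beyond the top-level split are guesses.
building_spec:
  goal: Email/password authentication
  notes:
    - Hash passwords with bcrypt
    - Issue JWTs on login

verification_spec:
  checks:
    - type: command
      run: npm test
      expect_exit: 0
    - type: file-contains
      pattern: bcrypt
    - type: api-login
      bad_password_status: 401
```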
Output:

```
🔨 Building: auth
═══════════════════════════════════════

[Builder] Implementing...
  + src/auth/password.ts
  + src/auth/jwt.ts
  + src/routes/auth.ts

[Verifier] Checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ❌ api-login: Wrong status code
     Expected: 401 on bad password
     Actual: 500

[Builder] Fixing api-login...
  ~ src/routes/auth.ts

[Verifier] Re-checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ✅ api-login: Correct responses
  📄 Test: tests/generated/test_auth_api_login.py

═══════════════════════════════════════
📊 Build complete: 3/3 checks passed
   Iterations: 2
   Tests generated: 1
```
### /eval verify

Just verify, don't build. Useful for checking existing code.

```
/eval verify auth
```

Spawns the verifier only. Reports pass/fail with evidence.

With no name, run all evals:

```
/eval verify
```
### /eval evidence

Show collected evidence:

```
Evidence: auth
  - api-login-001.png
  - ui-login-001.png
  - evidence.json
```
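The schema of `evidence.json` isn't documented here; a purely illustrative shape, reusing the check names and screenshot files listed above (all field names are assumptions):

```python
# Purely illustrative -- not the skill's actual evidence.json schema.
evidence = {
    "eval": "auth",
    "checks": [
        {"name": "command", "passed": True, "detail": "npm test (exit 0)"},
        {"name": "file-contains", "passed": True, "detail": "bcrypt"},
        {"name": "api-login", "passed": True, "screenshot": "api-login-001.png"},
    ],
}
```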
### /eval tests

List generated tests:

```
Generated tests:
  tests/generated/test_auth_api_login.py
  tests/generated/test_auth_ui_login.py
```
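As a sketch of what a generated test might contain: the 401-on-bad-password expectation comes from the build output above, while the endpoint path, base URL, payload fields, and use of `requests` are assumptions:

```python
# Hypothetical sketch of tests/generated/test_auth_api_login.py.
import requests

BASE_URL = "http://localhost:3000"  # assumed dev server address

def test_login_rejects_bad_password():
    resp = requests.post(
        f"{BASE_URL}/auth/login",  # assumed route
        json={"email": "user@example.com", "password": "wrong-password"},
    )
    # The eval's api-login check expects 401 (not 500) on a bad password
    assert resp.status_code == 401
```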
### /eval clean

Remove evidence and generated tests.
## Orchestration Logic

For `/eval build`:

```python
def build_eval(name: str):
    max_iterations = 5
    iteration = 0

    # Initial build
    builder_result = spawn_agent("eval-builder", {
        "spec": f".claude/evals/{name}.yaml",
        "task": "implement",
    })

    while iteration < max_iterations:
        # Verify
        verifier_result = spawn_agent("eval-verifier", {
            "spec": f".claude/evals/{name}.yaml",
        })

        if verifier_result.all_passed:
            return success(verifier_result)

        # Fix failures
        builder_result = spawn_agent("eval-builder", {
            "spec": f".claude/evals/{name}.yaml",
            "task": "fix",
            "failures": verifier_result.failures,
        })
        iteration += 1

    return failure("Max iterations reached")
```
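The loop assumes a small, structured result from each agent. The real sub-agent output is free-form, so something has to parse it; a sketch of the shapes the loop relies on (these names are illustrative, not from the skill):

```python
from dataclasses import dataclass, field

@dataclass
class CheckFailure:
    check: str      # e.g. "api-login"
    expected: str   # e.g. "401 on bad password"
    actual: str     # e.g. "500"

@dataclass
class VerifierResult:
    all_passed: bool
    failures: list[CheckFailure] = field(default_factory=list)
```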
## Context Flow

```
Main Claude
│
├─→ Builder (context: building_spec only)
│     └─→ Returns: files created
│
├─→ Verifier (context: verification_spec only)
│     └─→ Returns: pass/fail + evidence
│
└─→ Builder (context: building_spec + failure only)
      └─→ Returns: files fixed
```

Each agent gets minimal, focused context. No bloat.
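A minimal sketch of how that split might be implemented, assuming the spec's top-level keys are `building_spec` and `verification_spec` (the helper name and dict shapes here are illustrative, not part of the skill):

```python
import yaml

def agent_context(name: str, role: str, failures=None) -> dict:
    """Illustrative helper: assemble the minimal context for one sub-agent."""
    with open(f".claude/evals/{name}.yaml") as f:
        spec = yaml.safe_load(f)

    if role == "builder":
        ctx = {"spec": spec["building_spec"]}
        if failures:                 # fix pass: add only the failure details
            ctx["failures"] = failures
        return ctx

    # The verifier never sees build notes or prior failures
    return {"spec": spec["verification_spec"]}
```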