---
description: Eval commands - list, show, build, verify
argument-hint: list | show | build | verify
allowed-tools: Read, Bash, Task
---

# /eval Command

## Commands

### /eval list

List all evals:

```
Available evals:
  auth        Email/password authentication
  checkout    E-commerce checkout flow
```

### /eval show

Display the full eval spec.

### /eval build

**The main command.** Orchestrates the build → verify → fix loop.

```
/eval build auth
```

Flow (spelled out in Orchestration Logic below):

1. Spawn **eval-builder** with building_spec
2. Builder implements, returns
3. Spawn **eval-verifier** with verification_spec
4. Verifier checks, returns pass/fail
5. If fail → spawn builder with failure context → goto 3
6. If pass → done

Output:

```
🔨 Building: auth
═══════════════════════════════════════

[Builder] Implementing...
  + src/auth/password.ts
  + src/auth/jwt.ts
  + src/routes/auth.ts

[Verifier] Checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ❌ api-login: Wrong status code
     Expected: 401 on bad password
     Actual: 500

[Builder] Fixing api-login...
  ~ src/routes/auth.ts

[Verifier] Re-checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ✅ api-login: Correct responses
  📄 Test: tests/generated/test_auth_api_login.py

═══════════════════════════════════════
📊 Build complete: 3/3 checks passed
   Iterations: 2
   Tests generated: 1
```

### /eval verify

Verify only, without building. Useful for checking existing code.

```
/eval verify auth
```

Spawns the verifier only and reports pass/fail with evidence.

Run with no argument to verify all evals:

```
/eval verify
```

### /eval evidence

Show collected evidence:

```
Evidence: auth
  - api-login-001.png
  - ui-login-001.png
  - evidence.json
```

### /eval tests

List generated tests:

```
Generated tests:
  tests/generated/test_auth_api_login.py
  tests/generated/test_auth_ui_login.py
```

### /eval clean

Remove collected evidence and generated tests.

## Orchestration Logic

For `/eval build`:

```python
def build_eval(name: str):
    max_iterations = 5

    # Initial build: the builder sees only the building_spec
    spawn_agent("eval-builder", {
        "spec": f".claude/evals/{name}.yaml",
        "task": "implement",
    })

    for _ in range(max_iterations):
        # Verify: the verifier sees only the verification_spec
        verifier_result = spawn_agent("eval-verifier", {
            "spec": f".claude/evals/{name}.yaml",
        })

        if verifier_result.all_passed:
            return success(verifier_result)

        # Fix: the builder gets the failing checks as extra context
        spawn_agent("eval-builder", {
            "spec": f".claude/evals/{name}.yaml",
            "task": "fix",
            "failures": verifier_result.failures,
        })

    return failure("Max iterations reached")
```

## Context Flow

```
Main Claude
│
├─→ Builder (context: building_spec only)
│   └─→ Returns: files created
│
├─→ Verifier (context: verification_spec only)
│   └─→ Returns: pass/fail + evidence
│
└─→ Builder (context: building_spec + failure only)
    └─→ Returns: files fixed
```

Each agent gets minimal, focused context. No bloat.
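
A minimal sketch of how those focused contexts could be assembled, assuming (this is an assumption, not the actual implementation) that the eval YAML carries top-level `building_spec` and `verification_spec` keys and that the `spawn_agent` primitive above accepts a plain dict:

```python
# Sketch only, not the actual implementation. Assumes PyYAML and a spec
# layout with top-level `building_spec` and `verification_spec` keys.
import yaml

def load_spec(name: str) -> dict:
    with open(f".claude/evals/{name}.yaml") as f:
        return yaml.safe_load(f)

def builder_context(name: str, failures: list | None = None) -> dict:
    """Context for eval-builder: building_spec, plus failures on a fix pass."""
    context = {
        "spec": load_spec(name)["building_spec"],  # no verification checks
        "task": "fix" if failures else "implement",
    }
    if failures:
        context["failures"] = failures  # only the failing checks, no history
    return context

def verifier_context(name: str) -> dict:
    """Context for eval-verifier: verification_spec and nothing else."""
    return {"spec": load_spec(name)["verification_spec"]}
```

The split is the point: the builder never sees the checks it will be graded against, and the verifier judges against the checks alone.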