Mirror of https://github.com/harivansh-afk/eval-skill.git, synced 2026-04-15 06:04:42 +00:00

Commit 7c63331389 (parent aca2126c88): iterate
5 changed files with 520 additions and 664 deletions

commands/eval.md (239 changed lines)

@@ -1,169 +1,162 @@
---
description: Eval commands - list, show, build, verify
argument-hint: list | show <name> | build <name> | verify <name>
allowed-tools: Read, Bash, Task
---

# /eval Command

Interface for the eval system. I dispatch to the right action.

## Commands

### /eval list

List all evals:

```bash
echo "Available evals:"
echo ""
for f in .claude/evals/*.yaml; do
  if [ -f "$f" ]; then
    name=$(basename "$f" .yaml)
    desc=$(grep "^description:" "$f" | head -1 | sed 's/description: *//')
    printf "  %-20s %s\n" "$name" "$desc"
  fi
done
```

Example output:

```
Available evals:
  auth        Email/password authentication
  checkout    E-commerce checkout flow
```

If no evals exist:

```
No evals found in .claude/evals/

Create evals by asking: "Create evals for [feature]"
```

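The same listing can be sketched in Python. `list_evals` and its naive line scan for a top-level `description:` key are illustrative helpers, not part of the eval system:

```python
from pathlib import Path

def list_evals(evals_dir: str = ".claude/evals") -> list[tuple[str, str]]:
    """Pair each eval spec's name with its description line, like the bash above."""
    results = []
    for path in sorted(Path(evals_dir).glob("*.yaml")):
        desc = ""
        for line in path.read_text().splitlines():
            if line.startswith("description:"):
                desc = line.split(":", 1)[1].strip()
                break
        results.append((path.stem, desc))
    return results
```

An empty or missing directory simply yields an empty list, which covers the "no evals found" case.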
### /eval show <name>

Display the full eval spec:

```bash
cat ".claude/evals/$1.yaml"
```

### /eval build <name>

**The main command.** Orchestrates the build → verify → fix loop.

```
/eval build auth
```

Flow:

1. Spawn **eval-builder** with the building_spec
2. Builder implements, returns
3. Spawn **eval-verifier** with the verification_spec
4. Verifier checks, returns pass/fail
5. If fail → spawn builder with failure context → go to step 3
6. If pass → done

Output:

```
🔨 Building: auth
═══════════════════════════════════════

[Builder] Implementing...
  + src/auth/password.ts
  + src/auth/jwt.ts
  + src/routes/auth.ts

[Verifier] Checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ❌ api-login: Wrong status code
     Expected: 401 on bad password
     Actual: 500

[Builder] Fixing api-login...
  ~ src/routes/auth.ts

[Verifier] Re-checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ✅ api-login: Correct responses
  📄 Test: tests/generated/test_auth_api_login.py

═══════════════════════════════════════
📊 Build complete: 3/3 checks passed
   Iterations: 2
   Tests generated: 1
```

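The deterministic checks in the transcript (`command`, `file-contains`) can be sketched as follows. The `check` dict shape here is a hypothetical stand-in, not the spec's actual schema:

```python
import subprocess
from pathlib import Path

def run_check(check: dict) -> bool:
    """Evaluate one deterministic check; True means the check passed."""
    if check["type"] == "command":
        # Pass iff the command exits 0, as in "npm test (exit 0)"
        return subprocess.run(check["run"], shell=True).returncode == 0
    if check["type"] == "file-contains":
        # Pass iff the target file mentions the expected text, e.g. "bcrypt"
        return check["text"] in Path(check["path"]).read_text()
    raise ValueError(f"unknown check type: {check['type']}")
```

Agent checks (like `api-login`) need a verifier agent and are not reducible to a few lines like this.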
### /eval verify <name>

Just verify, don't build. For checking existing code.

```
/eval verify auth
```

Spawns the verifier only. Reports pass/fail with evidence.

### /eval verify

Run all evals:

```
/eval verify
```

For each spec in `.claude/evals/`, the verifier:

1. Reads the eval spec
2. Runs all checks
3. Collects evidence
4. Generates tests where `generate_test: true`
5. Reports results

Summarize overall results at the end.

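The run-all loop above can be sketched minimally; `verify_all` is a hypothetical helper, and the injected `verify` callable stands in for spawning the eval-verifier on one spec:

```python
from pathlib import Path

def verify_all(evals_dir: str, verify) -> dict[str, bool]:
    """Run `verify` on every spec in the directory and summarize pass/fail."""
    results = {spec.stem: verify(spec)
               for spec in sorted(Path(evals_dir).glob("*.yaml"))}
    passed = sum(results.values())
    print(f"📊 Results: {passed}/{len(results)} evals passed")
    return results
```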
### /eval evidence <name>

Show collected evidence for an eval:

```bash
echo "Evidence for: $1"
echo ""
if [ -f ".claude/evals/.evidence/$1/evidence.json" ]; then
  cat ".claude/evals/.evidence/$1/evidence.json"
else
  echo "No evidence collected yet. Run: /eval verify $1"
fi
```

Example output:

```
Evidence: auth
  - api-login-001.png
  - ui-login-001.png
  - evidence.json
```

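A Python equivalent of the evidence lookup above. `show_evidence` is a hypothetical helper, and the layout of the `evidence.json` manifest is assumed, not specified here:

```python
import json
from pathlib import Path

def show_evidence(name: str, root: str = ".claude/evals/.evidence"):
    """List evidence files for one eval and load its manifest, if present."""
    evidence_dir = Path(root) / name
    manifest = evidence_dir / "evidence.json"
    if not manifest.exists():
        print(f"No evidence collected yet. Run: /eval verify {name}")
        return None
    print(f"Evidence: {name}")
    for f in sorted(evidence_dir.iterdir()):
        print(f"  - {f.name}")
    return json.loads(manifest.read_text())
```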
### /eval tests

List generated tests:

```bash
echo "Generated tests:"
echo ""
if [ -d "tests/generated" ]; then
  ls -la tests/generated/
else
  echo "No tests generated yet."
fi
```

Example output:

```
Generated tests:
  tests/generated/test_auth_api_login.py
  tests/generated/test_auth_ui_login.py
```

### /eval clean

Remove evidence and generated tests:

```bash
rm -rf .claude/evals/.evidence/
rm -rf tests/generated/
echo "Cleaned evidence and generated tests."
```

## Orchestration Logic

For `/eval build`:

```python
def run_build(name: str, max_iterations: int = 5):
    iteration = 0

    # Initial build
    builder_result = spawn_agent("eval-builder", {
        "spec": f".claude/evals/{name}.yaml",
        "task": "implement",
    })

    while iteration < max_iterations:
        # Verify
        verifier_result = spawn_agent("eval-verifier", {
            "spec": f".claude/evals/{name}.yaml",
        })

        if verifier_result.all_passed:
            return success(verifier_result)

        # Fix failures, passing only the failure context back to the builder
        builder_result = spawn_agent("eval-builder", {
            "spec": f".claude/evals/{name}.yaml",
            "task": "fix",
            "failures": verifier_result.failures,
        })

        iteration += 1

    return failure("Max iterations reached")
```

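The loop can be exercised end-to-end with stub agents. `run_build_loop`, `VerifierResult`, and `stub_spawn` below are test scaffolding of my own, not part of the eval system; the stub verifier fails once on `api-login` and then passes, mirroring the build transcript:

```python
from dataclasses import dataclass, field

@dataclass
class VerifierResult:
    failures: list = field(default_factory=list)

    @property
    def all_passed(self) -> bool:
        return not self.failures

def run_build_loop(spawn_agent, name: str, max_iterations: int = 5):
    """Same build → verify → fix loop, with spawn_agent injected so it can be stubbed."""
    spec = f".claude/evals/{name}.yaml"
    spawn_agent("eval-builder", {"spec": spec, "task": "implement"})
    for _ in range(max_iterations):
        result = spawn_agent("eval-verifier", {"spec": spec})
        if result.all_passed:
            return "pass", result
        spawn_agent("eval-builder", {"spec": spec, "task": "fix",
                                     "failures": result.failures})
    return "fail", "Max iterations reached"

# Stub: first verification fails on api-login, the re-check passes.
calls = []
def stub_spawn(agent, ctx):
    calls.append((agent, ctx.get("task", "verify")))
    if agent == "eval-verifier":
        checks = sum(1 for a, _ in calls if a == "eval-verifier")
        return VerifierResult(failures=[] if checks > 1 else ["api-login"])

status, _ = run_build_loop(stub_spawn, "auth")
print(status)                       # → pass
print([task for _, task in calls])  # → ['implement', 'verify', 'fix', 'verify']
```

Two iterations to green, exactly as in the "Iterations: 2" example output.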
## Workflow

```
1. Create eval spec
   > Create evals for user authentication

2. List evals
   > /eval list

3. Show specific eval
   > /eval show auth

4. Run verification
   > /eval verify auth

5. Check evidence
   > /eval evidence auth

6. Run generated tests
   > pytest tests/generated/
```

## Context Flow

```
Main Claude
  │
  ├─→ Builder (context: building_spec only)
  │     └─→ Returns: files created
  │
  ├─→ Verifier (context: verification_spec only)
  │     └─→ Returns: pass/fail + evidence
  │
  └─→ Builder (context: building_spec + failure only)
        └─→ Returns: files fixed
```

## Output Examples

### /eval list

```
Available evals:

  auth        Email/password authentication with UI and API
  todo-api    REST API for todo management
  checkout    E-commerce checkout flow
```

### /eval verify auth

```
🔍 Eval: auth
═══════════════════════════════════════

Deterministic Checks:
  ✅ command: npm test -- --grep 'auth' (exit 0)
  ✅ file-contains: bcrypt in password.ts

Agent Checks:
  ✅ api-login: JWT returned correctly
     📄 Test: tests/generated/test_auth_api_login.py
  ✅ ui-login: Dashboard redirect works
     📸 Evidence: 2 screenshots
     📄 Test: tests/generated/test_auth_ui_login.py

═══════════════════════════════════════
📊 Results: 4/4 passed
```

Each agent gets minimal, focused context. No bloat.