eval-skill/commands/eval.md

---
description: Eval commands - list, show, build, verify
argument-hint: list | show <name> | build <name> | verify <name>
allowed-tools: Read, Bash, Task
---

/eval Command

Commands

/eval list

List all evals:

Available evals:
  auth         Email/password authentication
  checkout     E-commerce checkout flow

/eval show

Display the full spec for a named eval:

/eval show auth

/eval build

The main command. Orchestrates build → verify → fix loop.

/eval build auth

Flow:

  1. Spawn eval-builder with building_spec
  2. Builder implements, returns
  3. Spawn eval-verifier with verification_spec
  4. Verifier checks, returns pass/fail
  5. If fail → spawn builder with failure context → goto 3
  6. If pass → done

Output:

🔨 Building: auth
═══════════════════════════════════════

[Builder] Implementing...
  + src/auth/password.ts
  + src/auth/jwt.ts
  + src/routes/auth.ts

[Verifier] Checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ❌ api-login: Wrong status code
     Expected: 401 on bad password
     Actual: 500

[Builder] Fixing api-login...
  ~ src/routes/auth.ts

[Verifier] Re-checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ✅ api-login: Correct responses
     📄 Test: tests/generated/test_auth_api_login.py

═══════════════════════════════════════
📊 Build complete: 3/3 checks passed
   Iterations: 2
   Tests generated: 1

/eval verify

Verify only, without building. For checking existing code.

/eval verify auth

Spawns the verifier only. Reports pass/fail with evidence.

Run with no name to verify all evals:

/eval verify
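The verify-all case can be sketched as a loop over the spec files. This is a minimal sketch, assuming specs live in `.claude/evals/*.yaml` as in the pseudocode below; `spawn_agent` is a hypothetical stand-in for however the orchestrator launches a subagent, assumed to return a dict with an `all_passed` flag.

```python
from pathlib import Path

def verify_all(spec_dir, spawn_agent):
    """Spawn the verifier once per eval spec; return {eval_name: passed}.

    spawn_agent(agent_name, payload) is a placeholder for the host
    orchestrator's subagent launcher (hypothetical, not a real API).
    """
    results = {}
    for spec in sorted(Path(spec_dir).glob("*.yaml")):
        verdict = spawn_agent("eval-verifier", {"spec": str(spec)})
        results[spec.stem] = verdict["all_passed"]
    return results
```

Each eval gets its own verifier run, so one failing eval never blocks the others from being checked.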

/eval evidence

Show collected evidence:

Evidence: auth
  - api-login-001.png
  - ui-login-001.png
  - evidence.json

/eval tests

List generated tests:

Generated tests:
  tests/generated/test_auth_api_login.py
  tests/generated/test_auth_ui_login.py

/eval clean

Remove evidence and generated tests.

Orchestration Logic

For /eval build:

max_iterations = 5
iteration = 0

# Initial build
builder_result = spawn_agent("eval-builder", {
    "spec": f".claude/evals/{name}.yaml",
    "task": "implement"
})

while iteration < max_iterations:
    # Verify
    verifier_result = spawn_agent("eval-verifier", {
        "spec": f".claude/evals/{name}.yaml"
    })
    
    if verifier_result.all_passed:
        return success(verifier_result)
    
    # Fix failures
    builder_result = spawn_agent("eval-builder", {
        "spec": f".claude/evals/{name}.yaml",
        "task": "fix",
        "failures": verifier_result.failures
    })
    
    iteration += 1

return failure("Max iterations reached")
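The same loop can be exercised as a self-contained sketch with `spawn_agent` injected, so the control flow is testable without a real orchestrator. Field names (`spec`, `task`, `failures`, `all_passed`) mirror the pseudocode above; the function itself and its return shape are assumptions for illustration.

```python
def build_loop(name, spawn_agent, max_iterations=5):
    """Build -> verify -> fix loop from the pseudocode above.

    Returns ("success", verdict) once all checks pass, or
    ("failure", "Max iterations reached") if the cap is hit.
    """
    spec = f".claude/evals/{name}.yaml"
    # Initial build
    spawn_agent("eval-builder", {"spec": spec, "task": "implement"})
    for _ in range(max_iterations):
        # Verify
        verdict = spawn_agent("eval-verifier", {"spec": spec})
        if verdict["all_passed"]:
            return ("success", verdict)
        # Fix failures, then re-verify on the next pass
        spawn_agent("eval-builder", {
            "spec": spec,
            "task": "fix",
            "failures": verdict["failures"],
        })
    return ("failure", "Max iterations reached")
```

Injecting `spawn_agent` keeps the loop logic separate from the agent runtime, which also makes the iteration cap easy to verify in isolation.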

Context Flow

Main Claude
    │
    ├─→ Builder (context: building_spec only)
    │   └─→ Returns: files created
    │
    ├─→ Verifier (context: verification_spec only)
    │   └─→ Returns: pass/fail + evidence
    │
    └─→ Builder (context: building_spec + failure only)
        └─→ Returns: files fixed

Each agent gets minimal, focused context. No bloat.
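The minimal contexts in the diagram can be expressed as small payload builders, a sketch using the same field names as the pseudocode above (the helper names themselves are hypothetical):

```python
def builder_context(spec_path, failures=None):
    """Builder payload: spec only, plus failures only on a fix pass."""
    ctx = {"spec": spec_path,
           "task": "fix" if failures else "implement"}
    if failures:
        ctx["failures"] = failures
    return ctx

def verifier_context(spec_path):
    """Verifier payload: the verification spec path, nothing else."""
    return {"spec": spec_path}
```

Keeping the payloads this small is what enforces the "no bloat" rule: an agent cannot leak context it was never handed.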