eval-skill/commands/eval.md

---
description: Eval commands - list, show, build, verify
argument-hint: list | show <name> | build <name> | verify <name>
allowed-tools: Read, Bash, Task
---

/eval Command

Commands

/eval list

List all evals:

Available evals:
  auth         Email/password authentication
  checkout     E-commerce checkout flow

/eval show

Display the full spec for a named eval:

/eval show auth

/eval build

The main command. Orchestrates build → verify → fix loop.

/eval build auth

Flow:

  1. Spawn eval-builder with building_spec
  2. Builder implements, returns
  3. Spawn eval-verifier with verification_spec
  4. Verifier checks, returns pass/fail
  5. If fail → spawn builder with failure context → goto 3
  6. If pass → done

Output:

🔨 Building: auth
═══════════════════════════════════════

[Builder] Implementing...
  + src/auth/password.ts
  + src/auth/jwt.ts
  + src/routes/auth.ts

[Verifier] Checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ❌ api-login: Wrong status code
     Expected: 401 on bad password
     Actual: 500

[Builder] Fixing api-login...
  ~ src/routes/auth.ts

[Verifier] Re-checking...
  ✅ command: npm test (exit 0)
  ✅ file-contains: bcrypt
  ✅ api-login: Correct responses
     📄 Test: tests/generated/test_auth_api_login.py

═══════════════════════════════════════
📊 Build complete: 3/3 checks passed
   Iterations: 2
   Tests generated: 1

/eval verify

Verify only, without building. For checking existing code.

/eval verify auth

Spawns the verifier only. Reports pass/fail with evidence.

Run with no name to verify all evals:

/eval verify
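The verify-all case can be sketched as a loop over the spec files. This is a minimal sketch, assuming specs live in `.claude/evals/*.yaml` as in the pseudocode below; `spawn_agent` is a hypothetical stand-in for however the orchestrator launches a subagent, assumed to return a dict with an `all_passed` flag.

```python
from pathlib import Path

def verify_all(spec_dir, spawn_agent):
    """Spawn the verifier once per eval spec; return {eval_name: passed}.

    spawn_agent(agent_name, payload) is a placeholder for the host
    orchestrator's subagent launcher (hypothetical, not a real API).
    """
    results = {}
    for spec in sorted(Path(spec_dir).glob("*.yaml")):
        verdict = spawn_agent("eval-verifier", {"spec": str(spec)})
        results[spec.stem] = verdict["all_passed"]
    return results
```

Each eval gets its own verifier run, so one failing eval never blocks the others from being checked.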

/eval evidence

Show collected evidence:

Evidence: auth
  - api-login-001.png
  - ui-login-001.png
  - evidence.json

/eval tests

List generated tests:

Generated tests:
  tests/generated/test_auth_api_login.py
  tests/generated/test_auth_ui_login.py

/eval clean

Remove evidence and generated tests.

Orchestration Logic

For /eval build:

max_iterations = 5
iteration = 0

# Initial build
builder_result = spawn_agent("eval-builder", {
    "spec": f".claude/evals/{name}.yaml",
    "task": "implement"
})

while iteration < max_iterations:
    # Verify
    verifier_result = spawn_agent("eval-verifier", {
        "spec": f".claude/evals/{name}.yaml"
    })
    
    if verifier_result.all_passed:
        return success(verifier_result)
    
    # Fix failures
    builder_result = spawn_agent("eval-builder", {
        "spec": f".claude/evals/{name}.yaml",
        "task": "fix",
        "failures": verifier_result.failures
    })
    
    iteration += 1

return failure("Max iterations reached")
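The same loop can be exercised as a self-contained sketch with `spawn_agent` injected, so the control flow is testable without a real orchestrator. Field names (`spec`, `task`, `failures`, `all_passed`) mirror the pseudocode above; the function itself and its return shape are assumptions for illustration.

```python
def build_loop(name, spawn_agent, max_iterations=5):
    """Build -> verify -> fix loop from the pseudocode above.

    Returns ("success", verdict) once all checks pass, or
    ("failure", "Max iterations reached") if the cap is hit.
    """
    spec = f".claude/evals/{name}.yaml"
    # Initial build
    spawn_agent("eval-builder", {"spec": spec, "task": "implement"})
    for _ in range(max_iterations):
        # Verify
        verdict = spawn_agent("eval-verifier", {"spec": spec})
        if verdict["all_passed"]:
            return ("success", verdict)
        # Fix failures, then re-verify on the next pass
        spawn_agent("eval-builder", {
            "spec": spec,
            "task": "fix",
            "failures": verdict["failures"],
        })
    return ("failure", "Max iterations reached")
```

Injecting `spawn_agent` keeps the loop logic separate from the agent runtime, which also makes the iteration cap easy to verify in isolation.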

Context Flow

Main Claude
    │
    ├─→ Builder (context: building_spec only)
    │   └─→ Returns: files created
    │
    ├─→ Verifier (context: verification_spec only)
    │   └─→ Returns: pass/fail + evidence
    │
    └─→ Builder (context: building_spec + failure only)
        └─→ Returns: files fixed

Each agent gets minimal, focused context. No bloat.
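The minimal contexts in the diagram can be expressed as small payload builders, a sketch using the same field names as the pseudocode above (the helper names themselves are hypothetical):

```python
def builder_context(spec_path, failures=None):
    """Builder payload: spec only, plus failures only on a fix pass."""
    ctx = {"spec": spec_path,
           "task": "fix" if failures else "implement"}
    if failures:
        ctx["failures"] = failures
    return ctx

def verifier_context(spec_path):
    """Verifier payload: the verification spec path, nothing else."""
    return {"spec": spec_path}
```

Keeping the payloads this small is what enforces the "no bloat" rule: an agent cannot leak context it was never handed.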