
---
description: Run eval commands - list, show, or verify evals
argument-hint: list | show <name> | verify [name]
allowed-tools: Read, Bash, Task
---

# /eval Command

Interface for the eval system. I dispatch to the right action.

## Commands

### /eval list

List all eval specs:

```bash
echo "Available evals:"
echo ""
for f in .claude/evals/*.yaml; do
  if [ -f "$f" ]; then
    name=$(basename "$f" .yaml)
    desc=$(grep "^description:" "$f" | head -1 | sed 's/description: *//')
    printf "  %-20s %s\n" "$name" "$desc"
  fi
done
```

If no evals exist:

```
No evals found in .claude/evals/

Create evals by asking: "Create evals for [feature]"
```

### /eval show <name>

Display an eval spec:

```bash
cat ".claude/evals/$1.yaml"
```

### /eval verify [name]

Run verification. This spawns the eval-verifier subagent.

With a name specified (`/eval verify auth`), delegate to the eval-verifier agent:

```
Run the eval-verifier agent to verify .claude/evals/auth.yaml
```

The agent should:
1. Read the eval spec
2. Run all checks in the verify list
3. Collect evidence for agent checks
4. Generate tests where generate_test: true
5. Report results with evidence
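The spec schema itself is not shown in this file. As a rough, hypothetical sketch only — the field names `verify`, `type`, `run`, `prompt`, and the check types are assumptions inferred from the steps above and the sample output at the end — a spec might look like:

```yaml
# Hypothetical eval spec -- field names are assumptions, not the real schema.
description: Email/password authentication with UI and API
verify:
  - type: command            # deterministic check: shell command must exit 0
    run: npm test -- --grep 'auth'
  - type: file-contains      # deterministic check: pattern must appear in the file
    file: src/auth/password.ts
    pattern: bcrypt
  - type: agent              # agent check: the verifier collects its own evidence
    name: api-login
    prompt: Log in via the API and confirm a JWT is returned
    generate_test: true      # ask the verifier to emit a test file
```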

Without a name (`/eval verify`), run all evals:

```
Run the eval-verifier agent to verify all evals in .claude/evals/
```

For each .yaml file:
1. Read the eval spec
2. Run all checks
3. Collect evidence
4. Generate tests
5. Report results

Summarize overall results at the end.
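The enumeration step can be sketched in shell. The real work (running checks, collecting evidence) happens inside the subagent, so this sketch only queues spec names and prints a count; `count_specs` and the "Specs queued" summary line are hypothetical, not part of the command:

```shell
# Hypothetical helper: enumerate the eval specs the verifier would process.
count_specs() {
  dir="${1:-.claude/evals}"
  total=0
  for f in "$dir"/*.yaml; do
    [ -f "$f" ] || continue   # skip the literal pattern when the glob matches nothing
    echo "Verifying: $(basename "$f" .yaml)"
    total=$((total + 1))
  done
  echo "Specs queued: $total"  # stand-in for the agent's overall summary
}
```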

### /eval evidence <name>

Show collected evidence for an eval:

```bash
echo "Evidence for: $1"
echo ""
if [ -f ".claude/evals/.evidence/$1/evidence.json" ]; then
  cat ".claude/evals/.evidence/$1/evidence.json"
else
  echo "No evidence collected yet. Run: /eval verify $1"
fi
```
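The layout of `evidence.json` is not specified here. A plausible, purely hypothetical shape — consistent with the screenshot counts and generated-test paths in the sample output below, but every key name an assumption — might be:

```json
{
  "_comment": "Hypothetical layout -- the real schema is defined by the eval-verifier agent",
  "eval": "auth",
  "checks": [
    {
      "name": "ui-login",
      "status": "pass",
      "screenshots": ["login.png", "dashboard.png"],
      "generated_test": "tests/generated/test_auth_ui_login.py"
    }
  ]
}
```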

### /eval tests

List generated tests:

```bash
echo "Generated tests:"
echo ""
if [ -d "tests/generated" ]; then
  ls -la tests/generated/
else
  echo "No tests generated yet."
fi
```

### /eval clean

Clean evidence and generated tests:

```bash
rm -rf .claude/evals/.evidence/
rm -rf tests/generated/
echo "Cleaned evidence and generated tests."
```

## Workflow

1. Create eval spec
   > Create evals for user authentication

2. List evals
   > /eval list
   
3. Show a specific eval
   > /eval show auth

4. Run verification
   > /eval verify auth
   
5. Check evidence
   > /eval evidence auth

6. Run generated tests
   > pytest tests/generated/

## Output Examples

### /eval list

```
Available evals:

  auth                 Email/password authentication with UI and API
  todo-api             REST API for todo management
  checkout             E-commerce checkout flow
```

### /eval verify auth

```
🔍 Eval: auth
═══════════════════════════════════════

Deterministic Checks:
  ✅ command: npm test -- --grep 'auth' (exit 0)
  ✅ file-contains: bcrypt in password.ts

Agent Checks:
  ✅ api-login: JWT returned correctly
     📄 Test: tests/generated/test_auth_api_login.py
  ✅ ui-login: Dashboard redirect works
     📸 Evidence: 2 screenshots
     📄 Test: tests/generated/test_auth_ui_login.py

═══════════════════════════════════════
📊 Results: 4/4 passed
```