No description
Find a file
2026-01-14 13:58:05 -08:00
agents iterate 2026-01-14 00:11:59 -08:00
commands iterate 2026-01-14 00:11:59 -08:00
skills implement 2026-01-14 13:58:05 -08:00
.gitignore init 2026-01-14 00:07:28 -08:00
install.sh install 2026-01-14 00:15:29 -08:00
README.md install 2026-01-14 00:15:29 -08:00

eval-skill

curl -fsSL https://raw.githubusercontent.com/harivansh-afk/eval-skill/main/install.sh | bash

Verification-first development for Claude Code. Define what success looks like, then let Claude build and verify.

Why

"How will the agent know it did the right thing?"

Without a feedback loop, Claude implements and hopes. With one, Claude implements, checks, and iterates until it's right.

How It Works

You: "Build auth with email/password"
        │
        ▼
┌─────────────────────────────────────┐
│  Skill: eval                        │
│  Generates:                         │
│    • verification spec (tests)      │
│    • building spec (what to build)  │
└─────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────┐
│  Agent: builder                     │
│  Implements from building spec      │
│  Clean context, focused on code     │
└─────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────┐
│  Agent: verifier                    │
│  Runs checks, collects evidence     │
│  Returns pass/fail                  │
└─────────────────────────────────────┘
        │
        ▼
    Pass? Done.
    Fail? → Builder fixes → Verifier checks → Loop

Each agent has isolated context. Builder doesn't hold verification logic. Verifier doesn't hold implementation details. Clean, focused, efficient.

Install

git clone https://github.com/yourusername/eval-skill.git
cd eval-skill
./install.sh           # Current project
./install.sh --global  # All projects

Usage

Step 1: Create Specs

Create evals for user authentication with email/password

Creates .claude/evals/auth.yaml:

name: auth

building_spec:
  description: Email/password auth with login/signup
  requirements:
    - Password hashing with bcrypt
    - JWT tokens on login
    - /login and /signup endpoints

verification_spec:
  - type: command
    run: "npm test -- --grep auth"
    expect: exit_code 0
    
  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt"
    
  - type: agent
    name: login-flow
    prompt: |
      1. POST /api/login with valid creds
      2. Verify JWT in response
      3. POST with wrong password
      4. Verify 401 + helpful error
    generate_test: true

Step 2: Build

/eval build auth

Spawns builder agent → implements → spawns verifier → checks → iterates until pass.

Step 3: Run Generated Tests (Forever)

pytest tests/generated/

Agent checks become deterministic tests. First run costs tokens. Future runs are free.

Commands

Command What it does
/eval list List all evals
/eval show <name> Display spec
/eval build <name> Build + verify loop
/eval verify <name> Just verify, no build

Why Context Isolation Matters

Without isolation:

Main Claude context:
  - All verification logic
  - All implementation code  
  - All error history
  - Context bloat → degraded performance

With isolation:

Builder context: building spec + current failure only
Verifier context: verification spec + current code only
Main Claude: just orchestration

Each agent gets exactly what it needs. Nothing more.

Check Types

Deterministic (fast, no agent):

- type: command
  run: "npm test"
  expect: exit_code 0
  
- type: file-contains
  path: src/auth.ts
  pattern: "bcrypt"

Agent (semantic, generates tests):

- type: agent
  name: ui-login
  prompt: "Navigate to /login, submit form, verify redirect"
  evidence:
    - screenshot: after-login
    - url: contains "/dashboard"
  generate_test: true

Agent checks produce evidence (screenshots, responses) and become executable tests.

Directory Structure

.claude/
├── skills/eval/       # Generates specs
├── agents/
│   ├── eval-builder.md
│   └── eval-verifier.md
├── commands/eval.md
└── evals/
    ├── auth.yaml
    └── .evidence/     # Screenshots, logs

tests/generated/       # Tests from agent checks

Requirements

  • Claude Code
  • For UI testing: npm install -g @anthropic/agent-browser

License

MIT