# eval-skill

```bash
curl -fsSL https://raw.githubusercontent.com/harivansh-afk/eval-skill/main/install.sh | bash
```

Verification-first development for Claude Code. Define what success looks like, then let Claude build and verify.

## Why

> *"How will the agent know it did the right thing?"*

Without a feedback loop, Claude implements and hopes. With one, Claude implements, checks, and iterates until it's right.

## How It Works

```
You: "Build auth with email/password"
            │
            ▼
┌─────────────────────────────────────┐
│  Skill: eval                        │
│  Generates:                         │
│    • verification spec (tests)      │
│    • building spec (what to build)  │
└─────────────────────────────────────┘
            │
            ▼
┌─────────────────────────────────────┐
│  Agent: builder                     │
│  Implements from building spec      │
│  Clean context, focused on code     │
└─────────────────────────────────────┘
            │
            ▼
┌─────────────────────────────────────┐
│  Agent: verifier                    │
│  Runs checks, collects evidence     │
│  Returns pass/fail                  │
└─────────────────────────────────────┘
            │
            ▼
    Pass? Done.
    Fail? → Builder fixes → Verifier checks → Loop
```

Each agent has isolated context. The builder doesn't hold verification logic, and the verifier doesn't hold implementation details. Clean, focused, efficient.

## Install

```bash
git clone https://github.com/harivansh-afk/eval-skill.git
cd eval-skill

./install.sh            # Current project
./install.sh --global   # All projects
```

## Usage

### Step 1: Create Specs

```
Create evals for user authentication with email/password
```

Creates `.claude/evals/auth.yaml`:

```yaml
name: auth

building_spec:
  description: Email/password auth with login/signup
  requirements:
    - Password hashing with bcrypt
    - JWT tokens on login
    - /login and /signup endpoints

verification_spec:
  - type: command
    run: "npm test -- --grep auth"
    expect: exit_code 0

  - type: file-contains
    path: src/auth/password.ts
    pattern: "bcrypt"

  - type: agent
    name: login-flow
    prompt: |
      1. POST /api/login with valid creds
      2. Verify JWT in response
      3. POST with wrong password
      4. Verify 401 + helpful error
    generate_test: true
```

### Step 2: Build

```
/eval build auth
```

Spawns the builder agent → implements → spawns the verifier → checks → iterates until every check passes.

### Step 3: Run Generated Tests (Forever)

```bash
pytest tests/generated/
```

Agent checks become deterministic tests. The first run costs tokens. Future runs are free.

## Commands

| Command | What it does |
|---------|--------------|
| `/eval list` | List all evals |
| `/eval show <name>` | Display spec |
| `/eval build <name>` | Build + verify loop |
| `/eval verify <name>` | Just verify, no build |

## Why Context Isolation Matters

**Without isolation:**

```
Main Claude context:
  - All verification logic
  - All implementation code
  - All error history
  - Context bloat → degraded performance
```

**With isolation:**

```
Builder context:   building spec + current failure only
Verifier context:  verification spec + current code only
Main Claude:       just orchestration
```

Each agent gets exactly what it needs. Nothing more.

## Check Types

**Deterministic** (fast, no agent):

```yaml
- type: command
  run: "npm test"
  expect: exit_code 0

- type: file-contains
  path: src/auth.ts
  pattern: "bcrypt"
```

**Agent** (semantic, generates tests):

```yaml
- type: agent
  name: ui-login
  prompt: "Navigate to /login, submit form, verify redirect"
  evidence:
    - screenshot: after-login
    - url: contains "/dashboard"
  generate_test: true
```

Agent checks produce evidence (screenshots, responses) and become executable tests.

## Directory Structure

```
.claude/
├── skills/eval/           # Generates specs
├── agents/
│   ├── eval-builder.md
│   └── eval-verifier.md
├── commands/eval.md
└── evals/
    ├── auth.yaml
    └── .evidence/         # Screenshots, logs

tests/generated/           # Tests from agent checks
```

## Requirements

- Claude Code
- For UI testing: `npm install -g @anthropic/agent-browser`

## License

MIT
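
## Appendix: Sketch of the Deterministic Checks

To make the two deterministic check types from "Check Types" concrete, here is a rough Python sketch of how a `command` and a `file-contains` check could be evaluated. It is not the skill's actual implementation (the verifier is a Claude agent, not this script). The names `run_check` and `run_all` are hypothetical, and two details are assumptions: that `expect` always has the form `exit_code <int>`, and that `pattern` is treated as a regular expression.

```python
import re
import subprocess
from pathlib import Path


def run_check(check: dict, root: Path = Path(".")) -> tuple[bool, str]:
    """Evaluate one deterministic check and return (passed, evidence)."""
    kind = check["type"]

    if kind == "command":
        # Assumption: `expect` always looks like "exit_code <int>".
        expected = int(check["expect"].split()[1])
        proc = subprocess.run(
            check["run"], shell=True, cwd=root, capture_output=True, text=True
        )
        return proc.returncode == expected, f"exit_code {proc.returncode}"

    if kind == "file-contains":
        path = root / check["path"]
        if not path.is_file():
            return False, f"{check['path']} not found"
        # Assumption: `pattern` is a regular expression, not a plain substring.
        found = re.search(check["pattern"], path.read_text()) is not None
        state = "found" if found else "missing"
        return found, f"pattern {state} in {check['path']}"

    # Agent checks need a model, so this sketch does not handle them.
    return False, f"unsupported check type: {kind}"


def run_all(checks: list[dict], root: Path = Path(".")) -> bool:
    """Run every check, print one line of evidence each, return overall pass."""
    results = [run_check(check, root) for check in checks]
    for check, (passed, evidence) in zip(checks, results):
        print(f"{'PASS' if passed else 'FAIL'} {check['type']}: {evidence}")
    return all(passed for passed, _ in results)
```

Calling `run_all` with the parsed `verification_spec` list from an eval file would give a single pass/fail result plus one evidence line per check, which mirrors what the verifier is described as returning.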