evaluclaude-harness/docs/05-llm-rubric-graders.md
2026-01-11 16:58:40 -05:00

5. LLM Rubric Graders - System Design

Priority: 🟢 MEDIUM — Subjective quality layer
Complexity: Medium
Effort Estimate: 4-6 hours


Overview

LLM Rubric Graders use Claude to evaluate subjective quality that deterministic tests can't measure:

  • Code readability
  • Error message helpfulness
  • Documentation quality
  • API design consistency

These complement functional tests with human-like judgment.


Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      LLM Grading Pipeline                       │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐      │
│  │    Output    │───▶│   Rubric     │───▶│   Grading    │      │
│  │   (code/     │    │   + Claude   │    │   Result     │      │
│  │    text)     │    │              │    │              │      │
│  └──────────────┘    └──────────────┘    └──────────────┘      │
│                            │                                    │
│                    Uses Promptfoo                               │
│                    llm-rubric assertion                         │
└─────────────────────────────────────────────────────────────────┘

Core Types

interface Rubric {
  name: string;
  description: string;
  criteria: RubricCriterion[];
  passingThreshold: number;  // 0-1
}

interface RubricCriterion {
  name: string;
  description: string;
  weight: number;            // Relative weight
  examples?: {
    good: string;
    bad: string;
  };
}

interface RubricGradingResult {
  pass: boolean;
  score: number;             // 0-1
  reason: string;
  criterionScores: {
    name: string;
    score: number;
    feedback: string;
  }[];
}
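
Because `weight` is described as relative, one natural way to arrive at the overall `score` is a weighted mean of the per-criterion scores, divided by the total weight so the weights do not have to sum to exactly 1. The sketch below illustrates that in Python, the language of the custom grader in this document. The function name `aggregate_scores` and the dict shapes are assumptions made for illustration, not part of the harness.

```python
def aggregate_scores(criteria, criterion_scores, passing_threshold):
    """Combine per-criterion scores into an overall score and pass flag.

    criteria: list of {"name": str, "weight": float}
    criterion_scores: dict mapping criterion name -> score in [0, 1]
    """
    total_weight = sum(c["weight"] for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")

    missing = [c["name"] for c in criteria if c["name"] not in criterion_scores]
    if missing:
        raise ValueError(f"no score for criteria: {missing}")

    # Weighted mean; dividing by total_weight keeps the result in [0, 1]
    # even when the weights are not normalized.
    weighted = sum(c["weight"] * criterion_scores[c["name"]] for c in criteria)
    score = weighted / total_weight
    return {"pass": score >= passing_threshold, "score": score}


# Weights taken from the code-quality rubric; the scores are made up.
criteria = [
    {"name": "readability", "weight": 0.3},
    {"name": "correctness", "weight": 0.4},
    {"name": "efficiency", "weight": 0.2},
    {"name": "maintainability", "weight": 0.1},
]
result = aggregate_scores(
    criteria,
    {"readability": 0.9, "correctness": 0.8, "efficiency": 0.5, "maintainability": 0.6},
    passing_threshold=0.7,
)
```

Computing the overall score locally, rather than trusting a single number reported by the model, keeps the weights in the rubric authoritative.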

Rubric Examples

Code Quality Rubric (rubrics/code-quality.yaml)

name: code-quality
description: Evaluates generated code for quality and maintainability
passingThreshold: 0.7

criteria:
  - name: readability
    weight: 0.3
    description: Code is easy to read and understand
    examples:
      good: "Clear variable names, logical flow, proper indentation"
      bad: "Single-letter variables, deeply nested logic, inconsistent style"
  
  - name: correctness
    weight: 0.4
    description: Code correctly implements the intended behavior
    examples:
      good: "Handles edge cases, correct algorithm, proper error handling"
      bad: "Missing edge cases, off-by-one errors, swallowed exceptions"
  
  - name: efficiency
    weight: 0.2
    description: Code uses appropriate data structures and algorithms
    examples:
      good: "O(n) where O(n) is optimal, avoids unnecessary allocations"
      bad: "O(n²) when O(n) is possible, creates objects in tight loops"
  
  - name: maintainability
    weight: 0.1
    description: Code is easy to modify and extend
    examples:
      good: "Single responsibility, low coupling, clear interfaces"
      bad: "God functions, tight coupling, magic numbers"

Error Messages Rubric (rubrics/error-messages.yaml)

name: error-messages
description: Evaluates quality of error messages
passingThreshold: 0.6

criteria:
  - name: clarity
    weight: 0.4
    description: Error message clearly explains what went wrong
  
  - name: actionability
    weight: 0.4
    description: Error message suggests how to fix the problem
  
  - name: context
    weight: 0.2
    description: Error message includes relevant context (file, line, values)
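
Both example rubrics use a `passingThreshold` between 0 and 1 and weights that sum to 1.0. A loader could check those properties once a rubric has been parsed into a dict. The following is a minimal sketch; `validate_rubric` is a hypothetical helper, and treating "weights sum to 1.0" as a rule is an assumption drawn from the two examples rather than a stated requirement.

```python
def validate_rubric(rubric: dict) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []

    threshold = rubric.get("passingThreshold")
    if not isinstance(threshold, (int, float)) or not 0 <= threshold <= 1:
        problems.append("passingThreshold must be a number between 0 and 1")

    criteria = rubric.get("criteria") or []
    if not criteria:
        problems.append("rubric has no criteria")

    names = [c.get("name") for c in criteria]
    if len(names) != len(set(names)):
        problems.append("criterion names must be unique")

    weights = [c.get("weight", 0) for c in criteria]
    if any(not isinstance(w, (int, float)) or w <= 0 for w in weights):
        problems.append("every criterion needs a positive weight")
    elif criteria and abs(sum(weights) - 1.0) > 1e-6:
        # Assumed convention: weights sum to 1.0, as in the examples above.
        problems.append(f"weights sum to {sum(weights):.2f}, expected 1.0")

    return problems


# The error-messages rubric from above, as it would look after YAML parsing.
error_messages = {
    "name": "error-messages",
    "passingThreshold": 0.6,
    "criteria": [
        {"name": "clarity", "weight": 0.4},
        {"name": "actionability", "weight": 0.4},
        {"name": "context", "weight": 0.2},
    ],
}
problems = validate_rubric(error_messages)
```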

Promptfoo Integration

Using llm-rubric Assertion

# promptfooconfig.yaml
tests:
  - vars:
      code_output: "{{generated_code}}"
    assert:
      - type: llm-rubric
        value: |
          Evaluate this code for quality:
          
          {{code_output}}
          
          Score each criterion from 0 to 1:
          1. Readability
          2. Correctness
          3. Efficiency
          4. Maintainability
          
          Provide an overall score from 0 to 1 and specific feedback.
        threshold: 0.7
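
The rubric text in `value` above is written by hand, which duplicates what the YAML rubric files already describe. One option is to render that text from a parsed rubric so the YAML stays the single source. The sketch below shows one possible rendering; `rubric_to_prompt` is a hypothetical helper, and how its output gets wired into the Promptfoo config is left open.

```python
def rubric_to_prompt(rubric: dict) -> str:
    """Render a parsed rubric as plain-text grading instructions."""
    lines = [f"Evaluate the output against the '{rubric['name']}' rubric.", ""]
    for i, criterion in enumerate(rubric["criteria"], start=1):
        lines.append(
            f"{i}. {criterion['name']} (weight {criterion['weight']}): "
            f"{criterion['description']}"
        )
        # Good/bad examples are optional in the Rubric type.
        examples = criterion.get("examples")
        if examples:
            lines.append(f"   Good: {examples['good']}")
            lines.append(f"   Bad: {examples['bad']}")
    lines.append("")
    lines.append("Score each criterion from 0 to 1 and give specific feedback.")
    return "\n".join(lines)


rubric = {
    "name": "error-messages",
    "criteria": [
        {"name": "clarity", "weight": 0.4,
         "description": "Error message clearly explains what went wrong"},
        {"name": "actionability", "weight": 0.4,
         "description": "Error message suggests how to fix the problem"},
    ],
}
prompt_text = rubric_to_prompt(rubric)
```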

Custom Python Grader

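# graders/rubric_grader.py
import json
from pathlib import Path

import yaml  # PyYAML; assumed to be installed alongside anthropic
from anthropic import Anthropic

# Assumption: rubrics live in the top-level rubrics/ directory, next to graders/.
RUBRICS_DIR = Path(__file__).resolve().parent.parent / "rubrics"


def load_rubric(name: str) -> dict:
    """Load rubrics/<name>.yaml and return it as a dict."""
    with open(RUBRICS_DIR / f"{name}.yaml", encoding="utf-8") as f:
        return yaml.safe_load(f)
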
def get_assert(output: str, context: dict) -> dict:
    """Grade output using LLM rubric."""
    rubric = context.get('config', {}).get('rubric', 'code-quality')
    rubric_def = load_rubric(rubric)
    
    client = Anthropic()
    
    prompt = f"""
You are evaluating code quality against this rubric:

{json.dumps(rubric_def, indent=2)}

Code to evaluate:

{output}


For each criterion, provide:
1. Score (0-1)
2. Brief feedback

Return JSON:
{{
  "scores": {{"criterion_name": {{"score": 0.8, "feedback": "..."}}}},
  "overall": 0.75,
  "summary": "..."
}}
"""
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    
    result = json.loads(response.content[0].text)
    
    return {
        "pass": result["overall"] >= rubric_def["passingThreshold"],
        "score": result["overall"],
        "reason": result["summary"],
        "namedScores": {k: v["score"] for k, v in result["scores"].items()},
    }
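
The grader above calls `json.loads(response.content[0].text)` directly, which fails if the model wraps its JSON in a Markdown code fence or adds a sentence around it. A more tolerant parse could look like the following sketch. `extract_json` and `clamp_score` are hypothetical helpers, and the fallback simply takes the span from the first `{` to the last `}`, which is an assumption about the reply shape.

```python
import json
import re


def extract_json(text: str) -> dict:
    """Parse the JSON object in a model reply.

    Accepts bare JSON, JSON inside a Markdown code fence, or JSON
    surrounded by a little prose.
    """
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass

    # Fall back to the span from the first "{" to the last "}".
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in grader reply")
    return json.loads(match.group(0))


def clamp_score(value) -> float:
    """Coerce a reported score to a float in [0, 1]."""
    return max(0.0, min(1.0, float(value)))


reply = 'Here is my evaluation: {"overall": 0.75, "summary": "ok"} Hope that helps.'
parsed = extract_json(reply)
overall = clamp_score(parsed["overall"])
```

Clamping guards against a reply that reports a score on the wrong scale, which would otherwise pass any threshold.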

Calibration

LLM graders need calibration against examples with known expected scores to ensure consistency:

interface CalibrationSet {
  rubric: string;
  examples: CalibrationExample[];
}

interface CalibrationExample {
  input: string;
  expectedScore: number;
  expectedFeedback: string[];
}

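interface CalibrationResult {
  agreement: number;         // Fraction of examples scored within 0.1 of expected
  drift: number[];           // Signed error per example (actual - expected)
  needsAdjustment: boolean;
}

// gradeWithRubric is assumed to be exported by grader.ts and to return a RubricGradingResult.
async function calibrate(rubric: Rubric, examples: CalibrationExample[]): Promise<CalibrationResult> {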
  const results = await Promise.all(
    examples.map(ex => gradeWithRubric(ex.input, rubric))
  );
  
  const agreement = results.filter((r, i) => 
    Math.abs(r.score - examples[i].expectedScore) < 0.1
  ).length / results.length;
  
  return {
    agreement,
    drift: results.map((r, i) => r.score - examples[i].expectedScore),
    needsAdjustment: agreement < 0.8,
  };
}
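
The `agreement` figure says how often the grader lands near the expected score, but not in which direction it misses. Summarizing the `drift` values can show whether the grader is consistently lenient or harsh. The sketch below does this in Python with made-up scores; `summarize_drift` and its field names are hypothetical, and the 0.1 tolerance mirrors the one used in `calibrate`.

```python
from statistics import mean


def summarize_drift(actual, expected, tolerance=0.1):
    """Summarize how far grader scores sit from the expected scores."""
    if len(actual) != len(expected) or not actual:
        raise ValueError("need two non-empty score lists of equal length")

    drift = [a - e for a, e in zip(actual, expected)]
    within = sum(1 for d in drift if abs(d) < tolerance)
    return {
        "agreement": within / len(drift),
        # Signed mean: above 0 means the grader scores higher than expected.
        "bias": mean(drift),
        # Unsigned mean: typical size of a miss, regardless of direction.
        "mean_abs_error": mean(abs(d) for d in drift),
    }


# Illustrative numbers only, not measured results.
summary = summarize_drift([0.8, 0.65, 0.9, 0.4], [0.75, 0.5, 0.85, 0.45])
```

A bias near zero with a large mean absolute error suggests noisy grading, while a large bias suggests the rubric wording or examples push scores in one direction.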

File Structure

src/graders/
├── llm/
│   ├── index.ts          # Main entry
│   ├── provider.ts       # Promptfoo custom provider
│   ├── rubric-loader.ts  # Load YAML rubrics
│   └── grader.ts         # Core grading logic
└── calibration/
    ├── calibrator.ts     # Calibration runner
    └── examples/         # Calibration datasets

rubrics/
├── code-quality.yaml
├── error-messages.yaml
├── documentation.yaml
└── api-design.yaml

graders/
└── rubric_grader.py      # Python grader for Promptfoo

When to Use LLM vs Deterministic

| Use LLM Graders    | Use Deterministic      |
|--------------------|------------------------|
| Subjective quality | Pass/fail assertions   |
| Style/readability  | Type checking          |
| Helpfulness        | Value equality         |
| Consistency        | Error presence         |
| User experience    | Performance thresholds |

Dependencies

{
  "js-yaml": "^4.1.0"
}

Success Criteria

  • Rubrics load from YAML files
  • LLM grader produces consistent scores
  • Calibration detects drift
  • Integrates with Promptfoo llm-rubric
  • Custom Python grader works
  • At least 80% agreement with human judgment on the calibration examples