# 5. LLM Rubric Graders - System Design

> **Priority**: MEDIUM - Subjective quality layer
> **Complexity**: Medium
> **Effort Estimate**: 4-6 hours

---

## Overview

LLM Rubric Graders use Claude to evaluate **subjective quality** that deterministic tests can't measure:

- Code readability
- Error message helpfulness
- Documentation quality
- API design consistency

These complement functional tests with human-like judgment.

---

## Architecture

```
┌────────────────────────────────────────────────────────────┐
│                    LLM Grading Pipeline                    │
├────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   Output     │───▶│   Rubric     │───▶│   Grading    │  │
│  │   (code/     │    │   + Claude   │    │   Result     │  │
│  │   text)      │    │              │    │              │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                             │                              │
│                             │ Uses Promptfoo               │
│                             │ llm-rubric assertion         │
└────────────────────────────────────────────────────────────┘
```

---

## Core Types

```typescript
interface Rubric {
  name: string;
  description: string;
  criteria: RubricCriterion[];
  passingThreshold: number;  // 0-1
}

interface RubricCriterion {
  name: string;
  description: string;
  weight: number;  // Relative weight
  examples?: {
    good: string;
    bad: string;
  };
}

interface RubricGradingResult {
  pass: boolean;
  score: number;  // 0-1
  reason: string;
  criterionScores: {
    name: string;
    score: number;
    feedback: string;
  }[];
}
```

---

## Rubric Examples

### Code Quality Rubric (`rubrics/code-quality.yaml`)

```yaml
name: code-quality
description: Evaluates generated code for quality and maintainability
passingThreshold: 0.7

criteria:
  - name: readability
    weight: 0.3
    description: Code is easy to read and understand
    examples:
      good: "Clear variable names, logical flow, proper indentation"
      bad: "Single-letter variables, deeply nested logic, inconsistent style"

  - name: correctness
    weight: 0.4
    description: Code correctly implements the intended behavior
    examples:
      good: "Handles edge cases, correct algorithm, proper error handling"
      bad: "Missing edge cases, off-by-one errors, swallowed exceptions"

  - name: efficiency
    weight: 0.2
    description: Code uses appropriate data structures and algorithms
    examples:
      good: "O(n) where O(n) is optimal, avoids unnecessary allocations"
      bad: "O(n²) when O(n) is possible, creates objects in tight loops"

  - name: maintainability
    weight: 0.1
    description: Code is easy to modify and extend
    examples:
      good: "Single responsibility, low coupling, clear interfaces"
      bad: "God functions, tight coupling, magic numbers"
```

### Error Messages Rubric (`rubrics/error-messages.yaml`)

```yaml
name: error-messages
description: Evaluates quality of error messages
passingThreshold: 0.6

criteria:
  - name: clarity
    weight: 0.4
    description: Error message clearly explains what went wrong

  - name: actionability
    weight: 0.4
    description: Error message suggests how to fix the problem

  - name: context
    weight: 0.2
    description: Error message includes relevant context (file, line, values)
```
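To make the mapping from these YAML files onto the `Rubric` type concrete, here is a minimal sketch of what `rubric-loader.ts` could look like. It assumes the Core Types above live in `./types` and uses `js-yaml` (listed under Dependencies below); the validation rules shown are illustrative, not fixed requirements.

```typescript
// src/graders/llm/rubric-loader.ts -- illustrative sketch
import * as fs from 'node:fs';
import * as path from 'node:path';
import * as yaml from 'js-yaml';
import type { Rubric } from './types'; // assumed home of the Core Types interfaces

const RUBRIC_DIR = path.resolve('rubrics');

/** Load a rubric by name, e.g. loadRubric('code-quality'). */
export function loadRubric(name: string): Rubric {
  const file = path.join(RUBRIC_DIR, `${name}.yaml`);
  const rubric = yaml.load(fs.readFileSync(file, 'utf8')) as Rubric;

  // Catch authoring mistakes early: criteria must exist and weights should sum to 1.
  if (!rubric.criteria?.length) {
    throw new Error(`Rubric "${name}" defines no criteria`);
  }
  const totalWeight = rubric.criteria.reduce((sum, c) => sum + c.weight, 0);
  if (Math.abs(totalWeight - 1) > 0.01) {
    throw new Error(`Rubric "${name}" weights sum to ${totalWeight}, expected 1.0`);
  }

  return rubric;
}
```

Validating weights at load time keeps a mis-weighted rubric from silently skewing every grading run.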
---

## Promptfoo Integration

### Using `llm-rubric` Assertion

```yaml
# promptfooconfig.yaml
tests:
  - vars:
      code_output: "{{generated_code}}"
    assert:
      - type: llm-rubric
        value: |
          Evaluate this code for quality:

          {{code_output}}

          Score on:
          1. Readability (0-10)
          2. Correctness (0-10)
          3. Efficiency (0-10)
          4. Maintainability (0-10)

          Provide overall score and specific feedback.
        threshold: 0.7
```

### Custom Python Grader

````python
# graders/rubric_grader.py
import json
from pathlib import Path

import yaml
from anthropic import Anthropic


def load_rubric(name: str) -> dict:
    """Load a rubric definition from the rubrics/ directory."""
    rubric_path = Path("rubrics") / f"{name}.yaml"
    return yaml.safe_load(rubric_path.read_text())


def get_assert(output: str, context: dict) -> dict:
    """Grade output using LLM rubric."""
    rubric = context.get('config', {}).get('rubric', 'code-quality')
    rubric_def = load_rubric(rubric)

    client = Anthropic()
    prompt = f"""
You are evaluating code quality against this rubric:

{json.dumps(rubric_def, indent=2)}

Code to evaluate:
```
{output}
```

For each criterion, provide:
1. Score (0-1)
2. Brief feedback

Return JSON:
{{
  "scores": {{"criterion_name": {{"score": 0.8, "feedback": "..."}}}},
  "overall": 0.75,
  "summary": "..."
}}
"""

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )

    result = json.loads(response.content[0].text)

    return {
        "pass": result["overall"] >= rubric_def["passingThreshold"],
        "score": result["overall"],
        "reason": result["summary"],
        "namedScores": {k: v["score"] for k, v in result["scores"].items()},
    }
````

---

## Calibration

LLM graders need calibration to ensure consistency. The loop below grades a set of pre-scored examples and measures how closely the grader tracks the expected scores (a sketch of the `gradeWithRubric` helper it calls appears at the end of this document):

```typescript
interface CalibrationSet {
  rubric: string;
  examples: CalibrationExample[];
}

interface CalibrationExample {
  input: string;
  expectedScore: number;
  expectedFeedback: string[];
}

interface CalibrationResult {
  agreement: number;        // Fraction of examples within 0.1 of the expected score
  drift: number[];          // Per-example difference from the expected score
  needsAdjustment: boolean;
}

async function calibrate(
  rubric: Rubric,
  examples: CalibrationExample[]
): Promise<CalibrationResult> {
  const results = await Promise.all(
    examples.map(ex => gradeWithRubric(ex.input, rubric))
  );

  const agreement = results.filter((r, i) =>
    Math.abs(r.score - examples[i].expectedScore) < 0.1
  ).length / results.length;

  return {
    agreement,
    drift: results.map((r, i) => r.score - examples[i].expectedScore),
    needsAdjustment: agreement < 0.8,
  };
}
```

---

## File Structure

```
src/graders/
├── llm/
│   ├── index.ts          # Main entry
│   ├── provider.ts       # Promptfoo custom provider
│   ├── rubric-loader.ts  # Load YAML rubrics
│   └── grader.ts         # Core grading logic
└── calibration/
    ├── calibrator.ts     # Calibration runner
    └── examples/         # Calibration datasets

rubrics/
├── code-quality.yaml
├── error-messages.yaml
├── documentation.yaml
└── api-design.yaml

graders/
└── rubric_grader.py      # Python grader for Promptfoo
```

---

## When to Use LLM vs Deterministic

| Use LLM Graders | Use Deterministic |
|-----------------|-------------------|
| Subjective quality | Pass/fail assertions |
| Style/readability | Type checking |
| Helpfulness | Value equality |
| Consistency | Error presence |
| User experience | Performance thresholds |

---

## Dependencies

```json
{
  "js-yaml": "^4.1.0"
}
```

The Python grader additionally requires the `anthropic` and `pyyaml` packages.

---

## Success Criteria

- [ ] Rubrics load from YAML files
- [ ] LLM grader produces consistent scores
- [ ] Calibration detects drift
- [ ] Integrates with Promptfoo `llm-rubric`
- [ ] Custom Python grader works
- [ ] >80% agreement with human judgment
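---

As referenced in the Calibration section, the `gradeWithRubric` helper (the core of `grader.ts`) is not specified above. The following is a minimal sketch under stated assumptions: it uses the `@anthropic-ai/sdk` package, imports the Core Types from an assumed `./types` module, and mirrors the prompt/JSON shape of the Python grader; none of these details are fixed by this design.

```typescript
// src/graders/llm/grader.ts -- illustrative sketch, not a fixed contract
import Anthropic from '@anthropic-ai/sdk';
import type { Rubric, RubricGradingResult } from './types';

const client = new Anthropic();

export async function gradeWithRubric(
  output: string,
  rubric: Rubric
): Promise<RubricGradingResult> {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content:
        'Evaluate the output against the rubric. Return JSON of the form ' +
        '{"criterionScores": [{"name": "...", "score": 0.8, "feedback": "..."}], "summary": "..."}.\n\n' +
        `Rubric:\n${JSON.stringify(rubric, null, 2)}\n\nOutput:\n${output}`,
    }],
  });

  // The SDK returns a list of content blocks; take the text of the first one.
  const block = response.content[0];
  const parsed = JSON.parse(block.type === 'text' ? block.text : '{}');
  const criterionScores = parsed.criterionScores ?? [];

  // Weighted overall score across criteria, compared against the rubric threshold.
  const score = rubric.criteria.reduce((sum, c) => {
    const match = criterionScores.find((s: { name: string }) => s.name === c.name);
    return sum + c.weight * (match?.score ?? 0);
  }, 0);

  return {
    pass: score >= rubric.passingThreshold,
    score,
    reason: parsed.summary ?? '',
    criterionScores,
  };
}
```

Weighting the per-criterion scores by the rubric weights keeps the overall score consistent with the `passingThreshold` semantics defined in the Core Types.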