evaluclaude-harness/docs/01-codebase-analyzer-prompt.md
2026-01-11 16:58:40 -05:00

1. Codebase Analyzer Prompt - System Design

Priority: 🟡 HIGH — Core LLM logic
Complexity: High (prompt engineering)
Effort Estimate: 8-12 hours (iterative refinement)


Overview

The Codebase Analyzer takes a structured RepoSummary from the introspector and generates EvalSpec JSON that defines which tests to create. Key insight: Claude generates specs, not code; test code is then rendered deterministically from those specs.
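
To make the spec/code split concrete, here is a minimal sketch of what a deterministic renderer could look like. The `MiniScenario` shape, the single equality assertion, and the `renderPytest` and `toPythonLiteral` helpers are all hypothetical names chosen for illustration; the design does not specify the renderer.

```typescript
// Hypothetical, trimmed-down scenario shape used only for this sketch.
interface MiniScenario {
  id: string; // kebab-case, e.g. "auth-login-success"
  target: { module: string; function: string };
  args: Record<string, unknown>;
  expected: unknown; // assumed: one equality assertion per scenario
}

// Convert a JSON-compatible value into a Python literal.
function toPythonLiteral(value: unknown): string {
  if (value === null) return 'None';
  if (typeof value === 'boolean') return value ? 'True' : 'False';
  if (typeof value === 'number') return String(value);
  if (typeof value === 'string') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(toPythonLiteral).join(', ')}]`;
  if (typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>).map(
      ([key, inner]) => `${JSON.stringify(key)}: ${toPythonLiteral(inner)}`,
    );
    return `{${entries.join(', ')}}`;
  }
  throw new Error(`Unsupported value: ${String(value)}`);
}

// Pure function: the same scenario always renders to the same test source.
function renderPytest(s: MiniScenario): string {
  const testName = `test_${s.id.replace(/-/g, '_')}`;
  const callArgs = Object.entries(s.args)
    .map(([name, value]) => `${name}=${toPythonLiteral(value)}`)
    .join(', ');
  return [
    `from ${s.target.module} import ${s.target.function}`,
    '',
    `def ${testName}():`,
    `    result = ${s.target.function}(${callArgs})`,
    `    assert result == ${toPythonLiteral(s.expected)}`,
    '',
  ].join('\n');
}

const scenario: MiniScenario = {
  id: 'auth-login-success',
  target: { module: 'auth', function: 'login' },
  args: { username: 'alice', password: 'secret' },
  expected: true,
};
console.log(renderPytest(scenario));
```

Because the renderer involves no model call, regenerating tests from an unchanged spec produces byte-identical output, which keeps diffs and reruns stable.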


Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Codebase Analyzer Agent                      │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐      │
│  │ RepoSummary  │───▶│ Claude Agent │───▶│   EvalSpec   │      │
│  │    JSON      │    │    SDK       │    │    JSON      │      │
│  └──────────────┘    └──────────────┘    └──────────────┘      │
│                            │                                    │
│                            ▼                                    │
│                    ┌───────────────┐                            │
│                    │AskUserQuestion│                            │
│                    │   (optional)  │                            │
│                    └───────────────┘                            │
└─────────────────────────────────────────────────────────────────┘

Core Types

interface EvalSpec {
  version: '1.0';
  repo: { name: string; languages: string[]; analyzedAt: string };
  scenarios: EvalScenario[];
  grading: {
    deterministic: DeterministicGrade[];
    rubrics: RubricGrade[];
  };
  metadata: {
    generatedBy: string;
    totalTokens: number;
    questionsAsked: number;
    confidence: 'low' | 'medium' | 'high';
  };
}

interface EvalScenario {
  id: string;                    // "auth-login-success"
  name: string;
  description: string;
  target: {
    module: string;
    function: string;
    type: 'function' | 'method' | 'class';
  };
  category: 'unit' | 'integration' | 'edge-case' | 'negative';
  priority: 'critical' | 'high' | 'medium' | 'low';
  setup?: { fixtures: string[]; mocks: MockSpec[] };
  input: { args: Record<string, any>; kwargs?: Record<string, any> };
  assertions: Assertion[];
  tags: string[];
}
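
The interfaces above reference `MockSpec` and `Assertion` without defining them. The sketch below fills them in with hypothetical shapes (the `equals` and `raises` assertion kinds and the `target`/`returns` mock fields are assumptions, not part of the design) and shows one concrete scenario. `sortByPriority` is likewise an illustrative helper.

```typescript
// Hypothetical supporting types: the design does not define these.
interface MockSpec {
  target: string; // dotted path of the dependency to replace
  returns?: unknown; // canned return value
}

type Assertion =
  | { type: 'equals'; expected: unknown }
  | { type: 'raises'; errorType: string };

// Mirrors the EvalScenario interface so this sketch compiles on its own.
interface EvalScenario {
  id: string;
  name: string;
  description: string;
  target: { module: string; function: string; type: 'function' | 'method' | 'class' };
  category: 'unit' | 'integration' | 'edge-case' | 'negative';
  priority: 'critical' | 'high' | 'medium' | 'low';
  setup?: { fixtures: string[]; mocks: MockSpec[] };
  input: { args: Record<string, unknown>; kwargs?: Record<string, unknown> };
  assertions: Assertion[];
  tags: string[];
}

const loginSuccess: EvalScenario = {
  id: 'auth-login-success',
  name: 'Login succeeds with valid credentials',
  description: 'A known user with the correct password is authenticated.',
  target: { module: 'auth', function: 'login', type: 'function' },
  category: 'unit',
  priority: 'critical',
  setup: { fixtures: [], mocks: [{ target: 'auth.db.find_user', returns: { id: 1 } }] },
  input: { args: { username: 'alice', password: 'secret' } },
  assertions: [{ type: 'equals', expected: true }],
  tags: ['auth'],
};

// Lower number means the scenario is handled earlier.
const PRIORITY_ORDER = { critical: 0, high: 1, medium: 2, low: 3 } as const;

// Returns a sorted copy; the input array is left untouched.
function sortByPriority(scenarios: EvalScenario[]): EvalScenario[] {
  return [...scenarios].sort(
    (a, b) => PRIORITY_ORDER[a.priority] - PRIORITY_ORDER[b.priority],
  );
}
```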

Prompt Architecture (Three-Part)

1. System Prompt

  • Defines Claude's identity as codebase analyzer
  • Constraints: functional tests only, no syntax checks, ask rather than assume

2. Developer Prompt

  • Contains EvalSpec JSON schema
  • Formatting rules (snake_case, kebab-case IDs)
  • Assertion type reference

3. User Prompt (Template)

  • Injects RepoSummary JSON
  • User context about what to evaluate
  • Instructions for output format
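
The three-part split could be assembled along these lines. This is a sketch under assumptions: the `{{REPO_SUMMARY}}` and `{{USER_CONTEXT}}` placeholder names, the minimal `RepoSummary` shape, and the `fillTemplate` and `buildPrompts` helpers are hypothetical, since the design does not specify the template syntax.

```typescript
// Hypothetical minimal RepoSummary; the real shape comes from the introspector.
interface RepoSummary {
  name: string;
  languages: string[];
  files: string[];
}

interface PromptParts {
  system: string;
  developer: string;
  user: string;
}

// Replace {{KEY}} placeholders. A replacer function is used so that "$"
// sequences inside the injected JSON are inserted literally.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    const value = values[key];
    if (value === undefined) throw new Error(`Missing template value: ${key}`);
    return value;
  });
}

// System and developer prompts are static; only the user prompt is templated.
function buildPrompts(
  templates: PromptParts,
  summary: RepoSummary,
  userContext: string,
): PromptParts {
  return {
    system: templates.system,
    developer: templates.developer,
    user: fillTemplate(templates.user, {
      REPO_SUMMARY: JSON.stringify(summary, null, 2),
      USER_CONTEXT: userContext,
    }),
  };
}

const prompts = buildPrompts(
  {
    system: 'You are a codebase analyzer.',
    developer: 'Output must match the EvalSpec schema.',
    user: 'Repo:\n{{REPO_SUMMARY}}\n\nFocus: {{USER_CONTEXT}}',
  },
  { name: 'demo', languages: ['python'], files: ['auth.py'] },
  'authentication flows',
);
console.log(prompts.user);
```

Failing loudly on an unknown placeholder means a typo in a template surfaces at build time instead of reaching the model as literal `{{...}}` text.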

Key Implementation

async function generateEvalSpec(options: GenerateOptions): Promise<EvalSpec> {
  // buildPrompt is an assumed helper exported by prompt-builder.ts.
  const prompt = await buildPrompt(options);

  const agentOptions: ClaudeAgentOptions = {
    systemPrompt: await loadPrompt('analyzer-system.md'),
    permissionMode: options.interactive ? 'default' : 'dontAsk',
    // Only AskUserQuestion is allowed, and only when a question handler is supplied.
    canUseTool: async ({ toolName, input }) => {
      if (toolName === 'AskUserQuestion' && options.onQuestion) {
        const answer = await options.onQuestion(input);
        return { behavior: 'allow', updatedInput: { ...input, answers: { [input.question]: answer } } };
      }
      return { behavior: 'deny' };
    },
    // Constrain the final output to the EvalSpec JSON schema.
    outputFormat: { type: 'json_schema', json_schema: { name: 'EvalSpec', schema: EVAL_SPEC_SCHEMA } },
  };

  for await (const msg of query(prompt, agentOptions)) {
    if (msg.type === 'result') return msg.output as EvalSpec;
  }

  // The stream ended without a result message.
  throw new Error('Analyzer finished without producing an EvalSpec');
}
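
The `msg.output as EvalSpec` cast is unchecked at runtime, so a malformed result would pass through silently. A minimal sketch of a structural check that could run before the cast is shown below. It is a hand-rolled stand-in for full JSON schema validation, and `validateEvalSpec`, `parseEvalSpec`, and the error messages are hypothetical names, not the project's validator.

```typescript
const KEBAB_CASE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
const CATEGORIES = ['unit', 'integration', 'edge-case', 'negative'];
const PRIORITIES = ['critical', 'high', 'medium', 'low'];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns human-readable problems; an empty list means the value passed.
function validateEvalSpec(value: unknown): string[] {
  if (!isRecord(value)) return ['spec must be an object'];
  const errors: string[] = [];
  if (value['version'] !== '1.0') errors.push('version must be "1.0"');

  const scenarios = value['scenarios'];
  if (!Array.isArray(scenarios)) return [...errors, 'scenarios must be an array'];

  scenarios.forEach((scenario: unknown, i: number) => {
    if (!isRecord(scenario)) {
      errors.push(`scenarios[${i}] must be an object`);
      return;
    }
    const { id, category, priority } = scenario;
    if (typeof id !== 'string' || !KEBAB_CASE.test(id)) {
      errors.push(`scenarios[${i}].id must be kebab-case`);
    }
    if (typeof category !== 'string' || !CATEGORIES.includes(category)) {
      errors.push(`scenarios[${i}].category is invalid`);
    }
    if (typeof priority !== 'string' || !PRIORITIES.includes(priority)) {
      errors.push(`scenarios[${i}].priority is invalid`);
    }
  });
  return errors;
}

// Narrow to the caller's type only after validation succeeds.
function parseEvalSpec<T>(raw: unknown): T {
  const errors = validateEvalSpec(raw);
  if (errors.length > 0) throw new Error(`Invalid EvalSpec: ${errors.join('; ')}`);
  return raw as T;
}
```

Collecting every problem rather than stopping at the first makes it possible to feed the full list back to the model in a retry prompt.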

File Structure

src/analyzer/
├── index.ts              # Main entry point
├── types.ts              # EvalSpec types
├── spec-generator.ts     # Claude Agent SDK integration
├── validator.ts          # JSON schema validation
└── prompt-builder.ts     # Builds prompts from templates

prompts/
├── analyzer-system.md
├── analyzer-developer.md
└── analyzer-user.md

Success Criteria

  • Generates valid EvalSpec JSON for Python repos
  • Generates valid EvalSpec JSON for TypeScript repos
  • Asks 2-3 clarifying questions on complex repos
  • <10k tokens per analysis
  • 100% assertion coverage (every scenario has assertions)
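
The measurable criteria above could be checked automatically against a generated spec. The following is a sketch under assumptions: `SpecSlice` keeps only the EvalSpec fields the checks need, and the `isComplexRepo` flag is hypothetical because the design does not define what counts as a complex repo.

```typescript
// Slice of EvalSpec needed for the checks; field names mirror Core Types.
interface SpecSlice {
  scenarios: { id: string; assertions: unknown[] }[];
  metadata: { totalTokens: number; questionsAsked: number };
}

interface CriteriaReport {
  withinTokenBudget: boolean;
  questionCountOk: boolean;
  assertionCoverage: number; // fraction of scenarios with at least one assertion
  uncovered: string[]; // ids of scenarios that have no assertions
}

const TOKEN_BUDGET = 10_000;

function checkCriteria(spec: SpecSlice, isComplexRepo: boolean): CriteriaReport {
  const uncovered = spec.scenarios
    .filter((s) => s.assertions.length === 0)
    .map((s) => s.id);
  const total = spec.scenarios.length;
  const asked = spec.metadata.questionsAsked;
  return {
    withinTokenBudget: spec.metadata.totalTokens < TOKEN_BUDGET,
    // The 2-3 question target only applies to complex repos.
    questionCountOk: !isComplexRepo || (asked >= 2 && asked <= 3),
    // A spec with no scenarios is treated as 0% covered rather than 100%.
    assertionCoverage: total === 0 ? 0 : (total - uncovered.length) / total,
    uncovered,
  };
}

const report = checkCriteria(
  {
    scenarios: [
      { id: 'auth-login-success', assertions: [{ type: 'equals', expected: true }] },
      { id: 'auth-login-locked', assertions: [] },
    ],
    metadata: { totalTokens: 8200, questionsAsked: 2 },
  },
  true,
);
console.log(report);
```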