co-mono/packages/ai/docs/images.md
Mario Zechner 4ac0c6ea28 docs(ai): Simplify ImageContent interface to base64-only approach
- Change ImageContent to simple { type, data, mimeType } structure
- Remove URL and file path support from core interface
- Simplify provider converters to work with base64 data only
- Update validation and implementation considerations
- Clarify that preprocessing is user's responsibility
2025-08-30 18:12:36 +02:00

Image Input Support for LLM Providers

This document describes how to submit images to different LLM provider APIs and proposes an abstraction layer for unified image handling.

Provider-Specific Image Support

1. Anthropic (Claude)

Supported Models: Claude 3 and Claude 4 families (Sonnet, Haiku, Opus)

Image Formats: JPEG, PNG, GIF, WebP

Methods:

  1. Base64 Encoding:
{
  "role": "user",
  "content": [
    {
      "type": "image",
      "source": {
        "type": "base64",
        "media_type": "image/jpeg",
        "data": "<base64_encoded_image_data>"
      }
    },
    {
      "type": "text",
      "text": "What's in this image?"
    }
  ]
}
  2. URL Support:
{
  "role": "user",
  "content": [
    {
      "type": "image",
      "source": {
        "type": "url",
        "url": "https://example.com/image.jpg"
      }
    }
  ]
}

Limitations:

  • Maximum 20 images per request
  • Each image max 3.75 MB
  • Maximum dimensions: 8,000px × 8,000px
  • Images are ephemeral (not stored beyond request duration)

2. Google GenAI (Gemini)

Supported Models: Gemini Pro Vision, Gemini 1.5, Gemini 2.0

Image Formats: JPEG, PNG, GIF, WebP

Methods:

  1. Inline Base64 Data (for requests under 20 MB in total):
{
  "contents": [{
    "parts": [
      {
        "inline_data": {
          "mime_type": "image/jpeg",
          "data": "BASE64_ENCODED_IMAGE_DATA"
        }
      },
      {
        "text": "Describe this image"
      }
    ]
  }]
}
  2. File API (for larger files or reuse):
  • Upload file first using File API
  • Reference by file URI in subsequent requests

Limitations:

  • Inline data: total request size (text + images) must stay under 20 MB
  • Base64 encoding increases the payload size in transit by roughly a third
  • The API returns HTTP 413 if the request is too large
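
The size impact of base64 can be estimated before a request is built. The sketch below computes the encoded length for a given number of raw bytes (4 output characters per 3 input bytes, with padding) and checks a set of images against the inline limit. `fitsInlineLimit` and its `overheadBytes` parameter are hypothetical names, and treating the limit as 20 × 1024 × 1024 bytes is an assumption about how the provider counts.

```typescript
// Base64 emits 4 characters for every 3 input bytes, padding the final group.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Assumption: the 20 MB inline limit is counted as 20 * 1024 * 1024 bytes
// and applies to the whole request, not to each image.
const INLINE_LIMIT_BYTES = 20 * 1024 * 1024;

// Hypothetical helper: do these images (given as raw byte sizes) plus the rest
// of the request (prompt text, JSON framing) stay under the inline limit?
function fitsInlineLimit(rawImageSizes: number[], overheadBytes = 0): boolean {
  const encoded = rawImageSizes.reduce((sum, n) => sum + base64Length(n), 0);
  return encoded + overheadBytes < INLINE_LIMIT_BYTES;
}
```

Under these assumptions a single 15 MB image already encodes to exactly 20 MB and does not fit, so the File API would be the safer route well before the raw size reaches the nominal limit.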

3. OpenAI Chat Completions (GPT-4o, GPT-4o-mini)

Supported Models: GPT-4o, GPT-4o-mini, GPT-4-turbo with vision

Image Formats: JPEG, PNG, GIF, WebP

Methods:

  1. URL Reference:
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.jpg"
      }
    }
  ]
}
  2. Base64 Data URL:
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,<base64_encoded_image>"
      }
    }
  ]
}

Note: Despite the field name image_url, base64 data URLs are supported.
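
Building and splitting such data URLs is simple string handling. A minimal sketch, assuming the `{ type, data, mimeType }` image shape proposed later in this document; `toDataUrl` and `fromDataUrl` are hypothetical helper names, and the parser assumes the URL carries no extra parameters between the MIME type and `;base64,`.

```typescript
interface ImageContent {
  type: "image";
  data: string; // base64 encoded image data
  mimeType: string;
}

// Build the data URL form accepted in the image_url field.
function toDataUrl(image: ImageContent): string {
  return `data:${image.mimeType};base64,${image.data}`;
}

// Inverse of toDataUrl. Returns null for anything that is not a base64 data URL
// (for example a plain https:// link).
function fromDataUrl(url: string): ImageContent | null {
  const prefix = "data:";
  const marker = ";base64,";
  if (!url.startsWith(prefix)) return null;
  const markerAt = url.indexOf(marker);
  if (markerAt < 0) return null;
  return {
    type: "image",
    mimeType: url.slice(prefix.length, markerAt),
    data: url.slice(markerAt + marker.length),
  };
}
```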

4. OpenAI Responses API (o1, o3, o4-mini)

Vision Support by Model:

  • o1: Full vision support
  • o3: Vision support + image generation
  • o4-mini: Vision support + image generation
  • o3-mini: No vision capabilities
  • o3-pro: Vision analysis (no generation)

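Methods: Same transport options as the Chat Completions API (the Responses API wraps them in input_image content parts, with image_url given as a plain string)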

  • URL references
  • Base64 data URLs

Note: In these models, vision is integrated into the reasoning chain of thought, which can yield more contextually rich responses.
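
As far as the public API reference describes it, the Responses API accepts the same URLs and data URLs but names its content parts differently: `input_text` and `input_image`, with `image_url` given as a plain string rather than a nested object. A minimal converter sketch under that assumption (the function name `toOpenAIResponsesContent` and the `ResponsesInputPart` type are hypothetical and should be checked against the current reference):

```typescript
interface TextContent {
  type: "text";
  text: string;
}

interface ImageContent {
  type: "image";
  data: string; // base64 encoded image data
  mimeType: string;
}

// Assumed part shapes for the Responses API.
type ResponsesInputPart =
  | { type: "input_text"; text: string }
  | { type: "input_image"; image_url: string };

function toOpenAIResponsesContent(
  content: (TextContent | ImageContent)[]
): ResponsesInputPart[] {
  return content.map((item): ResponsesInputPart => {
    if (item.type === "image") {
      // image_url is a plain string here, not an object with a url field.
      return {
        type: "input_image",
        image_url: `data:${item.mimeType};base64,${item.data}`,
      };
    }
    return { type: "input_text", text: item.text };
  });
}
```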

Proposed Unified Abstraction

Image Content Type

interface ImageContent {
  type: "image";
  data: string; // base64 encoded image data
  mimeType: string; // e.g., "image/jpeg", "image/png"
}
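
Because `mimeType` is supplied by the caller, it can drift from the actual bytes. One possible convenience, sketched here as an assumption rather than part of the proposal, is to infer the type from the base64 prefix that each format's magic bytes produce. `imageFromBase64` and `BASE64_SIGNATURES` are hypothetical names.

```typescript
interface ImageContent {
  type: "image";
  data: string; // base64 encoded image data
  mimeType: string;
}

// Base64 prefixes that result from encoding well-known magic bytes.
const BASE64_SIGNATURES: ReadonlyArray<readonly [string, string]> = [
  ["/9j/", "image/jpeg"],       // FF D8 FF
  ["iVBORw0KGgo", "image/png"], // 89 50 4E 47 0D 0A 1A 0A
  ["R0lGOD", "image/gif"],      // "GIF8" (covers GIF87a and GIF89a)
  ["UklGR", "image/webp"],      // "RIFF"; a heuristic, since other RIFF files share it
];

// Build an ImageContent, inferring the MIME type when the prefix is recognized
// and falling back to the caller's value otherwise.
function imageFromBase64(data: string, fallbackMimeType?: string): ImageContent {
  const hit = BASE64_SIGNATURES.find(([prefix]) => data.startsWith(prefix));
  const mimeType = hit ? hit[1] : fallbackMimeType;
  if (!mimeType) {
    throw new Error("Could not determine image MIME type");
  }
  return { type: "image", data, mimeType };
}
```

The WebP entry only recognizes the RIFF container; a stricter check would decode bytes 8 to 11 and compare them with "WEBP".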

Unified Message Structure

interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];
}

interface TextContent {
  type: "text";
  text: string;
}
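
Since `content` may be either a plain string or an array of parts, adapters will probably want to normalize it first. A small sketch of that step with an example message; `normalizeContent` is a hypothetical helper name and the base64 value is a placeholder.

```typescript
interface TextContent {
  type: "text";
  text: string;
}

interface ImageContent {
  type: "image";
  data: string; // base64 encoded image data
  mimeType: string;
}

interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];
}

// Turn the string shorthand into a single text part so that converters
// only ever deal with the array form.
function normalizeContent(message: UserMessage): (TextContent | ImageContent)[] {
  if (typeof message.content === "string") {
    return [{ type: "text", text: message.content }];
  }
  return message.content;
}

const exampleMessage: UserMessage = {
  role: "user",
  content: [
    { type: "text", text: "What's in this image?" },
    { type: "image", data: "<base64_encoded_image_data>", mimeType: "image/jpeg" },
  ],
};
```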

Provider Adapter Implementation

Each provider adapter would:

  1. Check Model Capabilities:
if (model.input.includes("image")) {
  // Process image content
} else {
  // Throw error or ignore images
}
  2. Convert to Provider Format:
// Anthropic converter
function toAnthropicContent(content: (TextContent | ImageContent)[]) {
  return content.map(item => {
    if (item.type === "image") {
      return {
        type: "image",
        source: {
          type: "base64",
          media_type: item.mimeType,
          data: item.data
        }
      };
    }
    return { type: "text", text: item.text };
  });
}

// OpenAI converter
function toOpenAIContent(content: (TextContent | ImageContent)[]) {
  return content.map(item => {
    if (item.type === "image") {
      return {
        type: "image_url",
        image_url: {
          url: `data:${item.mimeType};base64,${item.data}`
        }
      };
    }
    return { type: "text", text: item.text };
  });
}

// Google converter
function toGoogleContent(content: (TextContent | ImageContent)[]) {
  return content.map(item => {
    if (item.type === "image") {
      return {
        inline_data: {
          mime_type: item.mimeType,
          data: item.data
        }
      };
    }
    return { text: item.text };
  });
}
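
The capability check from step 1 can run once, before any converter is called. The following is a sketch only: `ModelInfo` is a hypothetical minimal descriptor (only the `input` array is taken from the check above), and the "throw" versus "strip" policy is an assumed way of expressing the two options named there.

```typescript
interface TextContent {
  type: "text";
  text: string;
}

interface ImageContent {
  type: "image";
  data: string;
  mimeType: string;
}

type Content = TextContent | ImageContent;

// Hypothetical model descriptor; real model metadata will carry more fields.
interface ModelInfo {
  id: string;
  input: ("text" | "image")[];
}

type UnsupportedImagePolicy = "throw" | "strip";

// Return content that is safe to convert for this model.
function prepareContent(
  model: ModelInfo,
  content: Content[],
  policy: UnsupportedImagePolicy = "throw"
): Content[] {
  const hasImages = content.some(item => item.type === "image");
  if (!hasImages || model.input.includes("image")) {
    return content;
  }
  if (policy === "throw") {
    throw new Error(`Model ${model.id} does not accept image input`);
  }
  // "strip": silently drop image parts and keep the text.
  return content.filter(item => item.type !== "image");
}
```

Throwing by default keeps silent data loss out of the happy path; stripping is opt-in for callers that want a text-only fallback.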

Size and Format Validation

interface ImageConstraints {
  maxSizeMB: number;
  maxWidth: number;
  maxHeight: number;
  maxCount: number;
  supportedFormats: string[];
}

const PROVIDER_CONSTRAINTS: Record<string, ImageConstraints> = {
  anthropic: {
    maxSizeMB: 3.75,
    maxWidth: 8000,
    maxHeight: 8000,
    maxCount: 20,
    supportedFormats: ["image/jpeg", "image/png", "image/gif", "image/webp"]
  },
  google: {
    maxSizeMB: 20, // inline data; the limit applies to the whole request, not each image
    maxWidth: Infinity,
    maxHeight: Infinity,
    maxCount: Infinity,
    supportedFormats: ["image/jpeg", "image/png", "image/gif", "image/webp"]
  },
  openai: {
    maxSizeMB: 20,
    maxWidth: Infinity,
    maxHeight: Infinity,
    maxCount: Infinity,
    supportedFormats: ["image/jpeg", "image/png", "image/gif", "image/webp"]
  }
};

async function validateImage(
  image: ImageContent,
  provider: string
): Promise<void> {
  const constraints = PROVIDER_CONSTRAINTS[provider];
  if (!constraints) {
    // Guard against providers that have no entry in the table
    throw new Error(`Unknown provider: ${provider}`);
  }

  // Check MIME type
  if (!constraints.supportedFormats.includes(image.mimeType)) {
    throw new Error(`Unsupported image format: ${image.mimeType}`);
  }

  // Check decoded size (Buffer is Node.js-specific)
  const imageBuffer = Buffer.from(image.data, 'base64');
  const sizeMB = imageBuffer.length / (1024 * 1024);
  if (sizeMB > constraints.maxSizeMB) {
    throw new Error(`Image exceeds ${constraints.maxSizeMB}MB limit`);
  }

  // Could add dimension checks using image processing library
}
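
For PNG input, a dimension check does not strictly need an image library, because width and height sit at fixed offsets in the IHDR chunk. The sketch below decodes only the first 24 bytes of the base64 string and reads them. It is an illustration for PNG only (JPEG, GIF and WebP would need their own parsers or a library), and `decodeBase64Prefix`, `readPngDimensions` and `exceedsDimensions` are hypothetical names.

```typescript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decode only the first byteCount bytes of a base64 string.
// Returns null if the string is too short or has padding/invalid characters there.
function decodeBase64Prefix(data: string, byteCount: number): number[] | null {
  const charCount = Math.ceil(byteCount / 3) * 4;
  if (data.length < charCount) return null;
  const bytes: number[] = [];
  for (let i = 0; i < charCount; i += 4) {
    let group = 0;
    for (let k = 0; k < 4; k++) {
      const value = BASE64_ALPHABET.indexOf(data.charAt(i + k));
      if (value < 0) return null;
      group = (group << 6) | value;
    }
    bytes.push((group >> 16) & 0xff, (group >> 8) & 0xff, group & 0xff);
  }
  return bytes.slice(0, byteCount);
}

// PNG layout: 8-byte signature, 4-byte chunk length, "IHDR",
// then big-endian width (bytes 16-19) and height (bytes 20-23).
function readPngDimensions(data: string): { width: number; height: number } | null {
  const header = decodeBase64Prefix(data, 24);
  if (!header) return null;
  const signature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
  if (!signature.every((byte, i) => header[i] === byte)) return null;
  const u32 = (offset: number): number =>
    ((header[offset] << 24) | (header[offset + 1] << 16) |
      (header[offset + 2] << 8) | header[offset + 3]) >>> 0;
  return { width: u32(16), height: u32(20) };
}

// True when the image is a PNG whose dimensions exceed the given maximums.
function exceedsDimensions(data: string, maxWidth: number, maxHeight: number): boolean {
  const size = readPngDimensions(data);
  return size !== null && (size.width > maxWidth || size.height > maxHeight);
}
```

Non-PNG data yields `null` from `readPngDimensions`, so `exceedsDimensions` reports `false` for it; a caller that must enforce limits for every format should treat `null` as "unknown" rather than "within limits".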

Implementation Considerations

  1. Preprocessing:

    • The user is responsible for converting images to base64 before passing them to the API
    • Utility functions could be provided for common conversions (file to base64, URL to base64)
    • Image optimization (resize/compress) should happen before encoding
  2. Error Handling:

    • Validate MIME types and sizes before sending
    • Check model capabilities (via model.input.includes("image"))
    • Provide clear error messages for unsupported features
  3. Performance:

    • Base64 encoding increases payload size by ~33%
    • Consider image compression before encoding
    • For Google GenAI, be aware of 20MB total request limit
  4. Token Counting:

    • Images consume tokens (varies by provider and image size)
    • Include image token estimates in usage calculations
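    • Anthropic: roughly (width × height) / 750 tokens, based on pixel dimensions rather than byte size
    • OpenAI: the detail setting ("low", "high", or "auto") controls cost; high-detail images consume more tokens than low-detail ones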
  5. Fallback Strategies:

    • If the model doesn't support images, throw an error or ignore the images
    • Consider offering text-only fallback for non-vision models
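
For the token-counting point, a rough estimator could look like the sketch below. It assumes the approximation given in Anthropic's vision documentation, tokens ≈ (width × height) / 750, and ignores any server-side downscaling of large images, so the result is a planning figure only. The function names are hypothetical.

```typescript
// Assumption: Anthropic's documented approximation, tokens ~ (width * height) / 750.
// Large images may be resized by the provider first, which this sketch ignores.
function estimateAnthropicImageTokens(width: number, height: number): number {
  return Math.ceil((width * height) / 750);
}

// Sum the estimate over several images given as pixel dimensions.
function estimateTotalImageTokens(
  images: { width: number; height: number }[]
): number {
  return images.reduce(
    (sum, image) => sum + estimateAnthropicImageTokens(image.width, image.height),
    0
  );
}
```

With this formula a 1000 × 1000 px image comes to about 1334 tokens and a 200 × 200 px image to about 54.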