# Image Input Support for LLM Providers

This document describes how to submit images to different LLM provider APIs and proposes an abstraction layer for unified image handling.

## Provider-Specific Image Support

### 1. Anthropic (Claude)

**Supported Models**: Claude 3 and Claude 4 families (Sonnet, Haiku, Opus)

**Image Formats**: JPEG, PNG, GIF, WebP

**Methods**:
1. **Base64 Encoding**:

```json
{
  "role": "user",
  "content": [
    {
      "type": "image",
      "source": {
        "type": "base64",
        "media_type": "image/jpeg",
        "data": "<base64_encoded_image_data>"
      }
    },
    {
      "type": "text",
      "text": "What's in this image?"
    }
  ]
}
```
2. **URL Support**:

```json
{
  "role": "user",
  "content": [
    {
      "type": "image",
      "source": {
        "type": "url",
        "url": "https://example.com/image.jpg"
      }
    }
  ]
}
```

**Limitations**:
- Maximum 20 images per request
- Each image: maximum 3.75 MB
- Maximum dimensions: 8,000px × 8,000px
- Images are ephemeral (not stored beyond request duration)

### 2. Google GenAI (Gemini)

**Supported Models**: Gemini Pro Vision, Gemini 1.5, Gemini 2.0

**Image Formats**: JPEG, PNG, GIF, WebP

**Methods**:
1. **Inline Base64 Data** (for files < 20MB):

```json
{
  "contents": [{
    "parts": [
      {
        "inline_data": {
          "mime_type": "image/jpeg",
          "data": "BASE64_ENCODED_IMAGE_DATA"
        }
      },
      {
        "text": "Describe this image"
      }
    ]
  }]
}
```
2. **File API** (for larger files or reuse):
   - Upload the file first using the File API
   - Reference it by file URI in subsequent requests (see the sketch below)
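For reference, a minimal sketch of the part shape used when referencing an uploaded file instead of sending inline data; the URI value is a placeholder for what the File API upload returns:

```typescript
// Sketch: a Gemini request part that points at a previously uploaded file.
// The file_uri below is a placeholder, not a real value.
const filePart = {
  file_data: {
    mime_type: "image/jpeg",
    file_uri: "<file_uri_returned_by_the_File_API>"
  }
};
```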

**Limitations**:
- Inline data: total request size (text + images) < 20MB
- Base64 encoding increases size in transit
- Returns HTTP 413 if the request is too large

### 3. OpenAI Chat Completions (GPT-4o, GPT-4o-mini)

**Supported Models**: GPT-4o, GPT-4o-mini, GPT-4-turbo with vision

**Image Formats**: JPEG, PNG, GIF, WebP

**Methods**:
1. **URL Reference**:

```json
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.jpg"
      }
    }
  ]
}
```
2. **Base64 Data URL**:

```json
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,<base64_encoded_image>"
      }
    }
  ]
}
```

**Note**: Despite the field name `image_url`, base64 data URLs are supported.

### 4. OpenAI Responses API (o1, o3, o4-mini)

**Vision Support by Model**:
- ✅ **o1**: Full vision support
- ✅ **o3**: Vision support + image generation
- ✅ **o4-mini**: Vision support + image generation
- ❌ **o3-mini**: No vision capabilities
- ✅ **o3-pro**: Vision analysis (no generation)

**Methods**: Same as the Chat Completions API
- URL references
- Base64 data URLs

**Note**: Vision capabilities are integrated into the reasoning chain-of-thought, producing more contextually rich responses.

## Proposed Unified Abstraction

### Image Content Type

```typescript
interface ImageContent {
  type: "image";
  data: string;     // base64-encoded image data
  mimeType: string; // e.g., "image/jpeg", "image/png"
}
```

### Unified Message Structure

```typescript
interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];
}

interface TextContent {
  type: "text";
  text: string;
}
```
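For illustration, a user turn that pairs an image with a question looks like this under the unified structure (the base64 payload is elided):

```typescript
const message: UserMessage = {
  role: "user",
  content: [
    { type: "image", data: "<base64_encoded_image_data>", mimeType: "image/jpeg" },
    { type: "text", text: "What's in this image?" }
  ]
};
```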

### Provider Adapter Implementation

Each provider adapter would:
1. **Check Model Capabilities**:

```typescript
if (model.input.includes("image")) {
  // Process image content
} else {
  // Throw an error or ignore the images
}
```
2. **Convert to Provider Format**:

```typescript
// Anthropic converter
function toAnthropicContent(content: (TextContent | ImageContent)[]) {
  return content.map(item => {
    if (item.type === "image") {
      return {
        type: "image",
        source: {
          type: "base64",
          media_type: item.mimeType,
          data: item.data
        }
      };
    }
    return { type: "text", text: item.text };
  });
}

// OpenAI converter
function toOpenAIContent(content: (TextContent | ImageContent)[]) {
  return content.map(item => {
    if (item.type === "image") {
      return {
        type: "image_url",
        image_url: {
          url: `data:${item.mimeType};base64,${item.data}`
        }
      };
    }
    return { type: "text", text: item.text };
  });
}

// Google converter
function toGoogleContent(content: (TextContent | ImageContent)[]) {
  return content.map(item => {
    if (item.type === "image") {
      return {
        inline_data: {
          mime_type: item.mimeType,
          data: item.data
        }
      };
    }
    return { text: item.text };
  });
}
```
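A minimal dispatch sketch tying these converters together; the `Provider` union and `toProviderContent` helper are illustrative names, not part of any SDK:

```typescript
type Provider = "anthropic" | "openai" | "google";

// Route a unified content array to the matching provider converter.
function toProviderContent(
  content: (TextContent | ImageContent)[],
  provider: Provider
) {
  switch (provider) {
    case "anthropic":
      return toAnthropicContent(content);
    case "openai":
      return toOpenAIContent(content);
    case "google":
      return toGoogleContent(content);
  }
}
```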

### Size and Format Validation

```typescript
interface ImageConstraints {
  maxSizeMB: number;
  maxWidth: number;
  maxHeight: number;
  maxCount: number;
  supportedFormats: string[];
}

const PROVIDER_CONSTRAINTS: Record<string, ImageConstraints> = {
  anthropic: {
    maxSizeMB: 3.75,
    maxWidth: 8000,
    maxHeight: 8000,
    maxCount: 20,
    supportedFormats: ["image/jpeg", "image/png", "image/gif", "image/webp"]
  },
  google: {
    maxSizeMB: 20, // for inline data
    maxWidth: Infinity,
    maxHeight: Infinity,
    maxCount: Infinity,
    supportedFormats: ["image/jpeg", "image/png", "image/gif", "image/webp"]
  },
  openai: {
    maxSizeMB: 20,
    maxWidth: Infinity,
    maxHeight: Infinity,
    maxCount: Infinity,
    supportedFormats: ["image/jpeg", "image/png", "image/gif", "image/webp"]
  }
};

async function validateImage(
  image: ImageContent,
  provider: string
): Promise<void> {
  const constraints = PROVIDER_CONSTRAINTS[provider];
  if (!constraints) {
    throw new Error(`Unknown provider: ${provider}`);
  }

  // Check MIME type
  if (!constraints.supportedFormats.includes(image.mimeType)) {
    throw new Error(`Unsupported image format: ${image.mimeType}`);
  }

  // Check size
  const imageBuffer = Buffer.from(image.data, "base64");
  const sizeMB = imageBuffer.length / (1024 * 1024);
  if (sizeMB > constraints.maxSizeMB) {
    throw new Error(`Image exceeds ${constraints.maxSizeMB}MB limit`);
  }

  // Could add dimension checks using an image processing library
}
```
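As a usage sketch, validation could run over every image part of a unified message before conversion; `validateMessageImages` is an illustrative helper, not an existing API:

```typescript
async function validateMessageImages(
  message: UserMessage,
  provider: string
): Promise<void> {
  if (typeof message.content === "string") return; // text-only message

  for (const item of message.content) {
    if (item.type === "image") {
      await validateImage(item, provider);
    }
  }
}
```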

## Implementation Considerations

1. **Preprocessing**:
   - The user is responsible for converting images to base64 before passing them to the API
   - Utility functions could be provided for common conversions (file to base64, URL to base64); a sketch follows below
   - Image optimization (resize/compress) should happen before encoding
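A minimal sketch of such helpers for a Node.js environment; the names `fileToBase64` and `urlToBase64` are illustrative:

```typescript
import { readFile } from "node:fs/promises";

// Read a local file and return its contents as base64.
async function fileToBase64(path: string): Promise<string> {
  const buffer = await readFile(path);
  return buffer.toString("base64");
}

// Fetch a remote image and return its contents as base64.
async function urlToBase64(url: string): Promise<string> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to fetch image: ${response.status}`);
  }
  return Buffer.from(await response.arrayBuffer()).toString("base64");
}
```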

2. **Error Handling**:
   - Validate MIME types and sizes before sending
   - Check model capabilities (via `model.input.includes("image")`)
   - Provide clear error messages for unsupported features

3. **Performance**:
   - Base64 encoding increases payload size by ~33%
   - Consider image compression before encoding (see the sketch below)
   - For Google GenAI, be aware of the 20MB total request limit
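One possible approach, assuming the `sharp` library (an assumption of this sketch, not a dependency of this design), is to downscale and re-encode before base64 conversion:

```typescript
import sharp from "sharp";

// Downscale and re-encode an image, returning base64 ready for ImageContent.data.
// The width cap and JPEG quality are illustrative values, not provider requirements.
async function compressToBase64(input: Buffer): Promise<string> {
  const output = await sharp(input)
    .resize({ width: 1024, withoutEnlargement: true })
    .jpeg({ quality: 80 })
    .toBuffer();
  return output.toString("base64");
}
```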

4. **Token Counting**:
   - Images consume tokens (varies by provider and image size)
   - Include image token estimates in usage calculations (a rough sketch follows below)
   - Anthropic: ~1 token per ~3-4 bytes of base64 data
   - OpenAI: Detailed images consume more tokens than low-detail ones
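As an illustration only, a crude estimator based on the per-byte heuristic above; the provider's usage metadata in the response remains the authoritative count:

```typescript
// Very rough planning estimate using the "~1 token per ~3-4 bytes of base64" heuristic.
function estimateAnthropicImageTokens(image: ImageContent): number {
  return Math.ceil(image.data.length / 3.5);
}
```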

5. **Fallback Strategies**:
   - If a model doesn't support images, throw an error or ignore the images
   - Consider offering a text-only fallback for non-vision models (see the sketch below)
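A sketch of such a text-only fallback; `toTextOnlyMessage` is an illustrative helper that simply drops image parts:

```typescript
// Drop image parts so a non-vision model still receives the textual content.
function toTextOnlyMessage(message: UserMessage): UserMessage {
  if (typeof message.content === "string") return message;

  const textParts = message.content.filter(
    (item): item is TextContent => item.type === "text"
  );
  return { role: "user", content: textParts };
}
```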