mirror of
https://github.com/getcompanion-ai/co-mono.git
synced 2026-04-20 14:05:08 +00:00
docs(ai): Add comprehensive image input documentation for all LLM providers
- Document image support for Anthropic, Google GenAI, OpenAI APIs - Include format requirements, size limits, and API examples - Propose unified abstraction layer for cross-provider image handling - Add implementation examples for format conversion and validation
This commit is contained in:
parent
545d04fc5c
commit
0b50c3f36d
1 changed files with 384 additions and 0 deletions
384
packages/ai/docs/images.md
Normal file
384
packages/ai/docs/images.md
Normal file
|
|
@ -0,0 +1,384 @@
|
|||
# Image Input Support for LLM Providers
|
||||
|
||||
This document describes how to submit images to different LLM provider APIs and proposes an abstraction layer for unified image handling.
|
||||
|
||||
## Provider-Specific Image Support
|
||||
|
||||
### 1. Anthropic (Claude)
|
||||
|
||||
**Supported Models**: Claude 3 and Claude 4 families (Sonnet, Haiku, Opus)
|
||||
|
||||
**Image Formats**: JPEG, PNG, GIF, WebP
|
||||
|
||||
**Methods**:
|
||||
1. **Base64 Encoding**:
|
||||
```json
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"source": {
|
||||
"type": "base64",
|
||||
"media_type": "image/jpeg",
|
||||
"data": "<base64_encoded_image_data>"
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this image?"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
2. **URL Support**:
|
||||
```json
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"source": {
|
||||
"type": "url",
|
||||
"url": "https://example.com/image.jpg"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Limitations**:
|
||||
- Maximum 20 images per request
|
||||
- Each image max 3.75 MB
|
||||
- Maximum dimensions: 8,000px × 8,000px
|
||||
- Images are ephemeral (not stored beyond request duration)
|
||||
|
||||
### 2. Google GenAI (Gemini)
|
||||
|
||||
**Supported Models**: Gemini Pro Vision, Gemini 1.5, Gemini 2.0
|
||||
|
||||
**Image Formats**: JPEG, PNG, GIF, WebP
|
||||
|
||||
**Methods**:
|
||||
1. **Inline Base64 Data** (for files < 20MB):
|
||||
```json
|
||||
{
|
||||
"contents": [{
|
||||
"parts": [
|
||||
{
|
||||
"inline_data": {
|
||||
"mime_type": "image/jpeg",
|
||||
"data": "BASE64_ENCODED_IMAGE_DATA"
|
||||
}
|
||||
},
|
||||
{
|
||||
"text": "Describe this image"
|
||||
}
|
||||
]
|
||||
}]
|
||||
}
|
||||
```
|
||||
|
||||
2. **File API** (for larger files or reuse):
|
||||
- Upload file first using File API
|
||||
- Reference by file URI in subsequent requests
|
||||
|
||||
**Limitations**:
|
||||
- Inline data: Total request size (text + images) < 20MB
|
||||
- Base64 encoding increases size in transit
|
||||
- Returns HTTP 413 if request too large
|
||||
|
||||
### 3. OpenAI Chat Completions (GPT-4o, GPT-4o-mini)
|
||||
|
||||
**Supported Models**: GPT-4o, GPT-4o-mini, GPT-4-turbo with vision
|
||||
|
||||
**Image Formats**: JPEG, PNG, GIF, WebP
|
||||
|
||||
**Methods**:
|
||||
1. **URL Reference**:
|
||||
```json
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "What's in this image?"
|
||||
},
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": "https://example.com/image.jpg"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
2. **Base64 Data URL**:
|
||||
```json
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": "data:image/jpeg;base64,<base64_encoded_image>"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Note**: Despite the field name `image_url`, base64 data URLs are supported.
|
||||
|
||||
### 4. OpenAI Responses API (o1, o3, o4-mini)
|
||||
|
||||
**Vision Support by Model**:
|
||||
- ✅ **o1**: Full vision support
|
||||
- ✅ **o3**: Vision support + image generation
|
||||
- ✅ **o4-mini**: Vision support + image generation
|
||||
- ❌ **o3-mini**: No vision capabilities
|
||||
- ✅ **o3-pro**: Vision analysis (no generation)
|
||||
|
||||
**Methods**: Same as Chat Completions API
|
||||
- URL references
|
||||
- Base64 data URLs
|
||||
|
||||
**Note**: Vision capabilities integrated into reasoning chain-of-thought for more contextually rich responses.
|
||||
|
||||
## Proposed Unified Abstraction
|
||||
|
||||
### Image Content Type
|
||||
|
||||
```typescript
|
||||
interface ImageContent {
|
||||
type: "image";
|
||||
source: ImageSource;
|
||||
alt?: string; // Optional alt text for accessibility
|
||||
}
|
||||
|
||||
type ImageSource =
|
||||
| { type: "base64"; data: string; mimeType: string }
|
||||
| { type: "url"; url: string }
|
||||
| { type: "file"; path: string }; // Local file path
|
||||
```
|
||||
|
||||
### Unified Message Structure
|
||||
|
||||
```typescript
|
||||
interface UserMessage {
|
||||
role: "user";
|
||||
content: string | (TextContent | ImageContent)[];
|
||||
}
|
||||
|
||||
interface TextContent {
|
||||
type: "text";
|
||||
text: string;
|
||||
}
|
||||
```
|
||||
|
||||
### Provider Adapter Implementation
|
||||
|
||||
Each provider adapter would:
|
||||
|
||||
1. **Check Model Capabilities**:
|
||||
```typescript
|
||||
if (model.input.includes("image")) {
|
||||
// Process image content
|
||||
} else {
|
||||
// Throw error or ignore images
|
||||
}
|
||||
```
|
||||
|
||||
2. **Convert to Provider Format**:
|
||||
|
||||
```typescript
|
||||
// Anthropic converter
|
||||
function toAnthropicContent(content: (TextContent | ImageContent)[]) {
|
||||
return content.map(item => {
|
||||
if (item.type === "image") {
|
||||
if (item.source.type === "base64") {
|
||||
return {
|
||||
type: "image",
|
||||
source: {
|
||||
type: "base64",
|
||||
media_type: item.source.mimeType,
|
||||
data: item.source.data
|
||||
}
|
||||
};
|
||||
} else if (item.source.type === "url") {
|
||||
return {
|
||||
type: "image",
|
||||
source: {
|
||||
type: "url",
|
||||
url: item.source.url
|
||||
}
|
||||
};
|
||||
} else if (item.source.type === "file") {
|
||||
// Read file and convert to base64
|
||||
const data = fs.readFileSync(item.source.path).toString('base64');
|
||||
const mimeType = getMimeType(item.source.path);
|
||||
return {
|
||||
type: "image",
|
||||
source: {
|
||||
type: "base64",
|
||||
media_type: mimeType,
|
||||
data
|
||||
}
|
||||
};
|
||||
}
|
||||
}
|
||||
return { type: "text", text: item.text };
|
||||
});
|
||||
}
|
||||
|
||||
// OpenAI converter
|
||||
function toOpenAIContent(content: (TextContent | ImageContent)[]) {
|
||||
return content.map(item => {
|
||||
if (item.type === "image") {
|
||||
if (item.source.type === "base64") {
|
||||
return {
|
||||
type: "image_url",
|
||||
image_url: {
|
||||
url: `data:${item.source.mimeType};base64,${item.source.data}`
|
||||
}
|
||||
};
|
||||
} else if (item.source.type === "url") {
|
||||
return {
|
||||
type: "image_url",
|
||||
image_url: { url: item.source.url }
|
||||
};
|
||||
} else if (item.source.type === "file") {
|
||||
// Read and convert to data URL
|
||||
const data = fs.readFileSync(item.source.path).toString('base64');
|
||||
const mimeType = getMimeType(item.source.path);
|
||||
return {
|
||||
type: "image_url",
|
||||
image_url: {
|
||||
url: `data:${mimeType};base64,${data}`
|
||||
}
|
||||
};
|
||||
}
|
||||
}
|
||||
return { type: "text", text: item.text };
|
||||
});
|
||||
}
|
||||
|
||||
// Google converter
|
||||
function toGoogleContent(content: (TextContent | ImageContent)[]) {
|
||||
return content.map(item => {
|
||||
if (item.type === "image") {
|
||||
if (item.source.type === "base64") {
|
||||
return {
|
||||
inline_data: {
|
||||
mime_type: item.source.mimeType,
|
||||
data: item.source.data
|
||||
}
|
||||
};
|
||||
} else if (item.source.type === "url") {
|
||||
// Google doesn't support external URLs directly
|
||||
// Would need to fetch and convert to base64
|
||||
throw new Error("Google GenAI requires base64 or File API for images");
|
||||
} else if (item.source.type === "file") {
|
||||
const data = fs.readFileSync(item.source.path).toString('base64');
|
||||
const mimeType = getMimeType(item.source.path);
|
||||
return {
|
||||
inline_data: {
|
||||
mime_type: mimeType,
|
||||
data
|
||||
}
|
||||
};
|
||||
}
|
||||
}
|
||||
return { text: item.text };
|
||||
});
|
||||
}
|
||||
```
|
||||
|
||||
### Size and Format Validation
|
||||
|
||||
```typescript
|
||||
interface ImageConstraints {
|
||||
maxSizeMB: number;
|
||||
maxWidth: number;
|
||||
maxHeight: number;
|
||||
maxCount: number;
|
||||
supportedFormats: string[];
|
||||
}
|
||||
|
||||
const PROVIDER_CONSTRAINTS: Record<string, ImageConstraints> = {
|
||||
anthropic: {
|
||||
maxSizeMB: 3.75,
|
||||
maxWidth: 8000,
|
||||
maxHeight: 8000,
|
||||
maxCount: 20,
|
||||
supportedFormats: ["image/jpeg", "image/png", "image/gif", "image/webp"]
|
||||
},
|
||||
google: {
|
||||
maxSizeMB: 20, // for inline data
|
||||
maxWidth: Infinity,
|
||||
maxHeight: Infinity,
|
||||
maxCount: Infinity,
|
||||
supportedFormats: ["image/jpeg", "image/png", "image/gif", "image/webp"]
|
||||
},
|
||||
openai: {
|
||||
maxSizeMB: 20,
|
||||
maxWidth: Infinity,
|
||||
maxHeight: Infinity,
|
||||
maxCount: Infinity,
|
||||
supportedFormats: ["image/jpeg", "image/png", "image/gif", "image/webp"]
|
||||
}
|
||||
};
|
||||
|
||||
async function validateImage(
|
||||
source: ImageSource,
|
||||
provider: string
|
||||
): Promise<void> {
|
||||
const constraints = PROVIDER_CONSTRAINTS[provider];
|
||||
|
||||
// Get image data
|
||||
let imageBuffer: Buffer;
|
||||
if (source.type === "file") {
|
||||
imageBuffer = await fs.readFile(source.path);
|
||||
} else if (source.type === "base64") {
|
||||
imageBuffer = Buffer.from(source.data, 'base64');
|
||||
} else {
|
||||
// For URLs, might need to fetch and validate
|
||||
return;
|
||||
}
|
||||
|
||||
// Check size
|
||||
const sizeMB = imageBuffer.length / (1024 * 1024);
|
||||
if (sizeMB > constraints.maxSizeMB) {
|
||||
throw new Error(`Image exceeds ${constraints.maxSizeMB}MB limit`);
|
||||
}
|
||||
|
||||
// Could add dimension checks using image processing library
|
||||
}
|
||||
```
|
||||
|
||||
## Implementation Considerations
|
||||
|
||||
1. **Automatic Format Conversion**:
|
||||
- Convert URLs to base64 for providers that don't support URLs
|
||||
- Handle file paths by reading and encoding files
|
||||
- Optimize image size/quality when needed
|
||||
|
||||
2. **Error Handling**:
|
||||
- Validate image formats before sending
|
||||
- Check model capabilities
|
||||
- Provide clear error messages for unsupported features
|
||||
|
||||
3. **Performance**:
|
||||
- Cache base64 encodings for reused images
|
||||
- Stream large images when possible
|
||||
- Consider using provider-specific file upload APIs for large images
|
||||
|
||||
4. **Token Counting**:
|
||||
- Images consume tokens (varies by provider)
|
||||
- Include image token estimates in usage calculations
|
||||
|
||||
5. **Fallback Strategies**:
|
||||
- If model doesn't support images, extract text description
|
||||
- Offer image-to-text preprocessing options
|
||||
Loading…
Add table
Add a link
Reference in a new issue