co-mono/packages/ai/src/utils/sanitize-unicode.ts
Mario Zechner 4e7a340460 Add Unicode surrogate sanitization for all providers
Fixes an issue where unpaired Unicode surrogates in tool results caused JSON serialization errors in API providers, particularly Anthropic.

- Add sanitizeSurrogates() utility function to remove unpaired surrogates
- Apply sanitization in all provider convertMessages() functions:
  - User message text content (string and text blocks)
  - Assistant message text and thinking blocks
  - Tool result output
  - System prompts
- Valid emoji (properly paired surrogates) are preserved
- Add comprehensive test suite covering all 8 providers

Previously only Google and Groq handled unpaired surrogates correctly.
Now all providers (Anthropic, OpenAI Completions/Responses, Google, xAI, Groq, Cerebras, zAI) sanitize text before API submission.
2025-10-13 14:26:54 +02:00

/**
* Removes unpaired Unicode surrogate characters from a string.
*
* Unpaired surrogates (high surrogates 0xD800-0xDBFF without matching low surrogates 0xDC00-0xDFFF,
* or vice versa) cause JSON serialization errors in many API providers.
*
* Valid emoji and other characters outside the Basic Multilingual Plane use properly paired
* surrogates and will NOT be affected by this function.
*
* @param text - The text to sanitize
* @returns The sanitized text with unpaired surrogates removed
*
* @example
* // Valid emoji (properly paired surrogates) are preserved
* sanitizeSurrogates("Hello 🙈 World") // => "Hello 🙈 World"
*
* // Unpaired high surrogate is removed
* const unpaired = String.fromCharCode(0xD83D); // high surrogate without low
* sanitizeSurrogates(`Text ${unpaired} here`) // => "Text  here" (both surrounding spaces remain)
*/
export function sanitizeSurrogates(text: string): string {
	// Replace unpaired high surrogates (0xD800-0xDBFF not followed by low surrogate)
	// Replace unpaired low surrogates (0xDC00-0xDFFF not preceded by high surrogate)
	return text.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g, "");
}
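
The sketch below shows how the function behaves on paired and unpaired surrogates, and one way it might be applied to message content before serialization. The function body is copied from above so the snippet runs on its own. The `TextBlock` shape and the `sanitizeContent` helper are hypothetical names introduced only for illustration; the real provider `convertMessages()` implementations are not shown in this file and may look different.

```typescript
// Copy of sanitizeSurrogates from above, repeated so the sketch is self-contained.
function sanitizeSurrogates(text: string): string {
	return text.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g, "");
}

// Hypothetical content block, for illustration only.
interface TextBlock {
	type: "text";
	text: string;
}

// Hypothetical helper: sanitizes either a plain string or a list of text blocks.
function sanitizeContent(content: string | TextBlock[]): string | TextBlock[] {
	if (typeof content === "string") {
		return sanitizeSurrogates(content);
	}
	return content.map((block) => ({ type: block.type, text: sanitizeSurrogates(block.text) }));
}

// U+1F648 (🙈) is encoded as the surrogate pair 0xD83D 0xDE48.
const loneHigh = String.fromCharCode(0xd83d); // high surrogate with no low surrogate after it
const loneLow = String.fromCharCode(0xde48); // low surrogate with no high surrogate before it

// A properly paired emoji is left untouched.
const emojiKept = sanitizeSurrogates("Hello 🙈 World");

// A lone high or lone low surrogate is removed.
const highRemoved = sanitizeSurrogates(`a${loneHigh}b`);
const lowRemoved = sanitizeSurrogates(`a${loneLow}b`);

// Low followed by high is not a valid pair, so both code units are removed.
const reversedRemoved = sanitizeSurrogates(`a${loneLow}${loneHigh}b`);

// Applying the hypothetical helper to both supported content shapes.
const sanitizedString = sanitizeContent(`tool output ${loneHigh}`);
const sanitizedBlocks = sanitizeContent([{ type: "text", text: `ok ${loneLow}🙈` }]);
```

Only the offending code unit is removed, so any whitespace around it is kept as-is. Callers that need tidy spacing would have to normalize it separately.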