fix(google): split text exceeding 5000-byte API limit in TTS synthesize #5004

weiguangli-io wants to merge 7 commits into livekit:main from
Conversation
Google Cloud TTS synthesize_speech API rejects input.text or input.ssml longer than 5000 bytes. When using non-streaming mode (use_streaming=False), the plugin sent the full text in a single request without checking byte length, causing 400 INVALID_ARGUMENT errors even for moderately sized text -- especially with multi-byte scripts like Telugu. This change adds automatic text chunking in ChunkedStream._run() that splits input at sentence, word, and character boundaries to stay within the 5000-byte limit. SSML wrapper overhead (<speak></speak>) is also accounted for when enable_ssml=True. Fixes livekit#4762
davidzhao
left a comment
nice catch, let's update the tokenization strategy and we can merge it
```python
else:
    max_bytes = GOOGLE_TTS_MAX_INPUT_BYTES

return _split_text_by_bytes(self._input_text, max_bytes)
```
a tokenizer should be used instead of hand crafting one. see: https://github.com/livekit/agents/blob/main/livekit-plugins/livekit-plugins-cartesia/livekit/plugins/cartesia/tts.py#L399 as an example
we already bundle blingfire for tokenization
…ex for text splitting

Replace the regex-based sentence splitting (re.split on punctuation) with the blingfire SentenceTokenizer that is already bundled with livekit-agents. This follows the same pattern used by the Cartesia TTS plugin and provides more robust sentence-boundary detection.
Structured content (SSML, markup) cannot be naively split by the sentence tokenizer as it would break XML tag structure. Return such content as-is and only apply byte-limit splitting to plain text. Remove the now-unused SSML_WRAPPER_OVERHEAD constant.
Replace the standalone _split_text_by_bytes, _split_on_words, and _split_on_chars helper functions with the bundled blingfire sentence tokenizer and basic word tokenizer for text chunking in _get_text_chunks. This aligns with the tokenization pattern used in other plugins (e.g. Cartesia). Signed-off-by: OiPunk <codingpunk@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Container-based formats (MP3, OGG_OPUS) produce independently encoded files with their own headers for each synthesize_speech call. When multiple chunks are concatenated into a single AudioEmitter, the decoder cannot handle multiple container files, resulting in corrupted or truncated audio. Fall back to PCM (raw samples) which is safe to concatenate. Signed-off-by: OiPunk <codingpunk@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
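The fallback described in this commit could look something like the following hypothetical helper. Encoding names are shown as plain strings for illustration; the real code would use the `texttospeech.AudioEncoding` enum from the Google Cloud client library.

```python
# Each synthesize_speech response in MP3 or OGG_OPUS is a complete,
# independently encoded container file with its own header, so byte-wise
# concatenation of several responses is not a valid file. Raw LINEAR16
# PCM samples, by contrast, concatenate safely.
_CONTAINER_ENCODINGS = {"MP3", "OGG_OPUS"}

def encoding_for_chunked_synthesis(requested: str) -> str:
    """Pick a concatenation-safe encoding when output spans multiple chunks."""
    if requested in _CONTAINER_ENCODINGS:
        return "LINEAR16"  # raw PCM: frames can be appended directly
    return requested
```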
…enizer Move the byte-limit chunking logic from _get_text_chunks into a standalone _split_sentences_by_bytes function for better testability and separation of concerns. Promote the WordTokenizer instance to a module-level constant to avoid re-creating it on every call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```
default) for boundary detection, with a word tokenizer fallback for
sentences that individually exceed the byte limit. Structured content
```
I don't think we need a word tokenizer fallback here. do you see a realistic use case where a sentence is more than 5000 chars?
synthesizing part of a sentence will generally not produce the right intonations or prosody
Good point — a single sentence realistically won't exceed 5000 bytes. I'll remove the word tokenizer fallback and keep it simple with just sentence-level splitting. Will push the update shortly.
Per maintainer feedback: splitting mid-sentence produces incorrect intonation. Instead of falling back to word-level tokenization when a sentence exceeds 5000 bytes, log a warning and send the sentence as-is. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Word tokenizer fallback removed in commit 846392f. Now only using sentence-level splitting as requested. If a single sentence exceeds 5000 bytes, it's sent as-is with a warning logged.

Hi @davidzhao, I've updated the tokenization strategy per your feedback. The word tokenizer fallback has been removed, and now we only use sentence-level splitting via blingfire. If a single sentence exceeds 5000 bytes, we log a warning and send it as-is (to preserve intonation and prosody). Could you please review the changes?
Summary
Fixes #4762
Google Cloud TTS `synthesize_speech` API rejects `input.text` or `input.ssml` longer than 5000 bytes. When using non-streaming mode (`use_streaming=False`), the Google TTS plugin sent the entire input text in a single API request without any byte-length validation. This caused `400 INVALID_ARGUMENT` errors even for moderately sized text, especially with multi-byte scripts (e.g., Telugu, where each character encodes to 3 bytes in UTF-8).

Changes

- Add text chunking in `ChunkedStream._run()` that splits input text into chunks that each fit within the 5000-byte API limit
- Split first at sentence boundaries (`.` `!` `?` etc.), then at word boundaries, and as a last resort at character boundaries
- SSML wrapper overhead (`<speak></speak>` = 15 bytes) is accounted for when `enable_ssml=True`
- Refactor `_build_ssml()` to a static method and extract `_build_synthesis_input()` for cleaner code
For text within the limit (the common case), behavior is identical to before: a single API call is made. When text exceeds 5000 bytes, the new `_get_text_chunks()` method splits it into safe-sized pieces, each of which is synthesized independently, and the audio is concatenated.

Test plan
- Tested the `_split_text_by_bytes` logic with ASCII text, Telugu (multi-byte) text, and SSML overhead edge cases
- `ruff check` and `ruff format` pass cleanly
- Verified with the `enable_ssml=True` configuration from the issue