
fix(google): split text exceeding 5000-byte API limit in TTS synthesize#5004

Open
weiguangli-io wants to merge 7 commits into livekit:main from weiguangli-io:codex/livekit-4762-google-tts-limit

Conversation

@weiguangli-io

Summary

Fixes #4762

Google Cloud TTS synthesize_speech API rejects input.text or input.ssml longer than 5000 bytes. When using non-streaming mode (use_streaming=False), the Google TTS plugin sent the entire input text in a single API request without any byte-length validation. This caused 400 INVALID_ARGUMENT errors even for moderately sized text, especially with multi-byte scripts (e.g., Telugu where each character encodes to 3 bytes in UTF-8).

Changes

  • Add automatic text chunking in ChunkedStream._run() that splits input text into chunks that each fit within the 5000-byte API limit
  • Text is split hierarchically: first at sentence boundaries (.!? etc.), then at word boundaries, and as a last resort at character boundaries
  • SSML wrapper overhead (<speak></speak> = 15 bytes) is accounted for when enable_ssml=True
  • Refactor _build_ssml() to a static method and extract _build_synthesis_input() for cleaner code
  • Audio responses from multiple chunks are pushed sequentially to the output emitter, producing seamless audio output
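The byte-versus-character distinction driving this fix is easy to illustrate with a short standard-library sketch (not the plugin's actual code; the constant name here is hypothetical):

```python
SSML_WRAPPER = "<speak></speak>"  # hypothetical constant for illustration

def utf8_len(text: str) -> int:
    """Byte length of text as the Google TTS API counts it (UTF-8 encoded)."""
    return len(text.encode("utf-8"))

# The SSML wrapper itself consumes part of the 5000-byte budget.
assert utf8_len(SSML_WRAPPER) == 15

# Telugu characters encode to 3 bytes each in UTF-8, so roughly 1666
# characters already exhaust the 5000-byte limit even though len()
# reports far fewer "characters".
telugu = "తెలుగు"  # 6 code points
print(len(telugu), utf8_len(telugu))  # 6 18
```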

How it works

For text within the limit (the common case), behavior is identical to before: a single API call is made. When text exceeds 5000 bytes, the new _get_text_chunks() method splits it into safe-sized pieces; each piece is synthesized independently and the resulting audio is concatenated.
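The greedy packing idea can be sketched as follows. This is a minimal stand-alone sketch, not the plugin's code: a naive regex splitter stands in for the bundled blingfire tokenizer, and the function name mirrors but is not the plugin's helper.

```python
import re

MAX_BYTES = 5000  # Google TTS synthesize_speech input limit

def split_text_by_bytes(text: str, max_bytes: int = MAX_BYTES) -> list[str]:
    """Greedily pack sentences into chunks whose UTF-8 size fits max_bytes."""
    # Naive sentence boundary detection; the real plugin uses a tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # An oversized single sentence is carried over as-is; splitting
            # mid-sentence would hurt prosody (see review discussion below).
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(split_text_by_bytes("One. Two. Three.", max_bytes=10))
# ['One. Two.', 'Three.']
```

A tiny max_bytes is used above only to exercise the splitting path; real input under 5000 bytes comes back as a single chunk.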

Test plan

  • Verified _split_text_by_bytes logic with ASCII text, Telugu (multi-byte) text, and SSML overhead edge cases
  • Verified ruff check and ruff format pass cleanly
  • Integration test with Google Cloud TTS using Telugu language and enable_ssml=True configuration from the issue

Google Cloud TTS synthesize_speech API rejects input.text or input.ssml
longer than 5000 bytes. When using non-streaming mode (use_streaming=False),
the plugin sent the full text in a single request without checking byte
length, causing 400 INVALID_ARGUMENT errors even for moderately sized text
-- especially with multi-byte scripts like Telugu.

This change adds automatic text chunking in ChunkedStream._run() that
splits input at sentence, word, and character boundaries to stay within
the 5000-byte limit. SSML wrapper overhead (<speak></speak>) is also
accounted for when enable_ssml=True.

Fixes livekit#4762
@CLAassistant

CLAassistant commented Mar 4, 2026

CLA assistant check
All committers have signed the CLA.


@davidzhao (Member) left a comment


nice catch, let's update the tokenization strategy and we can merge it

else:
    max_bytes = GOOGLE_TTS_MAX_INPUT_BYTES

return _split_text_by_bytes(self._input_text, max_bytes)
Member


a tokenizer should be used instead of hand-crafting one. See https://github.com/livekit/agents/blob/main/livekit-plugins/livekit-plugins-cartesia/livekit/plugins/cartesia/tts.py#L399 as an example.

we already bundle blingfire for tokenization

…ex for text splitting

Replace the regex-based sentence splitting (re.split on punctuation) with the
blingfire SentenceTokenizer that is already bundled with livekit-agents. This
follows the same pattern used by the Cartesia TTS plugin and provides more
robust sentence boundary detection.

weiguangli-io and others added 2 commits March 6, 2026 17:51
Structured content (SSML, markup) cannot be naively split by the
sentence tokenizer as it would break XML tag structure. Return
such content as-is and only apply byte-limit splitting to plain text.

Remove the now-unused SSML_WRAPPER_OVERHEAD constant.
Replace the standalone _split_text_by_bytes, _split_on_words, and
_split_on_chars helper functions with the bundled blingfire sentence
tokenizer and basic word tokenizer for text chunking in _get_text_chunks.
This aligns with the tokenization pattern used in other plugins
(e.g. Cartesia).

Signed-off-by: OiPunk <codingpunk@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Container-based formats (MP3, OGG_OPUS) produce independently encoded
files with their own headers for each synthesize_speech call. When
multiple chunks are concatenated into a single AudioEmitter, the decoder
cannot handle multiple container files, resulting in corrupted or
truncated audio. Fall back to PCM (raw samples) which is safe to
concatenate.

Signed-off-by: OiPunk <codingpunk@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
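The container-versus-raw distinction in the commit above can be demonstrated with the standard library's wave module. This is an illustration only (WAV stands in for MP3/OGG purely to show the per-file-header problem), not plugin code:

```python
import io
import wave

def make_wav(num_frames: int) -> bytes:
    """Build a minimal mono 16-bit 24 kHz WAV file of silent frames."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)        # 2 bytes per sample
        w.setframerate(24000)
        w.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()

def frames(data: bytes) -> bytes:
    """Extract the raw PCM payload from a WAV container."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return w.readframes(w.getnframes())

a, b = make_wav(100), make_wav(200)

# Raw PCM payloads concatenate cleanly: frame counts simply add up.
pcm = frames(a) + frames(b)
assert len(pcm) == 2 * (100 + 200)

# Naively concatenating the container files is not equivalent: a reader
# honors only the first file's header and never sees the second's frames.
with wave.open(io.BytesIO(a + b), "rb") as w:
    assert w.getnframes() == 100
```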

…enizer

Move the byte-limit chunking logic from _get_text_chunks into a
standalone _split_sentences_by_bytes function for better testability
and separation of concerns. Promote the WordTokenizer instance to a
module-level constant to avoid re-creating it on every call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +308 to +309
default) for boundary detection, with a word tokenizer fallback for
sentences that individually exceed the byte limit. Structured content
Member


I don't think we need a word tokenizer fallback here. do you see a realistic use case where a sentence is more than 5000 chars?

synthesizing part of a sentence will generally not produce the right intonations or prosody

Author


Good point — a single sentence realistically won't exceed 5000 bytes. I'll remove the word tokenizer fallback and keep it simple with just sentence-level splitting. Will push the update shortly.

Per maintainer feedback: splitting mid-sentence produces incorrect
intonation. Instead of falling back to word-level tokenization when a
sentence exceeds 5000 bytes, log a warning and send the sentence as-is.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@weiguangli-io
Author

Word tokenizer fallback removed in commit 846392f. Now only using sentence-level splitting as requested. If a single sentence exceeds 5000 bytes, it's sent as-is with a warning logged.

@weiguangli-io
Author

Hi @davidzhao, I've updated the tokenization strategy per your feedback. The word tokenizer fallback has been removed, and now we only use sentence-level splitting via blingfire. If a single sentence exceeds 5000 bytes, we log a warning and send it as-is (to preserve intonation and prosody). Could you please review the changes?



Development

Successfully merging this pull request may close these issues.

Google TTS fails with “input.text or input.ssml longer than 5000 bytes” even for short utterances (LiveKit Agents)

3 participants