
fix(google): split text exceeding 5000-byte API limit in TTS synthesize#5004

Open
weiguangli-io wants to merge 7 commits into livekit:main from weiguangli-io:codex/livekit-4762-google-tts-limit

Conversation

@weiguangli-io

Summary

Fixes #4762

Google Cloud TTS synthesize_speech API rejects input.text or input.ssml longer than 5000 bytes. When using non-streaming mode (use_streaming=False), the Google TTS plugin sent the entire input text in a single API request without any byte-length validation. This caused 400 INVALID_ARGUMENT errors even for moderately sized text, especially with multi-byte scripts (e.g., Telugu where each character encodes to 3 bytes in UTF-8).

Changes

  • Add automatic text chunking in ChunkedStream._run() that splits input text into chunks that each fit within the 5000-byte API limit
  • Text is split hierarchically: first at sentence boundaries (.!? etc.), then at word boundaries, and as a last resort at character boundaries
  • SSML wrapper overhead (<speak></speak> = 15 bytes) is accounted for when enable_ssml=True
  • Refactor _build_ssml() to a static method and extract _build_synthesis_input() for cleaner code
  • Audio responses from multiple chunks are pushed sequentially to the output emitter, producing seamless audio output
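The byte-versus-character distinction driving this fix is easy to illustrate with a short standard-library sketch (not the plugin's actual code; the constant name here is hypothetical):

```python
SSML_WRAPPER = "<speak></speak>"  # hypothetical constant for illustration

def utf8_len(text: str) -> int:
    """Byte length of text as the Google TTS API counts it (UTF-8 encoded)."""
    return len(text.encode("utf-8"))

# The SSML wrapper itself consumes part of the 5000-byte budget.
assert utf8_len(SSML_WRAPPER) == 15

# Telugu characters encode to 3 bytes each in UTF-8, so roughly 1666
# characters already exhaust the 5000-byte limit even though len()
# reports far fewer "characters".
telugu = "తెలుగు"  # 6 code points
print(len(telugu), utf8_len(telugu))  # 6 18
```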

How it works

For text within the limit (the common case), behavior is identical to before: a single API call is made. When text exceeds 5000 bytes, the new _get_text_chunks() method splits it into safe-sized pieces; each piece is synthesized independently and the resulting audio is concatenated.
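The greedy packing idea can be sketched as follows. This is a minimal stand-alone sketch, not the plugin's code: a naive regex splitter stands in for the bundled blingfire tokenizer, and the function name mirrors but is not the plugin's helper.

```python
import re

MAX_BYTES = 5000  # Google TTS synthesize_speech input limit

def split_text_by_bytes(text: str, max_bytes: int = MAX_BYTES) -> list[str]:
    """Greedily pack sentences into chunks whose UTF-8 size fits max_bytes."""
    # Naive sentence boundary detection; the real plugin uses a tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # An oversized single sentence is carried over as-is; splitting
            # mid-sentence would hurt prosody (see review discussion below).
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(split_text_by_bytes("One. Two. Three.", max_bytes=10))
# ['One. Two.', 'Three.']
```

A tiny max_bytes is used above only to exercise the splitting path; real input under 5000 bytes comes back as a single chunk.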

Test plan

  • Verified _split_text_by_bytes logic with ASCII text, Telugu (multi-byte) text, and SSML overhead edge cases
  • Verified ruff check and ruff format pass cleanly
  • Integration test with Google Cloud TTS using Telugu language and enable_ssml=True configuration from the issue

Google Cloud TTS synthesize_speech API rejects input.text or input.ssml
longer than 5000 bytes. When using non-streaming mode (use_streaming=False),
the plugin sent the full text in a single request without checking byte
length, causing 400 INVALID_ARGUMENT errors even for moderately sized text
-- especially with multi-byte scripts like Telugu.

This change adds automatic text chunking in ChunkedStream._run() that
splits input at sentence, word, and character boundaries to stay within
the 5000-byte limit. SSML wrapper overhead (<speak></speak>) is also
accounted for when enable_ssml=True.

Fixes livekit#4762
@CLAassistant

CLAassistant commented Mar 4, 2026

CLA assistant check
All committers have signed the CLA.


@davidzhao (Member) left a comment


nice catch, let's update the tokenization strategy and we can merge it

else:
    max_bytes = GOOGLE_TTS_MAX_INPUT_BYTES

return _split_text_by_bytes(self._input_text, max_bytes)
Member


a tokenizer should be used instead of hand-crafting one. See https://github.com/livekit/agents/blob/main/livekit-plugins/livekit-plugins-cartesia/livekit/plugins/cartesia/tts.py#L399 as an example.

we already bundle blingfire for tokenization

…ex for text splitting

Replace the regex-based sentence splitting (re.split on punctuation) with the
blingfire SentenceTokenizer that is already bundled with livekit-agents. This
follows the same pattern used by the Cartesia TTS plugin and provides more
robust sentence boundary detection.

weiguangli-io and others added 2 commits March 6, 2026 17:51
Structured content (SSML, markup) cannot be naively split by the
sentence tokenizer as it would break XML tag structure. Return
such content as-is and only apply byte-limit splitting to plain text.

Remove the now-unused SSML_WRAPPER_OVERHEAD constant.
Replace the standalone _split_text_by_bytes, _split_on_words, and
_split_on_chars helper functions with the bundled blingfire sentence
tokenizer and basic word tokenizer for text chunking in _get_text_chunks.
This aligns with the tokenization pattern used in other plugins
(e.g. Cartesia).

Signed-off-by: OiPunk <codingpunk@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Container-based formats (MP3, OGG_OPUS) produce independently encoded
files with their own headers for each synthesize_speech call. When
multiple chunks are concatenated into a single AudioEmitter, the decoder
cannot handle multiple container files, resulting in corrupted or
truncated audio. Fall back to PCM (raw samples) which is safe to
concatenate.

Signed-off-by: OiPunk <codingpunk@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
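The container-versus-raw distinction in the commit above can be demonstrated with the standard library's wave module. This is an illustration only (WAV stands in for MP3/OGG purely to show the per-file-header problem), not plugin code:

```python
import io
import wave

def make_wav(num_frames: int) -> bytes:
    """Build a minimal mono 16-bit 24 kHz WAV file of silent frames."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)        # 2 bytes per sample
        w.setframerate(24000)
        w.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()

def frames(data: bytes) -> bytes:
    """Extract the raw PCM payload from a WAV container."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return w.readframes(w.getnframes())

a, b = make_wav(100), make_wav(200)

# Raw PCM payloads concatenate cleanly: frame counts simply add up.
pcm = frames(a) + frames(b)
assert len(pcm) == 2 * (100 + 200)

# Naively concatenating the container files is not equivalent: a reader
# honors only the first file's header and never sees the second's frames.
with wave.open(io.BytesIO(a + b), "rb") as w:
    assert w.getnframes() == 100
```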

…enizer

Move the byte-limit chunking logic from _get_text_chunks into a
standalone _split_sentences_by_bytes function for better testability
and separation of concerns. Promote the WordTokenizer instance to a
module-level constant to avoid re-creating it on every call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +308 to +309
default) for boundary detection, with a word tokenizer fallback for
sentences that individually exceed the byte limit. Structured content
Member


I don't think we need a word tokenizer fallback here. do you see a realistic use case where a sentence is more than 5000 chars?

synthesizing part of a sentence will generally not produce the right intonations or prosody

Author


Good point — a single sentence realistically won't exceed 5000 bytes. I'll remove the word tokenizer fallback and keep it simple with just sentence-level splitting. Will push the update shortly.

Per maintainer feedback: splitting mid-sentence produces incorrect
intonation. Instead of falling back to word-level tokenization when a
sentence exceeds 5000 bytes, log a warning and send the sentence as-is.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@weiguangli-io
Author

Word tokenizer fallback removed in commit 846392f. Now only using sentence-level splitting as requested. If a single sentence exceeds 5000 bytes, it's sent as-is with a warning logged.

@weiguangli-io
Author

Hi @davidzhao, I've updated the tokenization strategy per your feedback. The word tokenizer fallback has been removed, and now we only use sentence-level splitting via blingfire. If a single sentence exceeds 5000 bytes, we log a warning and send it as-is (to preserve intonation and prosody). Could you please review the changes?



Development

Successfully merging this pull request may close these issues.

Google TTS fails with “input.text or input.ssml longer than 5000 bytes” even for short utterances (LiveKit Agents)

3 participants