feat(smallestai): word_timestamps for TTS, v4 STT endpoints, eou_timeout fix #5953
harshitajain165 wants to merge 4 commits
Adds optional (enabled by default) per-word timing events to the Smallest AI WebSocket TTS integration, mirroring the pipecat implementation.
- Add word_timestamps: bool = True to _TTSOptions, TTS.__init__, and update_options()
- Set aligned_transcript=word_timestamps on TTSCapabilities so the framework knows word-level timing is available
- Send word_timestamps: true in the WebSocket payload when enabled
- Handle word_timestamp status events by calling output_emitter.push_timed_transcript(TimedString(...))
Supported on base-queue English + Hindi voices (meher, devansh, kartik, maithili, liam, avery); other voices emit no word events, so enabling the option by default is safe for all voices.
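The payload flag and the event handling described above can be sketched roughly as follows. This is an illustration only, not the plugin's code: TimedWord is a hypothetical stand-in for the framework's TimedString, and the event field names (status, data, word, start, end) as well as the text and voice_id payload keys are assumptions about the wire format.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimedWord:
    """Hypothetical stand-in for the framework's TimedString."""
    text: str
    start_time: float
    end_time: float


def build_tts_payload(text: str, voice_id: str, word_timestamps: bool) -> dict:
    """Build the WebSocket request body; the flag is only sent when enabled."""
    payload = {"text": text, "voice_id": voice_id}  # key names are assumed
    if word_timestamps:
        payload["word_timestamps"] = True
    return payload


def parse_word_event(raw: str) -> Optional[TimedWord]:
    """Return a TimedWord for word_timestamp events, None for anything else."""
    msg = json.loads(raw)
    if msg.get("status") != "word_timestamp":
        return None
    data = msg.get("data", {})  # assumed event layout
    return TimedWord(
        text=data["word"],
        start_time=float(data["start"]),
        end_time=float(data["end"]),
    )
```

In the plugin itself, the parsed value would be handed to output_emitter.push_timed_transcript(...) rather than returned. Voices that emit no word events simply never reach that call.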
Old format used /{model}/get_text as the path segment.
New API uses /stt/ (batch) and /stt/live (streaming) with
model as a query parameter instead.
- Batch: https://api.smallest.ai/waves/v1/stt/?model={model}
- Streaming: wss://api.smallest.ai/waves/v1/stt/live?model={model}
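A minimal sketch of how the two URLs could be built with the model passed as a query parameter. The helper names batch_url and streaming_url are hypothetical; only the base URLs and path segments come from the commit message.

```python
from urllib.parse import urlencode

# Base URLs taken from the commit message above.
HTTP_BASE = "https://api.smallest.ai/waves/v1"
WS_BASE = "wss://api.smallest.ai/waves/v1"


def batch_url(model: str) -> str:
    """Batch transcription endpoint with the model as a query parameter."""
    return f"{HTTP_BASE}/stt/?{urlencode({'model': model})}"


def streaming_url(model: str) -> str:
    """Streaming transcription endpoint with the model as a query parameter."""
    return f"{WS_BASE}/stt/live?{urlencode({'model': model})}"
```

Using urlencode instead of plain string formatting keeps the URL valid if a model name ever contains characters that need escaping.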
Changes the word_timestamps default to False. This aligns with the raw API behavior: word timestamps are opt-in, and docs.smallest.ai requires passing word_timestamps=true explicitly to enable the feature.
The old > 0 guard silently omitted the parameter when 0, causing the server to apply its 800ms default EOU detection — conflicting with LiveKit's own VAD-based turn detection. Always send eou_timeout_ms so that the default of 0 explicitly disables server-side EOU. Users who want server EOU can pass 100–10000.
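The difference between the old guard and the fix can be illustrated with a small before/after sketch. The function names and the params dict are hypothetical; where the plugin actually places the parameter (query string or message body) is not stated here.

```python
def stream_params_old(model: str, eou_timeout_ms: int = 0) -> dict:
    """Old behavior: a value of 0 is dropped, so the server falls back to its default."""
    params = {"model": model}
    if eou_timeout_ms > 0:
        params["eou_timeout_ms"] = eou_timeout_ms
    return params


def stream_params_new(model: str, eou_timeout_ms: int = 0) -> dict:
    """Fixed behavior: the value is always sent, so 0 explicitly disables server EOU."""
    return {"model": model, "eou_timeout_ms": eou_timeout_ms}
```

With the default of 0, the old version produces no eou_timeout_ms key at all, while the new version sends eou_timeout_ms=0.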
Hey @tinalenguyen
🔴 STTCapabilities.streaming is hardcoded to True even for pulse-pro which doesn't support streaming
When model="pulse-pro" is passed to STT.__init__(), the capabilities are still constructed with streaming=True (line 148). However, stream() raises ValueError for pulse-pro (lines 245-248). The agent framework checks capabilities.streaming at livekit-agents/livekit/agents/voice/agent.py:423 to decide whether to call stream() directly or wrap with a StreamAdapter. Because streaming=True, the framework will skip the StreamAdapter wrapping and call stream() directly at agent.py:433, which crashes with ValueError("pulse-pro does not support streaming..."). The streaming capability should be conditional on the model.
(Refers to line 148)
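One possible shape for the suggested fix, as a sketch only: Capabilities is a hypothetical stand-in for STTCapabilities, and NON_STREAMING_MODELS is an assumed constant that lists only the model named in this comment.

```python
from dataclasses import dataclass

# Assumption: pulse-pro is the only model without streaming support.
NON_STREAMING_MODELS = frozenset({"pulse-pro"})


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical stand-in for the framework's STTCapabilities."""
    streaming: bool


def capabilities_for(model: str) -> Capabilities:
    """Derive the streaming capability from the model instead of hardcoding True."""
    return Capabilities(streaming=model not in NON_STREAMING_MODELS)
```

With streaming reported as False for pulse-pro, the framework would take the StreamAdapter path described above instead of calling stream() directly.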
Summary
Three improvements to the livekit-plugins-smallestai plugin:

1. Word-level timestamps for TTS (smallestai.TTS)
Adds an opt-in word_timestamps parameter to the Smallest AI WebSocket TTS integration, matching the feature shipped in Lightning v3.1 and v3.1 Pro.
- Add word_timestamps: bool = False constructor parameter
- Set aligned_transcript=word_timestamps on TTSCapabilities
- Send word_timestamps: true in the WebSocket payload when enabled
- Handle word_timestamp status events by calling output_emitter.push_timed_transcript(TimedString(...))
- Supported on base-queue English + Hindi voices (meher, devansh, kartik, maithili, liam, avery); other voices silently emit no word events

2. STT endpoints updated to v4 API format (smallestai.STT)
The Smallest AI API moved from /{model}/get_text (path-based model) to model as a query parameter:
- Streaming: wss://api.smallest.ai/waves/v1/stt/live?model=pulse
- Batch: https://api.smallest.ai/waves/v1/stt/?model=pulse

3. Fix eou_timeout_ms bug (smallestai.STT)
The old > 0 guard silently omitted eou_timeout_ms when set to 0, causing the server to apply its 800ms default EOU detection, which conflicts with LiveKit's own VAD-based turn detection.
The fix always sends eou_timeout_ms, so the default of 0 explicitly disables server-side EOU and lets LiveKit's VAD control turn detection entirely. Users who want server-side EOU can pass 100–10000.

Test plan
- push_timed_transcript events for supported voices (meher, devansh, etc.)
- Other voices with word_timestamps=True (no errors, just no transcript events)
- eou_timeout_ms=0 disables server EOU (no server-triggered finals without LiveKit VAD triggering first)
- eou_timeout_ms=500 enables server EOU at 500ms