feat(smallestai): word_timestamps for TTS, v4 STT endpoints, eou_timeout fix #5953

Open
harshitajain165 wants to merge 4 commits into livekit:main from harshitajain165:feat/smallest-tts-word-timestamps

harshitajain165 commented Jun 3, 2026

Summary

Three improvements to the livekit-plugins-smallestai plugin:

1. Word-level timestamps for TTS (smallestai.TTS)

Adds opt-in word_timestamps parameter to the Smallest AI WebSocket TTS integration, matching the feature shipped in Lightning v3.1 and v3.1 Pro.

  • New word_timestamps: bool = False constructor parameter
  • Sets aligned_transcript=word_timestamps on TTSCapabilities
  • Sends word_timestamps: true in the WebSocket payload when enabled
  • Handles word_timestamp status events by calling output_emitter.push_timed_transcript(TimedString(...))
  • Supported on base-queue English + Hindi voices (meher, devansh, kartik, maithili, liam, avery); other voices silently emit no word events

```python
from livekit.plugins import smallestai

tts = smallestai.TTS(
    word_timestamps=True,  # opt in to per-word timed transcript entries
)
```
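
To illustrate the event handling described above, here is a minimal, framework-free sketch of turning a word_timestamp status message into a timed entry. The message shape (a `data` object with `word`, `start`, and `end` fields) and the `TimedWord` class are assumptions made for illustration only; the real payload fields and the framework's `TimedString` type may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TimedWord:
    """Hypothetical stand-in for the framework's TimedString."""

    text: str
    start_time: float
    end_time: float


def parse_word_event(msg: dict[str, Any]) -> Optional[TimedWord]:
    """Return a TimedWord for a word_timestamp event, else None.

    The "data"/"word"/"start"/"end" keys are assumed, not taken from the API docs.
    """
    if msg.get("status") != "word_timestamp":
        return None  # audio chunks, completion events, etc. are ignored here
    data = msg.get("data") or {}
    try:
        return TimedWord(
            text=str(data["word"]),
            start_time=float(data["start"]),
            end_time=float(data["end"]),
        )
    except (KeyError, TypeError, ValueError):
        return None  # malformed event: skip rather than break the audio stream
```

Returning None for anything unexpected keeps audio playback unaffected when a voice emits no word events or a payload is malformed.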

2. STT endpoints updated to v4 API format (smallestai.STT)

The Smallest AI API moved from /{model}/get_text (path-based model) to model as a query parameter:

  • Streaming: wss://api.smallest.ai/waves/v1/stt/live?model=pulse
  • Batch: https://api.smallest.ai/waves/v1/stt/?model=pulse
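
A small sketch of how the two URLs above can be built with the model passed as a query parameter. The helper name `build_stt_url` and the `BASE_HOST` constant are hypothetical and not taken from the plugin code.

```python
from urllib.parse import urlencode

# Host and path prefix shared by both endpoints listed above.
BASE_HOST = "api.smallest.ai/waves/v1/stt"


def build_stt_url(model: str, *, streaming: bool) -> str:
    """Build the v4-style STT URL with the model as a query parameter."""
    scheme = "wss" if streaming else "https"
    path = "live" if streaming else ""  # batch uses the bare /stt/ path
    query = urlencode({"model": model})
    return f"{scheme}://{BASE_HOST}/{path}?{query}"
```

Using `urlencode` rather than string concatenation keeps the URL valid if a model name ever contains characters that need escaping.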

3. Fix eou_timeout_ms bug (smallestai.STT)

The old > 0 guard silently omitted eou_timeout_ms when it was set to 0, so the server fell back to its default 800 ms end-of-utterance (EOU) detection, which conflicts with LiveKit's own VAD-based turn detection.

The fix always sends eou_timeout_ms, so the default of 0 explicitly disables server-side EOU and lets LiveKit's VAD control turn detection entirely. Users who want server-side EOU can pass a value from 100 to 10000 (milliseconds).
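
The difference between the old and new behavior can be sketched as follows. The function names are hypothetical and the real plugin builds more parameters than this; only the guard logic is shown.

```python
def build_params_old(eou_timeout_ms: int) -> dict[str, str]:
    """Old behavior: the > 0 guard drops the parameter when it is 0."""
    params: dict[str, str] = {}
    if eou_timeout_ms > 0:
        params["eou_timeout_ms"] = str(eou_timeout_ms)
    return params


def build_params_fixed(eou_timeout_ms: int) -> dict[str, str]:
    """Fixed behavior: always send the value, so 0 reaches the server."""
    return {"eou_timeout_ms": str(eou_timeout_ms)}
```

With the old guard, a value of 0 produces an empty parameter set, which the server treats as "use the default"; the fixed version sends `eou_timeout_ms=0` explicitly.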

Test plan

  • Verify TTS word timestamps fire push_timed_transcript events for supported voices (meher, devansh, etc.)
  • Verify unsupported voices work normally with word_timestamps=True (no errors, just no transcript events)
  • Verify STT streaming connects successfully to new endpoint
  • Verify STT batch transcription works with new endpoint
  • Verify eou_timeout_ms=0 disables server EOU (no server-triggered finals without LiveKit VAD triggering first)
  • Verify eou_timeout_ms=500 enables server EOU at 500ms

Adds per-word timing events (enabled by default) to the Smallest AI
WebSocket TTS integration, mirroring the pipecat implementation.

- Add word_timestamps: bool = True to _TTSOptions, TTS.__init__,
  and update_options()
- Set aligned_transcript=word_timestamps on TTSCapabilities so the
  framework knows word-level timing is available
- Send word_timestamps: true in the WebSocket payload when enabled
- Handle word_timestamp status events by calling
  output_emitter.push_timed_transcript(TimedString(...))

Supported on base-queue English + Hindi voices (meher, devansh,
kartik, maithili, liam, avery); other voices emit no word events
so the default-on is safe for all voices.

Old format used /{model}/get_text as the path segment.
New API uses /stt/ (batch) and /stt/live (streaming) with
model as a query parameter instead.

- Batch:     https://api.smallest.ai/waves/v1/stt/?model={model}
- Streaming: wss://api.smallest.ai/waves/v1/stt/live?model={model}

Aligns with the raw API behavior — word timestamps are opt-in,
matching docs.smallest.ai which requires passing word_timestamps=true
explicitly to enable the feature.

The old > 0 guard silently omitted the parameter when 0, causing the
server to apply its 800ms default EOU detection — conflicting with
LiveKit's own VAD-based turn detection.

Always send eou_timeout_ms so that the default of 0 explicitly disables
server-side EOU. Users who want server EOU can pass 100–10000.
harshitajain165 force-pushed the feat/smallest-tts-word-timestamps branch from 3a5854c to 86d15fe on June 3, 2026 10:28
harshitajain165 marked this pull request as ready for review on June 3, 2026 12:51
@harshitajain165
Contributor Author

Hey @tinalenguyen
Requesting a review whenever you get a chance

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.

🔴 STTCapabilities.streaming is hardcoded to True even for pulse-pro which doesn't support streaming

When model="pulse-pro" is passed to STT.__init__(), the capabilities are still constructed with streaming=True (line 148). However, stream() raises ValueError for pulse-pro (lines 245-248). The agent framework checks capabilities.streaming at livekit-agents/livekit/agents/voice/agent.py:423 to decide whether to call stream() directly or wrap with a StreamAdapter. Because streaming=True, the framework will skip the StreamAdapter wrapping and call stream() directly at agent.py:433, which crashes with ValueError("pulse-pro does not support streaming..."). The streaming capability should be conditional on the model.

(Refers to line 148)
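
One way to make the capability conditional on the model, shown as a self-contained sketch. The `Capabilities` dataclass and `NON_STREAMING_MODELS` set are hypothetical stand-ins, not the plugin's actual `STTCapabilities` construction.

```python
from dataclasses import dataclass

# Models that only support batch transcription, per the review comment above.
NON_STREAMING_MODELS = frozenset({"pulse-pro"})


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical stand-in for STTCapabilities."""

    streaming: bool


def capabilities_for(model: str) -> Capabilities:
    """Advertise streaming only for models that can actually stream."""
    return Capabilities(streaming=model not in NON_STREAMING_MODELS)
```

With `streaming` reported as False for pulse-pro, a framework that checks the capability before calling `stream()` would take the adapter path instead of hitting the ValueError.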

Member

Seems like a legit comment?
