feat(eot): add audio eot model support #1719
add audio eot model and local inference support, deprecating silero and turn detector plugins
…frame The AudioFrame emitted on START_OF_SPEECH / END_OF_SPEECH sliced off the prefix-padding samples but still reported `samplesPerChannel = speechBufferIndex`, so the frame's metadata claimed more samples than its data contained and downstream consumers (STT, transcription) lost the pre-roll context the buffer machinery is designed to preserve. Slice from 0 instead so data length matches samplesPerChannel and the prefix-padding pre-roll is delivered, matching the Python original. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
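The mismatch described in this commit message can be reproduced with a small sketch. `FrameLike`, `prefixPaddingSamples`, and both helper functions are hypothetical stand-ins (mono audio assumed), not the actual VAD code; only `samplesPerChannel` and `speechBufferIndex` come from the commit message.

```typescript
// Hypothetical, simplified stand-in for the real AudioFrame type (mono audio assumed).
interface FrameLike {
  data: Int16Array;
  samplesPerChannel: number;
}

// Pre-fix behaviour as described in the commit message: the prefix padding is
// sliced off, but samplesPerChannel still reports the full buffered length.
function sliceWithoutPreRoll(
  speechBuffer: Int16Array,
  speechBufferIndex: number,
  prefixPaddingSamples: number, // hypothetical parameter name
): FrameLike {
  return {
    data: speechBuffer.subarray(prefixPaddingSamples, speechBufferIndex),
    samplesPerChannel: speechBufferIndex,
  };
}

// Post-fix behaviour: slice from 0 so the pre-roll is kept and the lengths agree.
function sliceWithPreRoll(speechBuffer: Int16Array, speechBufferIndex: number): FrameLike {
  return {
    data: speechBuffer.subarray(0, speechBufferIndex),
    samplesPerChannel: speechBufferIndex,
  };
}

const buffer = new Int16Array(1000); // illustrative sizes only
const before = sliceWithoutPreRoll(buffer, 800, 200);
const after = sliceWithPreRoll(buffer, 800);
console.log(before.data.length === before.samplesPerChannel); // metadata and data disagree
console.log(after.data.length === after.samplesPerChannel); // metadata and data agree
```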
…dal-eou # Conflicts: # agents/src/voice/agent_activity.ts # agents/src/voice/agent_session.ts # agents/src/voice/audio_recognition.ts # examples/src/gemini_realtime_agent.ts # examples/src/runway_avatar.ts
🦋 Changeset detected. Latest commit: a0f0d71. The changes in this PR will be included in the next version bump.
🟡 Early return in runEOUDetection blocks audio EOT detector when STT transcript is empty
When an `AudioTurnDetector` is active alongside STT, the early return at `agents/src/voice/audio_recognition.ts:1325-1328` prevents the audio EOT detector's cached prediction from being consumed on VAD end-of-speech. The condition `this.stt && !this.audioTranscript && this.turnDetectionMode !== 'manual'` fires whenever STT is enabled but hasn't produced a transcript yet (e.g. STT is slow or the utterance is very short). Since `turnDetectionMode` is `undefined` when an `AudioTurnDetector` is wired in (not `'manual'`), the function exits without ever reaching the audio detector's `predictEndOfTurn` call below.
The audio model was already warmed up during the silence window and activated on VAD EOS (`agents/src/voice/audio_recognition.ts:1794`), so a cached prediction is available. But the system must wait for the STT `FINAL_TRANSCRIPT` before consuming it, adding `STT_latency - audio_model_latency` of unnecessary delay. This defeats the core value proposition of the audio EOT model (committing faster than waiting for STT).
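To make the guard easier to reason about, here is a minimal sketch of the condition quoted above as a pure function. `EouGuardInput` and `returnsEarly` are hypothetical names introduced for illustration; the real check lives inside `runEOUDetection` and reads instance fields.

```typescript
// Hypothetical input shape mirroring the three fields the quoted condition reads.
interface EouGuardInput {
  hasStt: boolean; // stands in for `this.stt` being set
  audioTranscript: string; // stands in for `this.audioTranscript`
  turnDetectionMode: string | undefined; // only 'manual' is named in the review
}

// Same boolean structure as the condition quoted in the review comment.
function returnsEarly(input: EouGuardInput): boolean {
  return input.hasStt && !input.audioTranscript && input.turnDetectionMode !== 'manual';
}

// STT enabled, no transcript yet, audio detector wired in (mode undefined):
// the guard fires and the cached audio prediction is not consumed yet.
const sttPending = returnsEarly({ hasStt: true, audioTranscript: '', turnDetectionMode: undefined });

// Once a transcript has arrived the guard no longer fires.
const sttReady = returnsEarly({ hasStt: true, audioTranscript: 'hello', turnDetectionMode: undefined });

// Manual mode bypasses the guard even with an empty transcript.
const manual = returnsEarly({ hasStt: true, audioTranscript: '', turnDetectionMode: 'manual' });

console.log({ sttPending, sttReady, manual });
```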
This is expected, since we cannot commit a turn when the STT transcript is empty.
…dal-EOU Resolved conflicts: - agents/src/voice/audio_recognition.ts: combined the EOU branch's hasUserVad / userSpeakingEvent (bounce-race) model with main's sttLastSpeakingTime and vadStream.flush() refactor. - agents/package.json: kept @livekit/protocol ^1.46.5 (needed for EOT proto types); dropped the duplicate @livekit/typed-emitter entry. - pnpm-lock.yaml: regenerated via pnpm install from main's base. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The merge commit captured a stale lockfile (pre-`pnpm install`), leaving `@livekit/protocol` at ^1.46.4 and omitting `@livekit/local-inference`. Regenerated so `pnpm install --frozen-lockfile` passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    if (rescaled !== undefined) {
      this._thresholds = rescaled;
      this._default = this.lookup('en');
    }
🟡 _toLocalFallback rescaling drops local thresholds for languages absent from the cloud server map
In _toLocalFallback, after _resolve() populates _thresholds with the full LOCAL_LANGUAGES table (plus any user overrides), line 210 replaces _thresholds entirely with the rescaled map. However, rescaled only contains languages that were present in the old cloud server map (server). Languages that exist in LOCAL_LANGUAGES but were not in the server's response (e.g., the server only returned {en, ja, fr} while local supports 14 languages) are silently dropped. After fallback, lookup() for those languages returns _default (the rescaled English value) instead of their proper per-language local threshold.
The fix is to merge rescaled into the existing _thresholds (produced by _resolve()) rather than replacing it, so uncovered languages retain their local defaults.
Concrete example
If SERVER_THRESHOLDS = { en: 0.56, ja: 0.37, fr: 0.575 } and LOCAL_LANGUAGES has 14 entries, after fallback lookup('de') returns the rescaled English value (~0.36) instead of LOCAL_LANGUAGES.de (0.245).
    - if (rescaled !== undefined) {
    -   this._thresholds = rescaled;
    -   this._default = this.lookup('en');
    - }
    + if (rescaled !== undefined) {
    +   this._thresholds = { ...this._thresholds, ...rescaled };
    +   this._default = this.lookup('en');
    + }
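The difference between replacing and merging the threshold map can be checked with a standalone sketch. The `lookup` helper and the map shapes are assumptions; `de: 0.245` and the rescaled English value of about `0.36` come from the review comment, and every other number is made up for illustration.

```typescript
type Thresholds = Record<string, number>;

// Hypothetical lookup: per-language value if present, otherwise the default.
function lookup(thresholds: Thresholds, lang: string, fallback: number): number {
  return thresholds[lang] ?? fallback;
}

// Local table after `_resolve()`; only `de` is taken from the review comment.
const localThresholds: Thresholds = { en: 0.3, de: 0.245, ja: 0.2, fr: 0.31 };

// Rescaled map: covers only the languages the cloud server map contained.
const rescaled: Thresholds = { en: 0.36, ja: 0.24, fr: 0.37 };

// Replacing: `de` disappears, so its lookup falls back to the rescaled English default.
const replaced: Thresholds = rescaled;
const replacedDe = lookup(replaced, 'de', lookup(replaced, 'en', 1.0));

// Merging: rescaled entries win where present, uncovered languages keep local values.
const merged: Thresholds = { ...localThresholds, ...rescaled };
const mergedDe = lookup(merged, 'de', lookup(merged, 'en', 1.0));

console.log({ replacedDe, mergedDe });
```

With these inputs the replaced map returns the English default for `de`, while the merged map keeps the local `de` value and still applies the rescaled `en`, `ja`, and `fr` entries.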
    const workerToken = process.env.LIVEKIT_WORKER_TOKEN;
    if (workerToken) {
      headers['X-LiveKit-Worker-Token'] = workerToken;
    }
🟡 LIVEKIT_WORKER_TOKEN header only sent when job context exists
The LIVEKIT_WORKER_TOKEN environment variable is read inside the if (ctx) block at agents/src/inference/utils.ts:74-77, making it conditional on getJobContext(false) returning a non-null value. Unlike the other headers in that block (X-LiveKit-Room-Id, X-LiveKit-Job-Id, X-LiveKit-Agent-Id) which are derived from the job context, the worker token is a process-level env var that identifies the worker itself. If buildMetadataHeaders() is ever called outside a job context (or if getJobContext(false) returns null in an edge case), the worker token is silently omitted from the request headers, even though it's available in the environment.
Prompt for agents
In agents/src/inference/utils.ts, the LIVEKIT_WORKER_TOKEN env var read at lines 74-77 is nested inside the `if (ctx)` block (the getJobContext check). The worker token is a process-level credential, not derived from the job context like the surrounding Room-Id/Job-Id/Agent-Id headers. Move the worker token block (lines 74-77) outside and after the `if (ctx) { ... }` block, so the token is included in metadata headers regardless of whether a job context is available. The other headers inside the if-block correctly depend on ctx and should stay where they are.
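The restructuring this prompt asks for might look roughly like the sketch below. It illustrates the reviewer's proposal only and is not the PR's code: `JobContextLike`, its field names, and `buildMetadataHeadersSketch` are hypothetical, and the environment is passed in as a parameter so the example stays self-contained. The header names are the ones quoted in the review.

```typescript
// Hypothetical shape; the real job context exposes these values differently.
interface JobContextLike {
  roomId: string;
  jobId: string;
  agentId: string;
}

function buildMetadataHeadersSketch(
  ctx: JobContextLike | undefined,
  env: Record<string, string | undefined>,
): Record<string, string> {
  const headers: Record<string, string> = {};

  // Job-derived headers stay conditional on the job context.
  if (ctx) {
    headers['X-LiveKit-Room-Id'] = ctx.roomId;
    headers['X-LiveKit-Job-Id'] = ctx.jobId;
    headers['X-LiveKit-Agent-Id'] = ctx.agentId;
  }

  // Process-level token read outside the `if (ctx)` block, as the prompt proposes.
  const workerToken = env['LIVEKIT_WORKER_TOKEN'];
  if (workerToken) {
    headers['X-LiveKit-Worker-Token'] = workerToken;
  }

  return headers;
}

const withoutCtx = buildMetadataHeadersSketch(undefined, { LIVEKIT_WORKER_TOKEN: 'example-token' });
const withCtx = buildMetadataHeadersSketch(
  { roomId: 'room-1', jobId: 'job-1', agentId: 'agent-1' },
  {},
);
console.log(withoutCtx, withCtx);
```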
This is expected because it is designed for hosted agents and they always have a job context.
## Description

## Changes Made

Adds streaming audio end-of-turn detection. Single user-facing `AudioTurnDetector` that selects between two backends:

- `turn-detector`
- `turn-detector-mini`

On cloud transport error or `predict_end_of_turn` timeout, the session swaps to mini/local for the rest of the stream (sticky per session, one warning per failure mode).

Local failures emit the default `1.0` prediction and retry on the next turn.

A user-set `unlikely_threshold` is scaled multiplicatively against the cloud default so the operating point survives a fallback.

## Pre-Review Checklist

## Testing

- `restaurant_agent.ts` and `realtime_agent.ts` work properly (for major changes)

## Additional Notes
Python PR: livekit/agents#4722
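For reviewers, the fallback behaviour described under Changes Made can be sketched as follows. This is one plausible reading, not the implementation: `rescaleThreshold`, `StickyFallback`, the `FailureMode` names, and the clamping to `[0, 1]` are all assumptions, and the numeric defaults in the usage lines are illustrative.

```typescript
// Hypothetical failure-mode labels for the two cloud failures named in the description.
type FailureMode = 'transport_error' | 'timeout';

// Assumed reading of "scaled multiplicatively against the cloud default":
// keep the ratio user/cloudDefault and apply it to the local default.
function rescaleThreshold(userThreshold: number, cloudDefault: number, localDefault: number): number {
  if (cloudDefault <= 0) return localDefault; // guard against division by zero
  const ratio = userThreshold / cloudDefault;
  return Math.min(1, Math.max(0, ratio * localDefault)); // clamping is an assumption
}

// Sticky per session: once any cloud failure occurs, stay on the local backend,
// and emit at most one warning per failure mode.
class StickyFallback {
  private usingLocal = false;
  private readonly warned = new Set<FailureMode>();
  readonly warnings: string[] = [];

  get backend(): 'cloud' | 'local' {
    return this.usingLocal ? 'local' : 'cloud';
  }

  onCloudFailure(mode: FailureMode): void {
    this.usingLocal = true;
    if (!this.warned.has(mode)) {
      this.warned.add(mode);
      this.warnings.push(`cloud EOT failed (${mode}); falling back to local`);
    }
  }
}

const session = new StickyFallback();
session.onCloudFailure('timeout');
session.onCloudFailure('timeout'); // second timeout: no additional warning
session.onCloudFailure('transport_error');

// A user threshold at half the cloud default maps to half the local default.
const localThreshold = rescaleThreshold(0.28, 0.56, 0.3);
console.log(session.backend, session.warnings.length, localThreshold);
```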
Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.