feat(eot): add audio eot model support #1719

Open
chenghao-mou wants to merge 17 commits into main from feat/AGT-2520-multimodal-EOU
Member

@chenghao-mou chenghao-mou commented Jun 5, 2026

Add audio EOT model and local inference support, deprecating the silero and turn detector plugins.

Description

Changes Made

Adds streaming audio end-of-turn (EOT) detection. A single user-facing AudioTurnDetector selects between two backends:

  • turn-detector
  • turn-detector-mini

On cloud transport error or predict_end_of_turn timeout, the session swaps to mini/local for the rest of the stream (sticky per session, one warning per failure mode).
Local failures emit the default 1.0 prediction and retry on the next turn.
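
A minimal sketch of the fallback bookkeeping described above, for readers who want the state machine in one place. All names here (`createSessionBackend`, `reportCloudFailure`, `localPrediction`) are hypothetical and not taken from the PR; only the behavior (sticky swap, one warning per failure mode, default prediction of 1.0 on local failure) comes from the description.

```typescript
// Hypothetical names throughout; this is not the PR's implementation.
type Backend = 'cloud' | 'local';
type CloudFailure = 'transport' | 'timeout';

// Default prediction emitted when local inference fails (per the description).
const DEFAULT_PREDICTION = 1.0;

function createSessionBackend(warn: (msg: string) => void) {
  let active: Backend = 'cloud';
  const warned = new Set<CloudFailure>();

  return {
    current: (): Backend => active,

    // Sticky: once a cloud failure is reported, the session never returns to cloud.
    reportCloudFailure(mode: CloudFailure): void {
      active = 'local';
      if (!warned.has(mode)) {
        warned.add(mode); // warn only once per failure mode
        warn(`cloud ${mode} failure; using local backend for the rest of the stream`);
      }
    },

    // A local failure yields the default prediction; the next turn simply tries again.
    localPrediction(run: () => number): number {
      try {
        return run();
      } catch {
        return DEFAULT_PREDICTION;
      }
    },
  };
}
```

The sketch is synchronous to keep the state transitions visible; a real detector would wrap asynchronous inference calls and a timeout around the same two decisions.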

A user-set unlikely_threshold is scaled multiplicatively against the cloud default so the operating point survives a fallback.
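
One possible reading of "scaled multiplicatively", as a sketch: the user's threshold is expressed as a ratio of the cloud default, and that ratio is applied to the local default. The function name, the clamp to [0, 1], and the guard for a non-positive cloud default are assumptions for illustration, not details confirmed by the PR.

```typescript
// Hypothetical helper; the formula is an assumed interpretation of the description.
function rescaleThreshold(
  userThreshold: number,
  cloudDefault: number,
  localDefault: number,
): number {
  // Assumption: with no usable cloud default, fall back to the local default unchanged.
  if (cloudDefault <= 0) return localDefault;
  const ratio = userThreshold / cloudDefault;
  // Assumption: keep the result inside the valid probability range.
  return Math.min(1, Math.max(0, localDefault * ratio));
}
```

For example, a user threshold at half the cloud default would land at half the local default, so the relative operating point is preserved across the swap.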

Pre-Review Checklist

  • Build passes: All builds (lint, typecheck, tests) pass locally
  • AI-generated code reviewed: Removed unnecessary comments and ensured code quality
  • Changes explained: All changes are properly documented and justified above
  • Scope appropriate: All changes relate to the PR title, or explanations provided for why they're included
  • Video demo: A small video demo showing that the changes work as expected and do not break any existing functionality, recorded using Agent Playground (if applicable)

Testing

  • Automated tests added/updated (if applicable)
  • All tests pass
  • Make sure both restaurant_agent.ts and realtime_agent.ts work properly (for major changes)

Additional Notes

Python PR: livekit/agents#4722


Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.

chenghao-mou and others added 13 commits May 27, 2026 01:14
add audio eot model and local inference support, deprecating silero and turn detector plugins
…frame

The AudioFrame emitted on START_OF_SPEECH / END_OF_SPEECH sliced off
the prefix-padding samples but still reported `samplesPerChannel =
speechBufferIndex`, so the frame's metadata claimed more samples than
its data contained and downstream consumers (STT, transcription) lost
the pre-roll context the buffer machinery is designed to preserve.

Slice from 0 instead so data length matches samplesPerChannel and the
prefix-padding pre-roll is delivered, matching the Python original.
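
The invariant this commit restores can be sketched as follows. `FrameLike` and `buildSpeechFrame` are hypothetical stand-ins (mono audio assumed), not the actual AudioFrame class or the VAD code from the repository.

```typescript
// Hypothetical plain-object stand-in for an audio frame; mono audio is assumed.
interface FrameLike {
  data: Int16Array;
  sampleRate: number;
  channels: number;
  samplesPerChannel: number;
}

function buildSpeechFrame(
  speechBuffer: Int16Array,
  speechBufferIndex: number,
  sampleRate: number,
): FrameLike {
  // Slice from 0 so the prefix-padding pre-roll is included and the data
  // length equals the reported samplesPerChannel.
  const data = speechBuffer.slice(0, speechBufferIndex);
  return { data, sampleRate, channels: 1, samplesPerChannel: speechBufferIndex };
}
```

Slicing from the prefix-padding offset instead would return fewer samples than `samplesPerChannel` claims, which is the mismatch described above.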

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dal-eou

# Conflicts:
#	agents/src/voice/agent_activity.ts
#	agents/src/voice/agent_session.ts
#	agents/src/voice/audio_recognition.ts
#	examples/src/gemini_realtime_agent.ts
#	examples/src/runway_avatar.ts

changeset-bot Bot commented Jun 5, 2026

🦋 Changeset detected

Latest commit: a0f0d71

The changes in this PR will be included in the next version bump.

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Contributor

@devin-ai-integration devin-ai-integration Bot Jun 5, 2026

🟡 Early return in runEOUDetection blocks audio EOT detector when STT transcript is empty

When an AudioTurnDetector is active alongside STT, the early return at agents/src/voice/audio_recognition.ts:1325-1328 prevents the audio EOT detector's cached prediction from being consumed on VAD end-of-speech. The condition this.stt && !this.audioTranscript && this.turnDetectionMode !== 'manual' fires whenever STT is enabled but hasn't produced a transcript yet (e.g. STT is slow or the utterance is very short). Since turnDetectionMode is undefined when an AudioTurnDetector is wired in (not 'manual'), the function exits without ever reaching the audio detector's predictEndOfTurn call below.

The audio model was already warmed up during the silence window and activated on VAD EOS (agents/src/voice/audio_recognition.ts:1794), so a cached prediction is available. But the system must wait for the STT FINAL_TRANSCRIPT before consuming it, adding STT_latency - audio_model_latency of unnecessary delay. This defeats the core value proposition of the audio EOT model (committing faster than waiting for STT).
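
For reference, the guard quoted above reduces to a small predicate. This is a sketch with hypothetical names (`EouState`, `shouldReturnEarly`); only the condition itself is taken from the finding.

```typescript
// Hypothetical shape; mirrors the quoted condition
// `this.stt && !this.audioTranscript && this.turnDetectionMode !== 'manual'`.
interface EouState {
  hasStt: boolean;
  audioTranscript: string;
  turnDetectionMode?: string; // undefined when an AudioTurnDetector is wired in (per the finding)
}

function shouldReturnEarly(state: EouState): boolean {
  return state.hasStt && !state.audioTranscript && state.turnDetectionMode !== 'manual';
}
```

With STT enabled, an empty transcript, and an undefined mode, the predicate is true, so the audio detector's cached prediction is not consumed until a transcript arrives.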

Member Author

This is expected, since we cannot commit a turn when the STT transcript is empty.

chenghao-mou and others added 2 commits June 5, 2026 01:19
…dal-EOU

Resolved conflicts:
- agents/src/voice/audio_recognition.ts: combined the EOU branch's
  hasUserVad / userSpeakingEvent (bounce-race) model with main's
  sttLastSpeakingTime and vadStream.flush() refactor.
- agents/package.json: kept @livekit/protocol ^1.46.5 (needed for EOT
  proto types); dropped the duplicate @livekit/typed-emitter entry.
- pnpm-lock.yaml: regenerated via pnpm install from main's base.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The merge commit captured a stale lockfile (pre-`pnpm install`), leaving
`@livekit/protocol` at ^1.46.4 and omitting `@livekit/local-inference`.
Regenerated so `pnpm install --frozen-lockfile` passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

View 11 additional findings in Devin Review.

Comment on lines +209 to +212
if (rescaled !== undefined) {
  this._thresholds = rescaled;
  this._default = this.lookup('en');
}
Contributor

🟡 _toLocalFallback rescaling drops local thresholds for languages absent from the cloud server map

In _toLocalFallback, after _resolve() populates _thresholds with the full LOCAL_LANGUAGES table (plus any user overrides), line 210 replaces _thresholds entirely with the rescaled map. However, rescaled only contains languages that were present in the old cloud server map (server). Languages that exist in LOCAL_LANGUAGES but were not in the server's response (e.g., the server only returned {en, ja, fr} while local supports 14 languages) are silently dropped. After fallback, lookup() for those languages returns _default (the rescaled English value) instead of their proper per-language local threshold.

The fix is to merge rescaled into the existing _thresholds (produced by _resolve()) rather than replacing it, so uncovered languages retain their local defaults.

Concrete example

If SERVER_THRESHOLDS = { en: 0.56, ja: 0.37, fr: 0.575 } and LOCAL_LANGUAGES has 14 entries, after fallback lookup('de') returns the rescaled English value (~0.36) instead of LOCAL_LANGUAGES.de (0.245).

Suggested change
- if (rescaled !== undefined) {
-   this._thresholds = rescaled;
-   this._default = this.lookup('en');
- }
+ if (rescaled !== undefined) {
+   this._thresholds = { ...this._thresholds, ...rescaled };
+   this._default = this.lookup('en');
+ }
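
The difference between replacing and merging can be reproduced in isolation. In this sketch only `de: 0.245` and the rescaled English value of roughly 0.36 come from the concrete example above; every other number, and the helper names, are placeholders chosen for illustration.

```typescript
type Thresholds = Record<string, number>;

// Hypothetical stand-in for lookup(): per-language value, else the default.
function lookupThreshold(thresholds: Thresholds, fallback: number, lang: string): number {
  return thresholds[lang] ?? fallback;
}

function compareFallbackMaps(): { replaced: number; merged: number } {
  // Local table: only `de` is from the review comment, the rest are placeholders.
  const localTable: Thresholds = { en: 0.3, de: 0.245, ja: 0.2, fr: 0.3 };
  // Rescaled map covers only languages the cloud server returned (en, ja, fr).
  const rescaledEn = 0.36;
  const rescaledTable: Thresholds = { en: rescaledEn, ja: 0.24, fr: 0.37 };

  // Replace: `de` is gone, so the lookup falls back to the rescaled English value.
  const replaced = lookupThreshold(rescaledTable, rescaledEn, 'de');
  // Merge: `de` keeps its local default while covered languages are rescaled.
  const mergedTable: Thresholds = { ...localTable, ...rescaledTable };
  const merged = lookupThreshold(mergedTable, rescaledEn, 'de');

  return { replaced, merged };
}
```

Under these placeholder values, the replace path yields 0.36 for German and the merge path yields 0.245, matching the behavior the finding describes.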
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

View 14 additional findings in Devin Review.

Comment on lines +74 to +77
const workerToken = process.env.LIVEKIT_WORKER_TOKEN;
if (workerToken) {
  headers['X-LiveKit-Worker-Token'] = workerToken;
}
Contributor

🟡 LIVEKIT_WORKER_TOKEN header only sent when job context exists

The LIVEKIT_WORKER_TOKEN environment variable is read inside the if (ctx) block at agents/src/inference/utils.ts:74-77, making it conditional on getJobContext(false) returning a non-null value. Unlike the other headers in that block (X-LiveKit-Room-Id, X-LiveKit-Job-Id, X-LiveKit-Agent-Id) which are derived from the job context, the worker token is a process-level env var that identifies the worker itself. If buildMetadataHeaders() is ever called outside a job context (or if getJobContext(false) returns null in an edge case), the worker token is silently omitted from the request headers, even though it's available in the environment.

Prompt for agents
In agents/src/inference/utils.ts, the LIVEKIT_WORKER_TOKEN env var read at lines 74-77 is nested inside the `if (ctx)` block (the getJobContext check). The worker token is a process-level credential, not derived from the job context like the surrounding Room-Id/Job-Id/Agent-Id headers. Move the worker token block (lines 74-77) outside and after the `if (ctx) { ... }` block, so the token is included in metadata headers regardless of whether a job context is available. The other headers inside the if-block correctly depend on ctx and should stay where they are.
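
A sketch of the restructuring that prompt describes, for comparison only. The function name, the `JobContextLike` shape, and passing the environment as a parameter are assumptions made to keep the example self-contained; the header names are the ones quoted in the finding.

```typescript
// Hypothetical context shape; the real job context type is not shown on this page.
interface JobContextLike {
  roomId: string;
  jobId: string;
  agentId: string;
}

function buildMetadataHeadersSketch(
  ctx: JobContextLike | undefined,
  env: Record<string, string | undefined>,
): Record<string, string> {
  const headers: Record<string, string> = {};

  // Headers derived from the job context stay inside the guard.
  if (ctx) {
    headers['X-LiveKit-Room-Id'] = ctx.roomId;
    headers['X-LiveKit-Job-Id'] = ctx.jobId;
    headers['X-LiveKit-Agent-Id'] = ctx.agentId;
  }

  // The worker token is read from the environment, independent of ctx.
  const workerToken = env['LIVEKIT_WORKER_TOKEN'];
  if (workerToken) {
    headers['X-LiveKit-Worker-Token'] = workerToken;
  }
  return headers;
}
```

Whether this change is wanted depends on whether the function is ever called outside a job context.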
Member Author

This is expected because it is designed for hosted agents and they always have a job context.
