feat(eot): add audio eot model support by chenghao-mou · Pull Request #1719 · livekit/agents-js

chenghao-mou · 2026-06-05T00:10:34Z

add audio eot model and local inference support, deprecating silero and turn detector plugins

Description

Changes Made

Adds streaming audio end-of-turn (EOT) detection. A single user-facing AudioTurnDetector selects between two backends:

turn-detector
turn-detector-mini

On cloud transport error or predict_end_of_turn timeout, the session swaps to mini/local for the rest of the stream (sticky per session, one warning per failure mode).
Local failures emit the default 1.0 prediction and retry on the next turn.
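The fallback behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class name `StickyFallbackSketch`, the failure-mode labels, and the assumption that `turn-detector` is the cloud model while `turn-detector-mini` is the local one are all hypothetical.

```typescript
// Hypothetical failure-mode labels; the PR does not name them.
type FailureMode = 'transport_error' | 'predict_timeout';
type Model = 'turn-detector' | 'turn-detector-mini';

class StickyFallbackSketch {
  private usingLocal = false;
  private readonly warned = new Set<FailureMode>();
  readonly warnings: string[] = [];

  // Assumption: 'turn-detector' is cloud, 'turn-detector-mini' is local.
  get model(): Model {
    return this.usingLocal ? 'turn-detector-mini' : 'turn-detector';
  }

  // A cloud failure switches to the local model for the rest of the session
  // and records at most one warning per failure mode.
  onCloudFailure(mode: FailureMode): void {
    this.usingLocal = true;
    if (!this.warned.has(mode)) {
      this.warned.add(mode);
      this.warnings.push(`cloud EOT failed (${mode}); using local model`);
    }
  }

  // A local failure yields the default prediction; the next turn retries.
  onLocalFailure(): number {
    return 1.0;
  }
}

const session = new StickyFallbackSketch();
session.onCloudFailure('predict_timeout');
session.onCloudFailure('predict_timeout'); // same mode: no second warning
session.onCloudFailure('transport_error');
```

After these three calls the sketch has recorded two warnings (one per mode) and stays on the local model.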

A user-set unlikely_threshold is scaled multiplicatively against the cloud default so the operating point survives a fallback.
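One plausible reading of "scaled multiplicatively" is that the ratio between the user's threshold and the cloud default is carried over to the local default. The helper below is a sketch under that assumption; `rescaleThreshold` and the numbers used are hypothetical and are not taken from the PR.

```typescript
// Assumed formula: local = (user / cloudDefault) * localDefault, clamped to [0, 1].
function rescaleThreshold(
  userThreshold: number,
  cloudDefault: number,
  localDefault: number,
): number {
  if (cloudDefault <= 0) return localDefault; // avoid dividing by zero
  const ratio = userThreshold / cloudDefault;
  return Math.min(1, Math.max(0, ratio * localDefault));
}

// Illustrative values only: a user threshold at half the cloud default
// maps to half the local default.
const rescaledExample = rescaleThreshold(0.28, 0.56, 0.4);
```

With these illustrative inputs the ratio is 0.5, so the rescaled threshold is half of the local default.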

Pre-Review Checklist

Build passes: All builds (lint, typecheck, tests) pass locally
AI-generated code reviewed: Removed unnecessary comments and ensured code quality
Changes explained: All changes are properly documented and justified above
Scope appropriate: All changes relate to the PR title, or explanations provided for why they're included
Video demo: A small video demo showing changes works as expected and did not break any existing functionality using Agent Playground (if applicable)

Testing

Automated tests added/updated (if applicable)
All tests pass
Make sure both restaurant_agent.ts and realtime_agent.ts work properly (for major changes)

Additional Notes

Python PR: livekit/agents#4722

Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.

add audio eot model and local inference support, deprecating silero and turn detector plugins

…frame The AudioFrame emitted on START_OF_SPEECH / END_OF_SPEECH sliced off the prefix-padding samples but still reported `samplesPerChannel = speechBufferIndex`, so the frame's metadata claimed more samples than its data contained and downstream consumers (STT, transcription) lost the pre-roll context the buffer machinery is designed to preserve. Slice from 0 instead so data length matches samplesPerChannel and the prefix-padding pre-roll is delivered, matching the Python original. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
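The mismatch this commit message describes can be reproduced with a small mono-audio sketch. The `FrameLike` type, the `buildFrame` helper, and the sample counts are hypothetical stand-ins, not the actual VAD code.

```typescript
// Minimal stand-in for an audio frame (mono, 16-bit samples).
interface FrameLike {
  data: Int16Array;
  samplesPerChannel: number;
}

// Builds a frame from the speech buffer, slicing from `sliceStart`.
function buildFrame(
  buffer: Int16Array,
  speechBufferIndex: number,
  sliceStart: number,
): FrameLike {
  return {
    data: buffer.subarray(sliceStart, speechBufferIndex),
    samplesPerChannel: speechBufferIndex,
  };
}

const speechBuffer = new Int16Array(10);
const prefixPaddingSamples = 3; // illustrative
const speechBufferIndex = 8; // illustrative

// Before the fix: the pre-roll is sliced off, but the metadata still counts it.
const before = buildFrame(speechBuffer, speechBufferIndex, prefixPaddingSamples);
// After the fix: slicing from 0 keeps data length and metadata in agreement.
const after = buildFrame(speechBuffer, speechBufferIndex, 0);
```

In the "before" frame the data holds 5 samples while the metadata claims 8; in the "after" frame both are 8.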

…dal-eou # Conflicts: # agents/src/voice/agent_activity.ts # agents/src/voice/agent_session.ts # agents/src/voice/audio_recognition.ts # examples/src/gemini_realtime_agent.ts # examples/src/runway_avatar.ts

changeset-bot · 2026-06-05T00:10:40Z

🦋 Changeset detected

Latest commit: a0f0d71

The changes in this PR will be included in the next version bump.


devin-ai-integration

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

devin-ai-integration · 2026-06-05T00:15:13Z

🟡 Early return in runEOUDetection blocks audio EOT detector when STT transcript is empty

When an AudioTurnDetector is active alongside STT, the early return at agents/src/voice/audio_recognition.ts:1325-1328 prevents the audio EOT detector's cached prediction from being consumed on VAD end-of-speech. The condition this.stt && !this.audioTranscript && this.turnDetectionMode !== 'manual' fires whenever STT is enabled but hasn't produced a transcript yet (e.g. STT is slow or the utterance is very short). Since turnDetectionMode is undefined when an AudioTurnDetector is wired in (not 'manual'), the function exits without ever reaching the audio detector's predictEndOfTurn call below.

The audio model was already warmed up during the silence window and activated on VAD EOS (agents/src/voice/audio_recognition.ts:1794), so a cached prediction is available. But the system must wait for the STT FINAL_TRANSCRIPT before consuming it, adding STT_latency - audio_model_latency of unnecessary delay. This defeats the core value proposition of the audio EOT model (committing faster than waiting for STT).
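The guard quoted above reduces to a three-input predicate. The sketch below mirrors that condition with plain parameters; `shouldReturnEarly` and the `TurnDetectionModeLike` values other than 'manual' are assumptions for illustration.

```typescript
// Only 'manual' appears in the review; the other members are assumed.
type TurnDetectionModeLike = 'manual' | 'vad' | 'stt' | undefined;

// Mirrors: this.stt && !this.audioTranscript && this.turnDetectionMode !== 'manual'
function shouldReturnEarly(
  hasStt: boolean,
  audioTranscript: string,
  mode: TurnDetectionModeLike,
): boolean {
  return hasStt && !audioTranscript && mode !== 'manual';
}

// STT enabled, no transcript yet, mode undefined: the function exits early.
const exitsWithoutTranscript = shouldReturnEarly(true, '', undefined);
// Once a transcript exists, detection proceeds.
const exitsWithTranscript = shouldReturnEarly(true, 'hello', undefined);
```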


This is expected, since we cannot commit a turn when the STT transcript is empty.

…dal-EOU Resolved conflicts: - agents/src/voice/audio_recognition.ts: combined the EOU branch's hasUserVad / userSpeakingEvent (bounce-race) model with main's sttLastSpeakingTime and vadStream.flush() refactor. - agents/package.json: kept @livekit/protocol ^1.46.5 (needed for EOT proto types); dropped the duplicate @livekit/typed-emitter entry. - pnpm-lock.yaml: regenerated via pnpm install from main's base. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The merge commit captured a stale lockfile (pre-`pnpm install`), leaving `@livekit/protocol` at ^1.46.4 and omitting `@livekit/local-inference`. Regenerated so `pnpm install --frozen-lockfile` passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

devin-ai-integration

Devin Review found 1 new potential issue.

View 11 additional findings in Devin Review.

devin-ai-integration · 2026-06-05T00:31:04Z

+    if (rescaled !== undefined) {
+      this._thresholds = rescaled;
+      this._default = this.lookup('en');
+    }


🟡 _toLocalFallback rescaling drops local thresholds for languages absent from the cloud server map

In _toLocalFallback, after _resolve() populates _thresholds with the full LOCAL_LANGUAGES table (plus any user overrides), line 210 replaces _thresholds entirely with the rescaled map. However, rescaled only contains languages that were present in the old cloud server map (server). Languages that exist in LOCAL_LANGUAGES but were not in the server's response (e.g., the server only returned {en, ja, fr} while local supports 14 languages) are silently dropped. After fallback, lookup() for those languages returns _default (the rescaled English value) instead of their proper per-language local threshold.

The fix is to merge rescaled into the existing _thresholds (produced by _resolve()) rather than replacing it, so uncovered languages retain their local defaults.

Concrete example

If SERVER_THRESHOLDS = { en: 0.56, ja: 0.37, fr: 0.575 } and LOCAL_LANGUAGES has 14 entries, after fallback lookup('de') returns the rescaled English value (~0.36) instead of LOCAL_LANGUAGES.de (0.245).

Suggested change

-    if (rescaled !== undefined) {
-      this._thresholds = rescaled;
-      this._default = this.lookup('en');
-    }
+    if (rescaled !== undefined) {
+      this._thresholds = { ...this._thresholds, ...rescaled };
+      this._default = this.lookup('en');
+    }
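The difference between replacing and merging can be checked in isolation. The sketch below uses a standalone `lookup` helper and small tables; only the `de` value (0.245) and the approximate rescaled English value (0.36) come from the review comment, and every other name and number is illustrative.

```typescript
type Thresholds = Record<string, number>;

// Standalone stand-in for the class's lookup(): fall back to a default
// when the language has no entry.
function lookup(table: Thresholds, lang: string, fallback: number): number {
  return table[lang] ?? fallback;
}

// Local table: 'de' is from the review; 'en' and 'ja' are illustrative.
const localThresholds: Thresholds = { en: 0.3, de: 0.245, ja: 0.2 };
// Rescaled map only covers languages the server returned (no 'de').
const rescaledThresholds: Thresholds = { en: 0.36, ja: 0.24 };

// Replace: 'de' is gone, so lookup falls back to the rescaled English value.
const replaced: Thresholds = rescaledThresholds;
const deAfterReplace = lookup(replaced, 'de', replaced['en']);

// Merge: rescaled entries win, uncovered languages keep local defaults.
const merged: Thresholds = { ...localThresholds, ...rescaledThresholds };
const deAfterMerge = lookup(merged, 'de', merged['en']);
```

With the replace, `deAfterReplace` is 0.36; with the merge, `deAfterMerge` stays at 0.245 while `en` and `ja` take their rescaled values.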


devin-ai-integration

Devin Review found 1 new potential issue.

View 14 additional findings in Devin Review.

devin-ai-integration · 2026-06-05T01:05:38Z

+    const workerToken = process.env.LIVEKIT_WORKER_TOKEN;
+    if (workerToken) {
+      headers['X-LiveKit-Worker-Token'] = workerToken;
+    }


🟡 LIVEKIT_WORKER_TOKEN header only sent when job context exists

The LIVEKIT_WORKER_TOKEN environment variable is read inside the if (ctx) block at agents/src/inference/utils.ts:74-77, making it conditional on getJobContext(false) returning a non-null value. Unlike the other headers in that block (X-LiveKit-Room-Id, X-LiveKit-Job-Id, X-LiveKit-Agent-Id) which are derived from the job context, the worker token is a process-level env var that identifies the worker itself. If buildMetadataHeaders() is ever called outside a job context (or if getJobContext(false) returns null in an edge case), the worker token is silently omitted from the request headers, even though it's available in the environment.

Prompt for agents

In agents/src/inference/utils.ts, the LIVEKIT_WORKER_TOKEN env var read at lines 74-77 is nested inside the `if (ctx)` block (the getJobContext check). The worker token is a process-level credential, not derived from the job context like the surrounding Room-Id/Job-Id/Agent-Id headers. Move the worker token block (lines 74-77) outside and after the `if (ctx) { ... }` block, so the token is included in metadata headers regardless of whether a job context is available. The other headers inside the if-block correctly depend on ctx and should stay where they are.
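For reference, the restructuring proposed in the prompt above would look roughly like the sketch below. It is not the code in agents/src/inference/utils.ts: `buildMetadataHeadersSketch`, the `JobContextLike` shape, and passing the environment as a parameter are assumptions made to keep the example self-contained.

```typescript
// Hypothetical minimal shape of the job context fields used for headers.
interface JobContextLike {
  roomId: string;
  jobId: string;
  agentId: string;
}

function buildMetadataHeadersSketch(
  ctx: JobContextLike | undefined,
  env: Record<string, string | undefined>,
): Record<string, string> {
  const headers: Record<string, string> = {};

  // Headers derived from the job context stay inside the guard.
  if (ctx) {
    headers['X-LiveKit-Room-Id'] = ctx.roomId;
    headers['X-LiveKit-Job-Id'] = ctx.jobId;
    headers['X-LiveKit-Agent-Id'] = ctx.agentId;
  }

  // The worker token is process-level, so it is read outside the guard.
  const workerToken = env['LIVEKIT_WORKER_TOKEN'];
  if (workerToken) {
    headers['X-LiveKit-Worker-Token'] = workerToken;
  }
  return headers;
}

// Without a job context, only the worker token header is present.
const headersWithoutCtx = buildMetadataHeadersSketch(undefined, {
  LIVEKIT_WORKER_TOKEN: 'example-token',
});
```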


This is expected because it is designed for hosted agents and they always have a job context.

chenghao-mou and others added 13 commits May 27, 2026 01:14

feat(eot): add audio eot model support

b4ad9eb

add audio eot model and local inference support, deprecating silero and turn detector plugins

Create busy-aliens-wink.md

96c4563

more clean up and refactoring

80e6ab3

more refactoring and clean up

4dc88ce

more refactoring and clean up

e7fdb49

address comment

efaed38

address comment

242a5ff

rename backend to model

eec5078

Merge remote-tracking branch 'origin/main' into feat/AGT-2520-multimo…

2832616

…dal-eou # Conflicts: # agents/src/voice/agent_activity.ts # agents/src/voice/agent_session.ts # agents/src/voice/audio_recognition.ts # examples/src/gemini_realtime_agent.ts # examples/src/runway_avatar.ts

address comments

86d8aa7

update default parsing and read from cloud

5813b18

reformat

791e038

chenghao-mou mentioned this pull request Jun 5, 2026

feat(eot): add audio eot model support #1613

Closed

8 tasks

devin-ai-integration Bot reviewed Jun 5, 2026


chenghao-mou and others added 2 commits June 5, 2026 01:19

devin-ai-integration Bot reviewed Jun 5, 2026


chenghao-mou added 2 commits June 5, 2026 01:49

fix port misses

5ef0fb8

add missing worker token

a0f0d71

devin-ai-integration Bot reviewed Jun 5, 2026


feat(eot): add audio eot model support #1719
chenghao-mou wants to merge 17 commits into main from feat/AGT-2520-multimodal-EOU