fix(voice): make non-streaming STT work for turn detection by chenghao-mou · Pull Request #6240 · livekit/agents

chenghao-mou · 2026-06-26T11:01:00Z

In turn_detection="stt" mode, a non-streaming STT (e.g. one wrapped by stt.StreamAdapter, including a FallbackAdapter failing over to a non-streaming provider like gpt-4o-mini-transcribe) emits END_OF_SPEECH before recognize() returns the transcript. The STT-end commit path therefore runs with an empty transcript and bails, and the final transcript is left to commit via the FINAL_TRANSCRIPT handler, which is gated on not self._speaking.

When recognize() is slow and VAD re-detects speech during that latency window, _speaking is True again by the time the final transcript lands, so the turn is never committed and stalls, even though the transcript did arrive. This matches the trace: the transcript reached the user_turn/FallbackAdapter span, VAD was still detecting speech, and the turn never progressed.
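The ordering described above can be replayed with a small asyncio timeline. This is only a sketch under stated assumptions: `fake_recognize` and `simulate` are hypothetical names, the sleeps stand in for real recognition latency, and the `committed = not speaking` line models only the pre-fix gate on the FINAL_TRANSCRIPT path, not the actual livekit-agents code.

```python
import asyncio


async def fake_recognize(latency: float) -> str:
    """Hypothetical stand-in for a slow non-streaming recognize() call."""
    await asyncio.sleep(latency)
    return "turn on the lights"


async def simulate(latency: float = 0.05) -> list[str]:
    log: list[str] = []

    # The segment closes and END_OF_SPEECH is reported right away,
    # before any transcript exists, so the STT-end commit path bails.
    speaking = False
    log.append("END_OF_SPEECH (transcript empty, commit bails)")

    # Recognition of the finished segment runs in the background.
    recognize_task = asyncio.create_task(fake_recognize(latency))

    # During the recognize() latency window, VAD picks up speech again.
    await asyncio.sleep(latency / 2)
    speaking = True
    log.append("VAD start of speech")

    # The final transcript lands while speaking is True again.
    text = await recognize_task
    committed = not speaking  # pre-fix gate: only commit when not speaking
    log.append(f"FINAL_TRANSCRIPT {text!r} committed={committed}")
    return log


if __name__ == "__main__":
    for line in asyncio.run(simulate()):
        print(line)
```

In this replay the transcript arrives, but the last log entry reports that nothing was committed, which is the stall described above.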

Fix

Commit the final transcript regardless of _speaking once end-of-turn has already been signalled for the segment (_user_turn_committed), since the transcript belongs to the segment that just ended. Streaming STT is unaffected: it sends FINAL before END_OF_SPEECH, so _user_turn_committed is still False when the final arrives and the existing END_OF_SPEECH commit path is used.
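A toy model of the gating may make the two orderings easier to compare. This is a minimal sketch, not the livekit-agents implementation: the `TurnState` class, its method names, and the way the flag is reset after a commit are all assumptions made for illustration; only the two flags mirror `_speaking` and `_user_turn_committed` from the description.

```python
from dataclasses import dataclass, field


@dataclass
class TurnState:
    """Hypothetical model of the turn-commit gating described in this PR."""

    speaking: bool = False             # mirrors `_speaking`
    user_turn_committed: bool = False  # mirrors `_user_turn_committed`
    transcript: str = ""
    committed: list = field(default_factory=list)

    def on_vad_start(self) -> None:
        self.speaking = True

    def on_vad_end(self) -> None:
        self.speaking = False

    def on_end_of_speech(self) -> None:
        # End-of-turn is now signalled for this segment.
        self.user_turn_committed = True
        if self.transcript:
            self._commit()
        # With an empty transcript this path bails and waits for the final.

    def on_final_transcript(self, text: str) -> None:
        self.transcript = text
        # Pre-fix gate was only `not self.speaking`. The extra clause lets a
        # late final commit once end-of-turn was already signalled.
        if not self.speaking or self.user_turn_committed:
            self._commit()

    def _commit(self) -> None:
        self.committed.append(self.transcript)
        self.transcript = ""
        self.user_turn_committed = False  # assumption: reset after a commit


# Non-streaming ordering: END_OF_SPEECH, VAD re-detects speech, then FINAL.
late = TurnState()
late.on_end_of_speech()
late.on_vad_start()
late.on_final_transcript("hello")

# Streaming ordering: FINAL arrives while speaking, then END_OF_SPEECH.
streaming = TurnState()
streaming.on_vad_start()
streaming.on_final_transcript("hi")
streaming.on_end_of_speech()
```

In the first sequence the late final commits through the new clause. In the second, the final does not commit on arrival because the flag is still False, and the END_OF_SPEECH path commits it as before.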

Tests

test_late_stt_final_commits_turn_when_vad_redetects_speech: reproduces the stall (StreamAdapter ordering + VAD re-detecting speech during recognize) and verifies the turn now commits.
test_stt_final_while_speaking_does_not_commit_without_end_of_speech: regression guard. A final while still speaking, with no end-of-turn signalled, must not commit.

…rn detection

In turn_detection="stt" mode, a non-streaming STT (e.g. wrapped by stt.StreamAdapter, including a FallbackAdapter failing over to a non-streaming provider like gpt-4o-mini-transcribe) emits END_OF_SPEECH *before* recognize() returns the transcript. The STT-end commit path therefore runs with an empty transcript and bails, leaving the final transcript to commit via the FINAL_TRANSCRIPT handler, but that path is gated on `not self._speaking`.

When recognize() is slow and VAD re-detects speech during that latency window, `_speaking` is True again by the time the final lands, so the turn is never committed and stalls, even though the transcript arrived (AGT-3051).

Commit the final regardless of `_speaking` once end-of-turn was already signalled for the segment (`_user_turn_committed`); the transcript belongs to the segment that just ended. Streaming STT is unaffected: it sends FINAL before END_OF_SPEECH, so `_user_turn_committed` is still False at the final and the existing END_OF_SPEECH commit path is used.

Generated with [Linear](https://linear.app/livekit/issue/AGT-3051/make-non-streaming-stt-work-for-turn-detection#agent-session-016ed31f)

Co-authored-by: linear-code[bot] <222613912+linear-code[bot]@users.noreply.github.com>

chenghao-mou force-pushed the chenghaomuo/agt-3051-make-non-streaming-stt-work-for-turn-detection-9920 branch from 5462a20 to 95b863a on June 26, 2026 11:25

chenghao-mou wants to merge 1 commit into main from chenghaomuo/agt-3051-make-non-streaming-stt-work-for-turn-detection-9920
