fix(voice): make non-streaming STT work for turn detection #6240

Draft
chenghao-mou wants to merge 1 commit into main from chenghaomuo/agt-3051-make-non-streaming-stt-work-for-turn-detection-9920
chenghao-mou (Member) commented Jun 26, 2026

In turn_detection="stt" mode, a non-streaming STT (e.g. one wrapped by stt.StreamAdapter, including a FallbackAdapter failing over to a non-streaming provider like gpt-4o-mini-transcribe) emits END_OF_SPEECH before recognize() returns the transcript. So the STT-end commit path runs with an empty transcript and bails, and the final transcript is left to commit via the FINAL_TRANSCRIPT handler, which is gated on not self._speaking.

When recognize() is slow and VAD re-detects speech during that latency window, _speaking is True again by the time the final lands, so the turn is never committed and stalls β€” even though the transcript did arrive. This matches the trace: the transcript reached the user_turn/FallbackAdapter span, VAD was still detecting speech, and the turn never progressed.
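The ordering problem can be illustrated with a toy model of the commit logic. This is only a sketch under simplifying assumptions, not the actual livekit-agents code: `ToyTurnState`, `run`, and the event tuples are hypothetical, and only the flag and event names (`_speaking`, `_user_turn_committed`, END_OF_SPEECH, FINAL_TRANSCRIPT) are taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTurnState:
    """Hypothetical stand-in for the pre-fix turn-detection state."""

    speaking: bool = False             # mirrors _speaking (driven by VAD)
    user_turn_committed: bool = False  # mirrors _user_turn_committed
    transcript: str = ""
    committed_turns: list = field(default_factory=list)

    def handle(self, event, text=""):
        if event == "VAD_START":
            self.speaking = True
        elif event == "VAD_END":
            self.speaking = False
        elif event == "END_OF_SPEECH":
            # End-of-turn is signalled for the segment, then a commit is tried.
            self.user_turn_committed = True
            self._try_commit()
        elif event == "FINAL_TRANSCRIPT":
            self.transcript = text
            # Pre-fix gate: the final only commits while VAD reports silence.
            if not self.speaking:
                self._try_commit()

    def _try_commit(self):
        if not self.transcript:
            return  # empty transcript: bail without committing
        self.committed_turns.append(self.transcript)
        self.transcript = ""
        self.user_turn_committed = False


def run(events):
    """Feed a list of (event, *args) tuples through a fresh toy state."""
    state = ToyTurnState()
    for name, *args in events:
        state.handle(name, *args)
    return state.committed_turns


# Streaming STT: FINAL arrives before END_OF_SPEECH.
streaming = [
    ("VAD_START",),
    ("FINAL_TRANSCRIPT", "hello"),
    ("VAD_END",),
    ("END_OF_SPEECH",),
]

# Non-streaming STT behind an adapter: END_OF_SPEECH arrives first, and VAD
# re-detects speech while recognize() is still running.
non_streaming = [
    ("VAD_START",),
    ("VAD_END",),
    ("END_OF_SPEECH",),
    ("VAD_START",),
    ("FINAL_TRANSCRIPT", "hello"),
]

print(run(streaming))      # ['hello']
print(run(non_streaming))  # []  (the turn stalls)
```

In the second sequence the transcript does arrive, but nothing commits it: END_OF_SPEECH ran with an empty transcript, and the final landed while `speaking` was True again.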

Fix

Commit the final transcript regardless of _speaking once end-of-turn was already signalled for the segment (_user_turn_committed): the transcript belongs to the segment that just ended. Streaming STT is unaffected: it sends FINAL before END_OF_SPEECH, so _user_turn_committed is still False at the final and the existing END_OF_SPEECH commit path is used.
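Read as a boolean condition, the change might look like the predicate below. This is a hedged sketch rather than the actual diff: `should_commit_final` is a hypothetical helper, and the real FINAL_TRANSCRIPT handler may apply further checks that are not described here.

```python
def should_commit_final(speaking: bool, user_turn_committed: bool) -> bool:
    """Hypothetical gate for committing on FINAL_TRANSCRIPT after the fix.

    speaking            -- mirrors _speaking (VAD currently detects speech)
    user_turn_committed -- mirrors _user_turn_committed (end-of-turn was
                           already signalled for the segment)
    """
    # Pre-fix condition was `not speaking` alone. The fix also lets the final
    # through when end-of-turn was already signalled, whatever VAD reports.
    return (not speaking) or user_turn_committed


# Late final after END_OF_SPEECH while VAD has re-detected speech: commits.
print(should_commit_final(speaking=True, user_turn_committed=True))    # True
# Final while still speaking and no end-of-turn signalled: must not commit.
print(should_commit_final(speaking=True, user_turn_committed=False))   # False
# Final during silence: commits, as before the fix.
print(should_commit_final(speaking=False, user_turn_committed=False))  # True
```

The first two calls correspond to the two scenarios the description singles out: the late final that previously stalled, and a final during ongoing speech that must stay uncommitted.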

Tests

  • test_late_stt_final_commits_turn_when_vad_redetects_speech: reproduces the stall (StreamAdapter ordering + VAD re-detecting speech during recognize) and verifies the turn now commits.
  • test_stt_final_while_speaking_does_not_commit_without_end_of_speech: regression guard. A final while still speaking, with no end-of-turn signalled, must not commit.

…rn detection

In turn_detection="stt" mode, a non-streaming STT (e.g. wrapped by
stt.StreamAdapter, including a FallbackAdapter failing over to a
non-streaming provider like gpt-4o-mini-transcribe) emits END_OF_SPEECH
*before* recognize() returns the transcript. The STT-end commit path
therefore runs with an empty transcript and bails, leaving the final
transcript to commit via the FINAL_TRANSCRIPT handler, but that path is
gated on `not self._speaking`.

When recognize() is slow and VAD re-detects speech during that latency
window, `_speaking` is True again by the time the final lands, so the
turn is never committed and stalls, even though the transcript arrived
(AGT-3051).

Commit the final regardless of `_speaking` once end-of-turn was already
signalled for the segment (`_user_turn_committed`); the transcript
belongs to the segment that just ended. Streaming STT is unaffected: it
sends FINAL before END_OF_SPEECH, so `_user_turn_committed` is still
False at the final and the existing END_OF_SPEECH commit path is used.

Generated with [Linear](https://linear.app/livekit/issue/AGT-3051/make-non-streaming-stt-work-for-turn-detection#agent-session-016ed31f)

Co-authored-by: linear-code[bot] <222613912+linear-code[bot]@users.noreply.github.com>
@chenghao-mou chenghao-mou force-pushed the chenghaomuo/agt-3051-make-non-streaming-stt-work-for-turn-detection-9920 branch from 5462a20 to 95b863a Compare June 26, 2026 11:25