fix(voice): make non-streaming STT work for turn detection #6240
Draft
chenghao-mou wants to merge 1 commit into
…rn detection

In turn_detection="stt" mode, a non-streaming STT (e.g. wrapped by stt.StreamAdapter, including a FallbackAdapter failing over to a non-streaming provider like gpt-4o-mini-transcribe) emits END_OF_SPEECH *before* recognize() returns the transcript. The STT-end commit path therefore runs with an empty transcript and bails, leaving the final transcript to commit via the FINAL_TRANSCRIPT handler, but that path is gated on `not self._speaking`. When recognize() is slow and VAD re-detects speech during that latency window, `_speaking` is True again by the time the final lands, so the turn is never committed and stalls, even though the transcript arrived (AGT-3051).

Commit the final regardless of `_speaking` once end-of-turn was already signalled for the segment (`_user_turn_committed`); the transcript belongs to the segment that just ended. Streaming STT is unaffected: it sends FINAL before END_OF_SPEECH, so `_user_turn_committed` is still False at the final and the existing END_OF_SPEECH commit path is used.

Generated with [Linear](https://linear.app/livekit/issue/AGT-3051/make-non-streaming-stt-work-for-turn-detection#agent-session-016ed31f)

Co-authored-by: linear-code[bot] <222613912+linear-code[bot]@users.noreply.github.com>
5462a20 to 95b863a
In `turn_detection="stt"` mode, a non-streaming STT (e.g. wrapped by `stt.StreamAdapter`, including a `FallbackAdapter` failing over to a non-streaming provider like `gpt-4o-mini-transcribe`) emits `END_OF_SPEECH` before `recognize()` returns the transcript. So the STT-end commit path runs with an empty transcript and bails, and the final transcript is left to commit via the `FINAL_TRANSCRIPT` handler, which is gated on `not self._speaking`.

When `recognize()` is slow and VAD re-detects speech during that latency window, `_speaking` is `True` again by the time the final lands, so the turn is never committed and stalls, even though the transcript did arrive. This matches the trace: the transcript reached the `user_turn`/`FallbackAdapter` span, VAD was still detecting speech, and the turn never progressed.

Fix
Commit the final transcript regardless of `_speaking` once end-of-turn was already signalled for the segment (`_user_turn_committed`); the transcript belongs to the segment that just ended. Streaming STT is unaffected: it sends `FINAL` before `END_OF_SPEECH`, so `_user_turn_committed` is still `False` at the final and the existing `END_OF_SPEECH` commit path is used.

Tests
- `test_late_stt_final_commits_turn_when_vad_redetects_speech`: reproduces the stall (StreamAdapter ordering + VAD re-detecting speech during recognize) and verifies the turn now commits.
- `test_stt_final_while_speaking_does_not_commit_without_end_of_speech`: regression guard: a final while still speaking, with no end-of-turn signalled, must not commit.
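
For reviewers who want the event ordering spelled out, here is a toy model of the commit gate described above. It is a sketch, not the code in this PR: `ToyTurnDetector`, its method names, and the scenario helpers are hypothetical. Only the two flags mirror `_speaking` and `_user_turn_committed`. Resetting the flag after a commit, and clearing `speaking` on the STT end-of-speech event, are assumptions made to keep the model small.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTurnDetector:
    """Hypothetical stand-in for the turn-commit logic; not LiveKit code."""

    speaking: bool = False             # mirrors `_speaking`
    user_turn_committed: bool = False  # mirrors `_user_turn_committed`
    transcript: str = ""
    committed_turns: list = field(default_factory=list)

    def on_vad_start_of_speech(self) -> None:
        self.speaking = True

    def on_stt_end_of_speech(self) -> None:
        # End-of-turn is signalled for the segment even if no text exists yet.
        self.speaking = False  # assumption: end of speech clears the flag
        self.user_turn_committed = True
        self._try_commit()

    def on_final_transcript(self, text: str) -> None:
        self.transcript = text
        # Old gate: `not self.speaking` only. The extra clause accepts a late
        # final once end-of-turn was already signalled for the segment.
        if not self.speaking or self.user_turn_committed:
            self._try_commit()

    def _try_commit(self) -> None:
        if not self.transcript:
            return  # empty transcript: bail and wait for the final
        self.committed_turns.append(self.transcript)
        self.transcript = ""
        self.user_turn_committed = False  # assumption: reset per segment


def late_final_scenario() -> list:
    """Non-streaming order: END_OF_SPEECH first, final lands after VAD re-fires."""
    d = ToyTurnDetector()
    d.on_vad_start_of_speech()
    d.on_stt_end_of_speech()        # no transcript yet, commit bails
    d.on_vad_start_of_speech()      # VAD re-detects speech during recognize()
    d.on_final_transcript("hello")  # late final for the segment that ended
    return d.committed_turns


def final_while_speaking_scenario() -> list:
    """Final arrives mid-speech with no end-of-turn signalled."""
    d = ToyTurnDetector()
    d.on_vad_start_of_speech()
    d.on_final_transcript("hello")
    return d.committed_turns


def streaming_scenario() -> list:
    """Streaming order: final first, then END_OF_SPEECH commits it."""
    d = ToyTurnDetector()
    d.on_vad_start_of_speech()
    d.on_final_transcript("hello")
    d.on_stt_end_of_speech()
    return d.committed_turns
```

In this model the late-final scenario commits the turn, the final-while-speaking scenario does not, and the streaming scenario still commits through the end-of-speech path, which is the behaviour the two tests and the "streaming STT is unaffected" note describe.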