fix(voice): resume speech interrupted before its first frame (#1909)#1910
Open
enriqueespaillat-gyde wants to merge 3 commits into
Open
fix(voice): resume speech interrupted before its first frame (#1909)#1910enriqueespaillat-gyde wants to merge 3 commits into
enriqueespaillat-gyde wants to merge 3 commits into
Conversation
…#1909) Port of livekit/agents#5039 (Python issue #5038) to agents-js. When the agent is in the "thinking" state and the user makes a brief sound before the first TTS audio frame is forwarded, `onStartOfSpeech` pauses the not-yet-playing speech. That thinking-state pause is intentional and is preserved. The frames are still captured into the paused output buffer, but `forwardAudio`'s finally block rejected `firstFrameFut` (and removed its PLAYBACK_STARTED listener) whenever no frame had played yet. When a false interruption then cleared and the output resumed, the buffered first frame played but nothing was listening, so the future stayed rejected. Because the reply tasks gate transcript preservation on `firstFrameFut.done && !firstFrameFut.rejected`, the resumed turn was dropped from history even though audio reached the user. Fix: - Move the PLAYBACK_STARTED listener from `forwardAudio` to `performAudioForwarding` so it outlives the forwarding task; a late first frame (e.g. after a `resumeFalseInterruption` resume) can still resolve `firstFrameFut`. - Stop rejecting `firstFrameFut` in `forwardAudio`'s finally. - Settle the future in the reply tasks (say / pipeline / realtime) after playout finishes or is interrupted, which also removes the listener. JS note: unlike Python, the JS `Future` has no `cancel()` distinct from `reject()` (reject sets `rejected = true`), so the fix preserves audio on the no-first-frame path by relocating resolution rather than relying on a cancel/reject distinction in the downstream gate. Adds a regression test reproducing the thinking-state pause before the first frame for both a false interruption (resumes and plays, transcript preserved) and a genuine interruption after a resume (partial transcript kept, turn not lost). Both fail on main and pass with this change.
🦋 Changeset detectedLatest commit: 2e087f7 The changes in this PR will be included in the next version bump. This PR includes changesets to release 35 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Wrap the post-forwarding body of ttsTask in try/finally so settleFirstFrameFut runs even if an await throws between the performAudioForwarding call and the end of the method. Otherwise the PLAYBACK_STARTED listener registered in performAudioForwarding would leak on the shared audioOutput EventEmitter on the exception path. Mirrors the finally-block pattern already used in forwardSegment and processOneMessage.
…#1909) Use the bare livekit#1909 short form referenced once at the core fix points, matching existing comments (e.g. livekit#1662, livekit#1430, livekit#1124), instead of the verbose cross-repo 'livekit#1909 (port of livekit/agents#5039)' form and the per-call-site repetition. The port context lives in the commit/PR/changeset.
enriqueespaillat-gyde
left a comment
Contributor
Author
There was a problem hiding this comment.
r? @toubatbrian - please let me know if I should stop tagging you and if theres another process to follow :) There's been an uptick in PRs past few months so just want to make sure you see this one. Its hit us a few times in production.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Port of livekit/agents#5039 (Python issue livekit/agents#5038) to
agents-js.When the agent is in the "thinking" state and the user makes a brief sound before the first TTS audio frame is forwarded, the speech is silently dropped: no audio reaches the user and the turn is dropped from conversation history. This happens with
resumeFalseInterruption: true(the default).Closes #1909.
Root cause
Two behaviors combine in
agents/src/voice/:agent_activity.ts→onStartOfSpeechpauses the current speech when the agent is not yet"speaking"(thinking state). This pause is intentional — it stops the agent talking over a user who barges in during the thinking state — and is preserved by this PR.generation.ts→forwardAudioregistered itsPLAYBACK_STARTEDlistener inside the forwarding task and, in itsfinally, rejectedfirstFrameFutwhenever no frame had played. The thinking-state pause buffers the (short) TTS frames; the forwarding task then finishes before playback starts →firstFrameFutis rejected and the listener removed. When the false interruption clears and the output resumes, the buffered first frame plays but nothing is listening, so the future stays rejected. The reply tasks gate transcript preservation onfirstFrameFut.done && !firstFrameFut.rejected, so the resumed turn is blanked from history even though audio reached the user.Fix
Mirrors the spirit of #5039 — make the pre-first-frame pause recoverable while keeping the thinking-state pause:
PLAYBACK_STARTEDlistener fromforwardAudiotoperformAudioForwardingso it outlives the forwarding task; a late first frame (e.g. after aresumeFalseInterruptionresume) can still resolvefirstFrameFut.firstFrameFutinforwardAudio'sfinally.JS-specific note
Unlike Python, the JS
Future(agents/src/utils.ts) has nocancel()distinct fromreject()—reject()setsrejected = true. So a literal "cancel instead of reject" transcription of #5039 would not change the downstream!firstFrameFut.rejectedgate. This fix preserves the audio/transcript on the no-first-frame path by relocating resolution so the late first frame resolves the future, rather than relying on a cancel/reject distinction.Tests
New regression test
generation_interrupt_before_first_frame.test.tsreproduces the thinking-state pause before the first frame with a pausable mock output:firstFrameFutresolves, and the synchronized transcript would be preserved.Both cases fail on
mainand pass with this change. The existinggeneration_tts_timeout.test.ts"ignores PLAYBACK_STARTED from another segment" assertion is updated to the new contract (forwardAudiono longer rejects the future).agentstest suite: green (0 failed).pnpm build, ESLint, Prettier: green.