fix(voice): resume speech interrupted before its first frame (#1909) #1910

Open
enriqueespaillat-gyde wants to merge 3 commits into livekit:main from enriqueespaillat-gyde:fix/interrupt-before-first-frame

@enriqueespaillat-gyde

Contributor

Summary

Port of livekit/agents#5039 (Python issue livekit/agents#5038) to agents-js.

When the agent is in the "thinking" state and the user makes a brief sound before the first TTS audio frame is forwarded, the speech is silently dropped: no audio reaches the user and the turn is dropped from conversation history. This happens with resumeFalseInterruption: true (the default).

Closes #1909.

Root cause

Two behaviors combine in agents/src/voice/:

  1. agent_activity.ts: onStartOfSpeech pauses the current speech when the agent is not yet "speaking" (thinking state). This pause is intentional (it stops the agent talking over a user who barges in during the thinking state) and is preserved by this PR.
  2. generation.ts: forwardAudio registered its PLAYBACK_STARTED listener inside the forwarding task and, in its finally block, rejected firstFrameFut whenever no frame had played. The thinking-state pause buffers the (short) TTS frames, so the forwarding task finishes before playback starts; firstFrameFut is then rejected and the listener removed. When the false interruption clears and the output resumes, the buffered first frame plays but nothing is listening, so the future stays rejected. The reply tasks gate transcript preservation on firstFrameFut.done && !firstFrameFut.rejected, so the resumed turn is blanked from history even though audio reached the user.
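
The sequence above can be modelled in a few lines. This is a minimal sketch, not the code in generation.ts: `SketchFuture`, `PausableOutput`, `forwardAudioOld`, and the `'playback_started'` event name are hypothetical stand-ins, and only the `done`/`rejected` flags described in this PR are modelled.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical stand-in for the Future in agents/src/utils.ts (flags only).
class SketchFuture {
  done = false;
  rejected = false;
  resolve(): void {
    if (!this.done) this.done = true;
  }
  reject(): void {
    if (this.done) return;
    this.done = true;
    this.rejected = true;
  }
}

// Hypothetical pausable output: frames captured while paused are buffered and
// only "play" (emitting the placeholder event) once resume() is called.
class PausableOutput extends EventEmitter {
  private paused = false;
  private buffered = 0;
  private started = false;
  pause(): void {
    this.paused = true;
  }
  resume(): void {
    this.paused = false;
    this.flush();
  }
  captureFrame(): void {
    this.buffered++;
    if (!this.paused) this.flush();
  }
  private flush(): void {
    if (this.buffered > 0 && !this.started) {
      this.started = true;
      this.emit('playback_started');
    }
    this.buffered = 0;
  }
}

// Model of the old behaviour: the listener lives inside the forwarding task,
// and the finally block rejects the future if no frame has played yet.
function forwardAudioOld(frames: number, out: PausableOutput, fut: SketchFuture): void {
  const onStarted = (): void => fut.resolve();
  out.on('playback_started', onStarted);
  try {
    for (let i = 0; i < frames; i++) out.captureFrame();
  } finally {
    out.off('playback_started', onStarted);
    if (!fut.done) fut.reject();
  }
}

const oldOutput = new PausableOutput();
const oldFut = new SketchFuture();
oldOutput.pause(); // thinking-state pause
forwardAudioOld(3, oldOutput, oldFut); // short TTS finishes while paused
oldOutput.resume(); // false interruption clears; buffered audio plays unheard by the future
const oldGatePasses = oldFut.done && !oldFut.rejected;
console.log(oldGatePasses); // false
```

In this model the audio does play after `resume()`, but the gate evaluates to false, which corresponds to the turn being blanked from history.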

Fix

Mirrors the spirit of #5039: make the pre-first-frame pause recoverable while keeping the thinking-state pause.

  • Move the PLAYBACK_STARTED listener from forwardAudio to performAudioForwarding so it outlives the forwarding task; a late first frame (e.g. after a resumeFalseInterruption resume) can still resolve firstFrameFut.
  • Stop rejecting firstFrameFut in forwardAudio's finally.
  • Settle the future in the reply tasks (say / pipeline / realtime) after playout finishes or is interrupted, which also removes the listener.
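
The three changes can be sketched as follows. This is an illustration under assumptions, not the PR's diff: `FirstFrameFuture`, `BufferingOutput`, `performAudioForwardingSketch`, the returned `settle` helper, and the `'playback_started'` event name are all hypothetical, and the forwarding task is reduced to a synchronous loop.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical stand-in for the Future in agents/src/utils.ts (flags only).
class FirstFrameFuture {
  done = false;
  rejected = false;
  resolve(): void {
    if (!this.done) this.done = true;
  }
  reject(): void {
    if (this.done) return;
    this.done = true;
    this.rejected = true;
  }
}

// Hypothetical output that buffers frames while paused.
class BufferingOutput extends EventEmitter {
  private paused = false;
  private buffered = 0;
  private started = false;
  pause(): void {
    this.paused = true;
  }
  resume(): void {
    this.paused = false;
    this.flush();
  }
  captureFrame(): void {
    this.buffered++;
    if (!this.paused) this.flush();
  }
  private flush(): void {
    if (this.buffered > 0 && !this.started) {
      this.started = true;
      this.emit('playback_started');
    }
    this.buffered = 0;
  }
}

// The listener is registered outside the forwarding loop, so it outlives it.
// The loop no longer rejects the future; the caller settles it later.
function performAudioForwardingSketch(frames: number, out: BufferingOutput) {
  const firstFrameFut = new FirstFrameFuture();
  const onStarted = (): void => firstFrameFut.resolve();
  out.on('playback_started', onStarted);

  for (let i = 0; i < frames; i++) out.captureFrame(); // forwarding "task"

  const settle = (): void => {
    out.off('playback_started', onStarted); // settling also removes the listener
    if (!firstFrameFut.done) firstFrameFut.reject();
  };
  return { firstFrameFut, settle };
}

const fixedOutput = new BufferingOutput();
fixedOutput.pause(); // thinking-state pause is still applied
const { firstFrameFut, settle } = performAudioForwardingSketch(3, fixedOutput);
const pendingWhilePaused = !firstFrameFut.done; // forwarding ended, future still pending
fixedOutput.resume(); // late first frame resolves the future
settle(); // reply task settles after playout finishes or is interrupted
const fixedGatePasses = firstFrameFut.done && !firstFrameFut.rejected;
console.log(pendingWhilePaused, fixedGatePasses); // true true
```

If `settle()` runs without any frame having played (a genuine interruption before the first frame), the future is rejected at that point, so the gate still fails for speech that never reached the user.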

JS-specific note

Unlike Python, the JS Future (agents/src/utils.ts) has no cancel() distinct from reject(): reject() sets rejected = true. So a literal "cancel instead of reject" transcription of #5039 would not change the downstream !firstFrameFut.rejected gate. This fix preserves the audio/transcript on the no-first-frame path by relocating resolution so the late first frame resolves the future, rather than relying on a cancel/reject distinction.
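
A small sketch of why the literal port would not help. `TwoStateFuture` and `keepsTranscript` are hypothetical names, and the class models only the `done`/`rejected` flags mentioned above rather than the real implementation in agents/src/utils.ts.

```typescript
// Hypothetical stand-in: a future whose only failure state is "rejected".
class TwoStateFuture {
  done = false;
  rejected = false;
  resolve(): void {
    if (!this.done) this.done = true;
  }
  reject(): void {
    if (this.done) return;
    this.done = true;
    this.rejected = true; // there is no separate "cancelled" state to set
  }
}

// The gate quoted in this PR.
function keepsTranscript(fut: TwoStateFuture): boolean {
  return fut.done && !fut.rejected;
}

const played = new TwoStateFuture();
played.resolve(); // first frame played

// A literal port of Python's cancel() would have to call reject() here,
// so the gate cannot tell it apart from a real failure.
const cancelled = new TwoStateFuture();
cancelled.reject();

const failed = new TwoStateFuture();
failed.reject();

const gateResults = {
  played: keepsTranscript(played),
  cancelled: keepsTranscript(cancelled),
  failed: keepsTranscript(failed),
};
console.log(gateResults); // { played: true, cancelled: false, failed: false }
```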

Tests

New regression test generation_interrupt_before_first_frame.test.ts reproduces the thinking-state pause before the first frame with a pausable mock output:

  • False interruption → speech resumes, frames are forwarded, firstFrameFut resolves, and the synchronized transcript would be preserved.
  • Genuine interruption after a resume → the partial synchronized transcript is kept (turn not lost from history).

Both cases fail on main and pass with this change. The existing generation_tts_timeout.test.ts "ignores PLAYBACK_STARTED from another segment" assertion is updated to the new contract (forwardAudio no longer rejects the future).

  • Full agents test suite: green (0 failed).
  • pnpm build, ESLint, Prettier: green.

…#1909)

Port of livekit/agents#5039 (Python issue #5038) to agents-js.

When the agent is in the "thinking" state and the user makes a brief
sound before the first TTS audio frame is forwarded, `onStartOfSpeech`
pauses the not-yet-playing speech. That thinking-state pause is
intentional and is preserved. The frames are still captured into the
paused output buffer, but `forwardAudio`'s finally block rejected
`firstFrameFut` (and removed its PLAYBACK_STARTED listener) whenever no
frame had played yet. When a false interruption then cleared and the
output resumed, the buffered first frame played but nothing was
listening, so the future stayed rejected. Because the reply tasks gate
transcript preservation on `firstFrameFut.done && !firstFrameFut.rejected`,
the resumed turn was dropped from history even though audio reached the
user.

Fix:
- Move the PLAYBACK_STARTED listener from `forwardAudio` to
  `performAudioForwarding` so it outlives the forwarding task; a late
  first frame (e.g. after a `resumeFalseInterruption` resume) can still
  resolve `firstFrameFut`.
- Stop rejecting `firstFrameFut` in `forwardAudio`'s finally.
- Settle the future in the reply tasks (say / pipeline / realtime) after
  playout finishes or is interrupted, which also removes the listener.

JS note: unlike Python, the JS `Future` has no `cancel()` distinct from
`reject()` (reject sets `rejected = true`), so the fix preserves audio on
the no-first-frame path by relocating resolution rather than relying on a
cancel/reject distinction in the downstream gate.

Adds a regression test reproducing the thinking-state pause before the
first frame for both a false interruption (resumes and plays, transcript
preserved) and a genuine interruption after a resume (partial transcript
kept, turn not lost). Both fail on main and pass with this change.
@changeset-bot

changeset-bot Bot commented Jun 29, 2026

🦋 Changeset detected

Latest commit: 2e087f7

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 35 packages

| Name | Type |
| --- | --- |
| @livekit/agents | Patch |
| @livekit/agents-plugin-anam | Patch |
| @livekit/agents-plugin-assemblyai | Patch |
| @livekit/agents-plugin-baseten | Patch |
| @livekit/agents-plugin-bey | Patch |
| @livekit/agents-plugin-cartesia | Patch |
| @livekit/agents-plugin-cerebras | Patch |
| @livekit/agents-plugin-deepgram | Patch |
| @livekit/agents-plugin-did | Patch |
| @livekit/agents-plugin-elevenlabs | Patch |
| @livekit/agents-plugin-fishaudio | Patch |
| @livekit/agents-plugin-google | Patch |
| @livekit/agents-plugin-hedra | Patch |
| @livekit/agents-plugin-hume | Patch |
| @livekit/agents-plugin-inworld | Patch |
| @livekit/agents-plugin-lemonslice | Patch |
| @livekit/agents-plugin-liveavatar | Patch |
| @livekit/agents-plugin-livekit | Patch |
| @livekit/agents-plugin-minimax | Patch |
| @livekit/agents-plugin-mistral | Patch |
| @livekit/agents-plugin-mistralai | Patch |
| @livekit/agents-plugin-neuphonic | Patch |
| @livekit/agents-plugin-openai | Patch |
| @livekit/agents-plugin-perplexity | Patch |
| @livekit/agents-plugin-phonic | Patch |
| @livekit/agents-plugin-resemble | Patch |
| @livekit/agents-plugin-rime | Patch |
| @livekit/agents-plugin-runway | Patch |
| @livekit/agents-plugin-sarvam | Patch |
| @livekit/agents-plugin-silero | Patch |
| @livekit/agents-plugin-soniox | Patch |
| @livekit/agents-plugin-tavus | Patch |
| @livekit/agents-plugins-test | Patch |
| @livekit/agents-plugin-trugen | Patch |
| @livekit/agents-plugin-xai | Patch |

devin-ai-integration[bot]

This comment was marked as resolved.

Wrap the post-forwarding body of ttsTask in try/finally so
settleFirstFrameFut runs even if an await throws between the
performAudioForwarding call and the end of the method. Otherwise the
PLAYBACK_STARTED listener registered in performAudioForwarding would leak
on the shared audioOutput EventEmitter on the exception path. Mirrors the
finally-block pattern already used in forwardSegment and processOneMessage.
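
The cleanup pattern described in this commit can be sketched as below. Everything here is a hypothetical stand-in: `sharedOutput`, `ttsTaskSketch`, the `'playback_started'` event name, and the simulated failure are illustrative only, and the awaits of the real async method are omitted so the example stays synchronous.

```typescript
import { EventEmitter } from 'node:events';

const PLAYBACK_STARTED = 'playback_started'; // placeholder event name
const sharedOutput = new EventEmitter(); // stand-in for the shared audioOutput

function ttsTaskSketch(failAfterForwarding: boolean): void {
  // Stand-in for the listener that performAudioForwarding registers.
  const onStarted = (): void => {};
  sharedOutput.on(PLAYBACK_STARTED, onStarted);
  const settleFirstFrameFut = (): void => {
    sharedOutput.off(PLAYBACK_STARTED, onStarted);
  };

  try {
    // Post-forwarding body: anything here may throw.
    if (failAfterForwarding) throw new Error('simulated failure after forwarding');
  } finally {
    settleFirstFrameFut(); // runs on both the normal and the exception path
  }
}

let errorPropagated = false;
try {
  ttsTaskSketch(true);
} catch {
  errorPropagated = true; // the error still reaches the caller
}
const leakedListeners = sharedOutput.listenerCount(PLAYBACK_STARTED);
console.log(errorPropagated, leakedListeners); // true 0
```

Without the finally block, the same call would leave one listener attached to the shared emitter each time the body threw.
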
…#1909)

Use the bare livekit#1909 short form referenced once at the core fix points,
matching existing comments (e.g. livekit#1662, livekit#1430, livekit#1124), instead of the
verbose cross-repo 'livekit#1909 (port of livekit/agents#5039)'
form and the per-call-site repetition. The port context lives in the
commit/PR/changeset.

@enriqueespaillat-gyde enriqueespaillat-gyde left a comment

Contributor Author

r? @toubatbrian - please let me know if I should stop tagging you and if there's another process to follow :) There's been an uptick in PRs over the past few months, so I just want to make sure you see this one. It's hit us a few times in production.

Development

Successfully merging this pull request may close these issues.

Agent speech silently dropped when interrupted before the first audio frame (resumeFalseInterruption) — port of Python #5039