
Fix stale playout state across multi-step voice turns #5533

Open
jayeshp19 wants to merge 3 commits into livekit:main from jayeshp19:Fix-stale-playout-state-across-multi-step-voice-turns

Conversation


jayeshp19 commented Apr 23, 2026

This PR fixes a bug where paused-speech state could become stale across generation steps within a single SpeechHandle. The key changes are:

  1. New current_generation_playout_active flag on SpeechHandle: tracks whether the current generation's audio is actively playing, and is reset on each step advance.
  2. New generation_step field on _PausedSpeechInfo: ensures paused speech is only applied or resumed if it matches the current generation step.
  3. Centralized _on_generation_playout_started / _on_generation_playout_finished: extracts duplicated playout lifecycle logic (state updates, audio recognition notifications, interruption toggling) into two reusable methods.

The audio-interruption condition now also gates on current_generation_playout_active, preventing interruptions when the agent isn't actually playing audio for the current generation.
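As an illustration of the gating described above, here is a minimal sketch. `SpeechHandleSketch`, `advance_step`, and `should_interrupt_audio` are hypothetical stand-ins written for this note, not the actual livekit-agents classes or methods; only the flag name `current_generation_playout_active` comes from the PR description.

```python
from dataclasses import dataclass


@dataclass
class SpeechHandleSketch:
    """Hypothetical stand-in for SpeechHandle (not the real class)."""

    generation_step: int = 0
    current_generation_playout_active: bool = False

    def advance_step(self) -> None:
        # Moving to the next generation step resets the playout flag,
        # so state from the previous step cannot leak forward.
        self.generation_step += 1
        self.current_generation_playout_active = False


def should_interrupt_audio(handle: SpeechHandleSketch, user_speaking: bool) -> bool:
    # Assumed gate: interrupt only when audio for the *current*
    # generation is actually playing.
    return user_speaking and handle.current_generation_playout_active


handle = SpeechHandleSketch()
silent_step = should_interrupt_audio(handle, user_speaking=True)

handle.current_generation_playout_active = True
during_playout = should_interrupt_audio(handle, user_speaking=True)

handle.advance_step()
after_advance = should_interrupt_audio(handle, user_speaking=True)

print(silent_step, during_playout, after_advance)
```

In this sketch, user speech during a silent step or right after a step advance does not trigger an audio interruption; it only does so while the flag is set.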


devin-ai-integration (bot) left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.



longcw commented Apr 27, 2026

> paused-speech state could become stale across generation steps within a single SpeechHandle

Do you mean that when a tool reply is created after a speech was paused, paused_speech is still treated as the same entry for the new tool reply? Can you share an example that demonstrates the issue this caused?

jayeshp19 force-pushed the Fix-stale-playout-state-across-multi-step-voice-turns branch from 8c7fa33 to 741c6cf on April 28, 2026 18:50
@jayeshp19

> > paused-speech state could become stale across generation steps within a single SpeechHandle
>
> Do you mean that when a tool reply is created after a speech was paused, paused_speech is still treated as the same entry for the new tool reply? Can you share an example that demonstrates the issue this caused?

Yes, that is the case.

A specific interleaving example is:

  1. User asks “what’s the weather in Tokyo?”
  2. The model emits only a function call, with no spoken preamble/audio.
  3. While that silent tool-call generation is still active, user speech/noise starts, so false-interruption handling records _paused_speech for the current SpeechHandle with agent_state="thinking".
  4. The same SpeechHandle advances to the tool-reply generation.
  5. The tool reply starts playing, and pause handling runs again while the user is still speaking.

Before this change, _PausedSpeechInfo was keyed only by SpeechHandle, so _update_paused_speech() treated the later tool-reply pause as the same paused speech and only updated the timeout. It kept the metadata captured during the earlier silent/tool-call step.

The visible effect in the regression test is that false-interruption resume transitions the agent from listening back to thinking while the tool-reply audio resumes, instead of resuming to speaking.

With generation-step tracking, the tool-reply pause refreshes the paused state for the current generation, so resume restores the correct state.
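The interleaving above can be reproduced in a small model. This is a sketch under stated assumptions: `PausedSpeechSketch` and `update_paused_speech` are hypothetical simplifications of `_PausedSpeechInfo` and `_update_paused_speech()`, and the `track_step` switch exists only to compare the two behaviours side by side.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PausedSpeechSketch:
    """Hypothetical simplification of _PausedSpeechInfo."""

    handle_id: str
    generation_step: int
    agent_state: str
    timeout_s: float


def update_paused_speech(
    current: Optional[PausedSpeechSketch],
    handle_id: str,
    step: int,
    agent_state: str,
    timeout_s: float,
    *,
    track_step: bool,
) -> PausedSpeechSketch:
    same_handle = current is not None and current.handle_id == handle_id
    same_step = same_handle and current.generation_step == step
    if same_handle and (same_step or not track_step):
        # Treated as the same paused speech: only the timeout is refreshed,
        # so the metadata captured earlier is kept.
        current.timeout_s = timeout_s
        return current
    # Otherwise capture fresh metadata for the current generation.
    return PausedSpeechSketch(handle_id, step, agent_state, timeout_s)


def resumed_state(track_step: bool) -> str:
    # Step 0: silent tool call, pause recorded while "thinking".
    paused = update_paused_speech(None, "h1", 0, "thinking", 2.0, track_step=track_step)
    # Step 1: tool reply is playing, pause handling runs again.
    paused = update_paused_speech(paused, "h1", 1, "speaking", 2.0, track_step=track_step)
    return paused.agent_state


print("keyed by handle only:  ", resumed_state(track_step=False))
print("keyed by handle + step:", resumed_state(track_step=True))
```

Keyed by handle only, the state to resume to stays `thinking`; with the step included in the key, the second pause replaces the record and resume goes to `speaking`.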

I rebased the PR and added a regression test for this specific interleaving.

https://github.com/livekit/agents/pull/5533/changes#diff-8fc06608b976184d49adbab8eccbd212aa7252e3e17cc5d9ec34dced8365f217R833


longcw commented Apr 29, 2026

@jayeshp19 thanks for the details! Yeah, I see the issue. It could be fixed more simply by resetting the paused speech in _scheduling_task after _wait_for_generation returns for the current generation. I created an alternative PR in #5594.
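A rough sketch of that simpler approach, for illustration only. `SchedulerSketch` and `run_steps` are hypothetical names, and the body of `_wait_for_generation` here merely simulates a pause being recorded mid-generation; the actual change is in #5594 and may differ.

```python
import asyncio
from typing import Optional


class SchedulerSketch:
    """Hypothetical model of a scheduling loop; not the real _scheduling_task."""

    def __init__(self) -> None:
        self.paused_speech: Optional[dict] = None

    async def _wait_for_generation(self, step: int) -> None:
        # Simulate a false-interruption pause recorded while this
        # generation is in flight.
        await asyncio.sleep(0)
        self.paused_speech = {"step": step, "agent_state": "thinking"}

    async def run_steps(self, num_steps: int, *, reset: bool) -> list:
        seen_at_start = []
        for step in range(num_steps):
            # Record what paused state this step inherits.
            seen_at_start.append(self.paused_speech)
            await self._wait_for_generation(step)
            if reset:
                # Drop any pause recorded for the generation that just ended.
                self.paused_speech = None
        return seen_at_start


without_reset = asyncio.run(SchedulerSketch().run_steps(2, reset=False))
with_reset = asyncio.run(SchedulerSketch().run_steps(2, reset=True))
print(without_reset)
print(with_reset)
```

Without the reset, the second step starts with the pause recorded during the first; with it, every step starts clean.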


jayeshp19 commented Apr 29, 2026

> @jayeshp19 thanks for the details! Yeah, I see the issue. It could be fixed more simply by resetting the paused speech in _scheduling_task after _wait_for_generation returns for the current generation. I created an alternative PR in #5594.

Thanks, it makes sense for the bug I was trying to fix.

One small note: I also tested an edge case where the false-interruption timer fires before the silent tool-call generation finishes. In that case the runtime can still do a thinking -> listening -> thinking false resume during the silent step itself. That seems like a separate edge case around pausing while no audio is actually playing, and my current PR does not fully solve it either, because the pause-on-thinking path can still record _paused_speech.
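The transition sequence in that edge case can be traced with a toy function. `false_interruption_trace` is a hypothetical helper written for this note; it only models the order of state changes, not the real timer or pause logic.

```python
def false_interruption_trace(agent_state: str) -> list[str]:
    """Toy model of one pause followed by a false-interruption resume."""
    trace = [agent_state]
    # Pause handling records the agent state at pause time.
    paused_state = agent_state
    # User speech or noise moves the agent to listening.
    trace.append("listening")
    # The false-interruption timer fires: resume restores the recorded state.
    trace.append(paused_state)
    return trace


silent_step_trace = false_interruption_trace("thinking")
print(" -> ".join(silent_step_trace))
```

Because the pause was recorded while the agent was still `thinking`, the resume returns to `thinking` even though no audio was ever paused.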

So I’m good with #5594 superseding this PR for the cross-step stale-state bug. I can close this PR once #5594 lands.
