Skip to content

fix(agents): persist _speech_start_time across intra-turn VAD bursts#5585

Open
AlessandroElyos wants to merge 5 commits intolivekit:mainfrom
AlessandroElyos:alessandro/speech-start-time-persistence
Open

fix(agents): persist _speech_start_time across intra-turn VAD bursts#5585
AlessandroElyos wants to merge 5 commits intolivekit:mainfrom
AlessandroElyos:alessandro/speech-start-time-persistence

Conversation

@AlessandroElyos
Copy link
Copy Markdown
Contributor

Within a single user turn, VAD can fire multiple START_OF_SPEECH/END_OF_SPEECH cycles separated by short silences (e.g. the user says "Hello." then pauses briefly before continuing). End-of-turn detection is decoupled from VAD — a turn is only considered ended once _bounce_eou_task runs and clears the per-turn state.

_vad_speech_started was being reset to False in the END_OF_SPEECH branch of _on_vad_event. The next START_OF_SPEECH within the same turn would therefore overwrite _speech_start_time with the latest burst's start, losing the original turn-start timestamp. This propagates to started_speaking_at on the EOT metrics report, which downstream consumers (recording-to-transcript alignment, analytics) rely on to know when the user actually began speaking.

The fix removes the _vad_speech_started = False reset from the END_OF_SPEECH branch. The flag is now cleared only by the EOT cleanup in _bounce_eou_task — which is already the lifecycle owner of _speech_start_time — so both fields are governed symmetrically by turn boundaries rather than burst boundaries.

_vad_speech_started is read at exactly one site (the SOS guard) and is also defensively reset in the VAD task finally block on handoff/teardown, so removing the EOS reset has no other side effects.

Adds tests/test_speech_start_time_persistence.py with three cases following the existing test_audio_recognition_aclose.py style:

  • test_first_sos_sets_speech_start_time — sanity check.
  • test_speech_start_time_persists_across_intra_turn_bursts — the bug reproducer; fails before the fix, passes after.
  • test_eos_does_not_clear_vad_speech_started — encodes the new invariant.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

Open in Devin Review

@AlessandroElyos
Copy link
Copy Markdown
Contributor Author

AlessandroElyos commented Apr 28, 2026

The failing test_interruption_before_speaking[True-3.5] looks like a pre-existing flake unrelated to this PR — it's a timing-sensitive assertion (5× speed factor, 0.75s tolerance on check_timestamp). I ran it 3× locally with the fix applied and it passed each time. Happy to fix it if it can help

with trace.use_span(self._ensure_user_turn_span()):
self._hooks.on_end_of_speech(ev)

self._vad_speech_started = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes the _speech_start_time always the turn start time, may break the tracing that we create user_speaking spans using this timestamp as the start of the span (ref).

using the start time of the user_turn_span could be a better solution if the goal is to make started_speaking_at in the EOT info the first time user started speaking. we already have _ensure_user_turn_span that create the span only when there is no active user turn span, and the span is ended only after EOT committed.

the OTEL span is designed for write-only, so maybe add a _user_turn_start alongside the user_turn_span.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the feedback @longcw.

I reverted the change on _vad_speech_started.

I also added a _user_turn_start variable as you suggested that should reflect the actual turn boundaries.

Let me know your thoughts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants