fix(agents): persist _speech_start_time across intra-turn VAD bursts #5585
AlessandroElyos wants to merge 5 commits into livekit:main
The failing
    with trace.use_span(self._ensure_user_turn_span()):
        self._hooks.on_end_of_speech(ev)

    self._vad_speech_started = False
This makes `_speech_start_time` always the turn start time, which may break tracing: we create `user_speaking` spans using this timestamp as the start of the span (ref).
Using the start time of the `user_turn_span` could be a better solution if the goal is to make `started_speaking_at` in the EOT info the first time the user started speaking. We already have `_ensure_user_turn_span`, which creates the span only when there is no active user turn span, and the span is ended only after the EOT is committed.
OTEL spans are designed to be write-only, so maybe add a `_user_turn_start` alongside the `user_turn_span`.
Thanks for the feedback @longcw.
I reverted the change to `_vad_speech_started`.
As you suggested, I also added a `_user_turn_start` variable, which should reflect the actual turn boundaries.
Let me know your thoughts.
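For readers following the thread, the idea of keeping a per-turn timestamp next to the per-burst one can be sketched roughly as below. This is a minimal toy model, not the code in the PR: the class `TurnTimestamps` and its method names are hypothetical, and only the field names `speech_start_time` and `user_turn_start` mirror identifiers mentioned in the comments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnTimestamps:
    """Toy model of the two timestamps discussed above (hypothetical class)."""

    speech_start_time: Optional[float] = None  # start of the latest VAD burst
    user_turn_start: Optional[float] = None    # start of the first burst in the turn

    def on_start_of_speech(self, now: float) -> None:
        # Per-burst timestamp: overwritten on every START_OF_SPEECH,
        # so spans that start at the burst boundary keep working.
        self.speech_start_time = now
        # Per-turn timestamp: set once and kept until the turn is committed.
        if self.user_turn_start is None:
            self.user_turn_start = now

    def on_end_of_turn(self) -> Optional[float]:
        # Report when the user first started speaking, then clear per-turn state.
        started_speaking_at = self.user_turn_start
        self.speech_start_time = None
        self.user_turn_start = None
        return started_speaking_at
```

In this sketch the burst-level value stays available for tracing, while the value reported at end of turn comes from the turn-level field.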
Within a single user turn, VAD can fire multiple `START_OF_SPEECH`/`END_OF_SPEECH` cycles separated by short silences (e.g. the user says "Hello." then pauses briefly before continuing). End-of-turn detection is decoupled from VAD: a turn is only considered ended once `_bounce_eou_task` runs and clears the per-turn state.

`_vad_speech_started` was being reset to `False` in the `END_OF_SPEECH` branch of `_on_vad_event`. The next `START_OF_SPEECH` within the same turn would therefore overwrite `_speech_start_time` with the latest burst's start, losing the original turn-start timestamp. This propagates to `started_speaking_at` on the EOT metrics report, which downstream consumers (recording-to-transcript alignment, analytics) rely on to know when the user actually began speaking.

The fix removes the `_vad_speech_started = False` reset from the `END_OF_SPEECH` branch. The flag is now cleared only by the EOT cleanup in `_bounce_eou_task`, which is already the lifecycle owner of `_speech_start_time`, so both fields are governed symmetrically by turn boundaries rather than burst boundaries.

`_vad_speech_started` is read at exactly one site (the SOS guard) and is also defensively reset in the VAD task `finally` block on handoff/teardown, so removing the EOS reset has no other side effects.

Adds `tests/test_speech_start_time_persistence.py` with three cases following the existing `test_audio_recognition_aclose.py` style:

- `test_first_sos_sets_speech_start_time`: sanity check.
- `test_speech_start_time_persists_across_intra_turn_bursts`: the bug reproducer; fails before the fix, passes after.
- `test_eos_does_not_clear_vad_speech_started`: encodes the new invariant.
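The overwrite described above can be reproduced with a small standalone model. This is a sketch under stated assumptions, not the real recognizer: `BurstRecognizer`, `run`, the string event names, and the `reset_on_eos` switch are all hypothetical and exist only to contrast the old and new flag lifecycles.

```python
from typing import Optional


class BurstRecognizer:
    """Toy stand-in for the recognizer state; not the real livekit class."""

    def __init__(self, reset_on_eos: bool) -> None:
        self.reset_on_eos = reset_on_eos  # True models the pre-fix behavior
        self.vad_speech_started = False
        self.speech_start_time: Optional[float] = None

    def on_vad_event(self, kind: str, now: float) -> None:
        if kind == "START_OF_SPEECH":
            # SOS guard: only the first burst since the last reset sets the timestamp.
            if not self.vad_speech_started:
                self.vad_speech_started = True
                self.speech_start_time = now
        elif kind == "END_OF_SPEECH" and self.reset_on_eos:
            # Pre-fix behavior: clearing the flag here lets the next burst
            # overwrite the timestamp.
            self.vad_speech_started = False

    def end_of_turn(self) -> Optional[float]:
        # End-of-turn cleanup owns both fields.
        started = self.speech_start_time
        self.vad_speech_started = False
        self.speech_start_time = None
        return started


def run(reset_on_eos: bool) -> Optional[float]:
    # One turn made of two speech bursts separated by a short silence.
    recognizer = BurstRecognizer(reset_on_eos)
    events = [
        ("START_OF_SPEECH", 1.0),
        ("END_OF_SPEECH", 1.4),
        ("START_OF_SPEECH", 1.9),
        ("END_OF_SPEECH", 2.6),
    ]
    for kind, now in events:
        recognizer.on_vad_event(kind, now)
    return recognizer.end_of_turn()
```

With the flag reset on every `END_OF_SPEECH`, `run(True)` reports the second burst's start (1.9). With the reset left to end-of-turn cleanup, `run(False)` reports the first burst's start (1.0).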