feat(room-io): add json_format option for timed transcription output#5472
Conversation
Adds `json_format` to `TextOutputOptions` so the transcription stream on the `lk.transcription` topic emits each chunk as a JSON object with `text` and optional `start_time`/`end_time` fields when the chunk is a `TimedString`. This makes it easier for clients to consume TTS-aligned timed transcripts.
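A minimal sketch of what a consumer might see on the `lk.transcription` topic with `json_format` enabled, based on the description above (the exact payloads are illustrative, not taken from the implementation):

```python
import json

# Hypothetical stream contents: plain chunks carry only "text";
# TimedString chunks also carry "start_time"/"end_time" in seconds.
raw_stream = [
    '{"text": "hello ", "start_time": 0.0, "end_time": 0.42}\n',
    '{"text": "world"}\n',  # not a TimedString: no timing keys
]

for line in raw_stream:
    chunk = json.loads(line)
    # Timing fields are optional, so read them defensively.
    start = chunk.get("start_time")
    end = chunk.get("end_time")
    print(chunk["text"], start, end)
```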
chenghao-mou
left a comment
lgtm. one small question.
```python
ts_pb.confidence = text.confidence
if utils.is_given(text.start_time_offset):
    ts_pb.start_time_offset = text.start_time_offset
text = json.dumps(MessageToDict(ts_pb, preserving_proto_field_name=True)) + "\n"
```
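The `+ "\n"` above makes the stream newline-delimited JSON. A sketch of how the receiving side can split such a buffer back into objects (field names mirror the snippet; `parse_ndjson` is a hypothetical helper, not part of the SDK):

```python
import json

def parse_ndjson(buffer: str) -> list[dict]:
    """Split a newline-delimited JSON buffer into one dict per line."""
    return [json.loads(line) for line in buffer.splitlines() if line]

buffer = (
    '{"text": "one", "start_time_offset": 0.1}\n'
    '{"text": "two"}\n'
)
chunks = parse_ndjson(buffer)
# chunks[0]["start_time_offset"] → 0.1; chunks[1] carries only "text"
```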
should we use `always_print_fields_with_no_presence` so keys are always present?
perhaps not; if the text is not a `TimedString`, we may not want `start_time` or `end_time` to be included in the dict.
```diff
 stt=inference.STT("deepgram/nova-3"),
 llm=inference.LLM("google/gemini-2.5-flash"),
-tts=inference.TTS("cartesia/sonic-3"),
+tts=cartesia.TTS(),
```
does inference not support this? if not we should let the team know.
we do have these options to enable timestamps in TTS inference (added in #4949), but it seems no timestamps are returned when they are enabled. will forward to the team.
```python
self._out_ch.send_nowait(
    TimedString(word, end_time=time.time() - self._start_wall_time)
)
```
🔴 `TimedString` `end_time` does not subtract `_paused_duration`, producing incorrect timestamps after pause/resume
In `_main_task`, the newly added `TimedString` objects compute `end_time` as `time.time() - self._start_wall_time`, but fail to subtract `self._paused_duration`. The synchronization delay calculation on synchronizer.py:337 correctly uses `elapsed = time.time() - self._start_wall_time - self._paused_duration`, but the `end_time` written to the output `TimedString` at lines 332 and 367 omits this subtraction. When audio playback is paused and resumed (e.g., during barge-in via `_SyncedAudioOutput.pause()` at synchronizer.py:613-618), the reported `end_time` will be inflated by the total pause duration, producing incorrect timing data for downstream consumers like the JSON format transcription output.
```diff
-self._out_ch.send_nowait(
-    TimedString(word, end_time=time.time() - self._start_wall_time)
-)
+self._out_ch.send_nowait(
+    TimedString(word, end_time=time.time() - self._start_wall_time - self._paused_duration)
+)
```
this is intentional; we should include paused time in the timestamp from the synchronizer, since it's the actual sent time of the transcript.
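The two readings differ by exactly the paused time, which is the crux of the disagreement above: subtracting `_paused_duration` yields a position in the audio, while not subtracting it yields the wall-clock send time. A tiny demo with hypothetical numbers:

```python
# Simplified, hypothetical clock values; not the actual synchronizer code.
start_wall_time = 100.0   # wall clock when playback started
paused_duration = 2.5     # total time playback was paused (e.g. barge-in)
now = 110.0               # current wall clock

# What the synchronizer intentionally reports: when the word was sent.
sent_time = now - start_wall_time                      # 10.0 s

# What the suggested fix would report: position within the audio itself.
media_time = now - start_wall_time - paused_duration   # 7.5 s

assert sent_time - media_time == paused_duration
```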
🤖 This is an automated Claude Code routine created by @toubatbrian. Right now it is in an experimentation stage. This PR looks like a core runtime improvement. Generated by Claude Code
🤖 Port opened: livekit/agents-js#1305. Generated by Claude Code
Summary
- Add `json_format` field to `TextOutputOptions` for the room text output chain
- Each chunk on the `lk.transcription` datastream topic is a JSON object with `text`, and `start_time`/`end_time` if the chunk is a `TimedString`
- needs livekit/protocol#1502