fish audio support for expressive mode + runtime update_expressive() by cshape · Pull Request #6232 · livekit/agents

cshape · 2026-06-25T17:59:33Z

brings cale/expressive-fish current with tina/expressive-mode and adds fish audio support plus a runtime expressive setter.

heads up on size: the file count is large only because expressive-fish is ~73 commits behind expressive-mode, so most of the diff is that mainline drift. the actual new work is the handful of files below.

framework (provider-agnostic):

voice/agent.py: add Agent.update_expressive() so an agent can change its expressive setting mid-session, not just at construction. it just assigns _expressive; AgentActivity already re-resolves expressive options per generation, so the change lands on the next turn. symmetric with the existing update_instructions / update_tools setters.
tts/markup_utils.py: harden the shared markup conversion. more permissive expression/sound regexes that tolerate the malformations smaller llms emit (extra trailing attrs, the wrapping form, orphan closing tags) so raw xml never reaches the provider. adds convert_emphasis_to_fish and break-duration parsing.
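
A minimal sketch of why a plain assignment is enough for the setter, using stand-in classes rather than the real livekit ones. `SketchAgent`, `SketchActivity`, `resolve_expressive`, and the string preset values are hypothetical; only the `update_expressive` name and the "re-resolve per generation" behavior come from the description above.

```python
from __future__ import annotations

from typing import Optional


class SketchAgent:
    """Hypothetical stand-in for Agent; only the expressive setting is modeled."""

    def __init__(self, expressive: Optional[str] = None) -> None:
        self._expressive = expressive

    def update_expressive(self, expressive: Optional[str]) -> None:
        # Plain assignment: nothing is pushed to the running activity.
        self._expressive = expressive


class SketchActivity:
    """Hypothetical stand-in for AgentActivity."""

    def __init__(self, agent: SketchAgent) -> None:
        self._agent = agent

    def resolve_expressive(self) -> Optional[str]:
        # Read from the agent at generation time instead of caching the value
        # at construction, so a mid-session update is seen on the next turn.
        return self._agent._expressive

    def generate(self, text: str) -> str:
        return f"[{self.resolve_expressive()}] {text}"


agent = SketchAgent(expressive="customer_service")
activity = SketchActivity(agent)
first = activity.generate("hello")
agent.update_expressive("casual")
second = activity.generate("hello again")
```

Because the activity reads the value on every generation, `first` uses the original setting and `second` uses the updated one, with no extra notification step.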

fish provider:

tts/_provider_format.py: register fishaudio across llm_instructions / convert / strip, plus the casual and customer-service preset bodies.
voice/presets.py: register the fishaudio entries so presets.CASUAL / CUSTOMER_SERVICE resolve for fish.
plugins/fishaudio/tts.py: Markup._provider_key() returns fishaudio only for s2 models; ChunkedStream / SynthesizeStream convert markup to fish brackets before the request.
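
A rough sketch of the two provider-side behaviors described above: gating the provider key on the model, and stripping markup so it never shows up in the transcript. Everything here is an assumption for illustration: the prefix check for s2 models, the tag list, and the function names are not taken from the plugin.

```python
import re
from typing import Optional

# Assumption: the four markup tags mentioned in this PR's commit messages.
FISH_TAGS = ("expression", "sound", "break", "emphasis")

# Matches opening, closing, and self-closing forms of the tags above.
_TAG_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(FISH_TAGS))


def provider_key(model: str) -> Optional[str]:
    # Assumption: s2 models can be recognised by a name prefix.
    return "fishaudio" if model.startswith("s2") else None


def strip_for_transcript(text: str) -> str:
    # Drop the tags but keep the text they wrap, then tidy leftover spacing.
    without_tags = _TAG_RE.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", without_tags).strip()
```

With this shape, a non-s2 model resolves to no provider key, so no fish-specific conversion would be attempted for it.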

test plan:

update_expressive(presets.CASUAL) mid-session, confirm the next reply uses it.
run an s2 fish session in expressive mode, confirm tags convert to brackets in audio and are stripped from the transcript.

base instructions:
- add a natural-voice preamble (spoken, not written; use contractions; expand numbers/abbreviations; pacing via punctuation + the <break> tag; fillers and self-repairs are part of how real speech sounds)
- document intensity modifiers ("slightly sad", "very excited", etc.) as a free-form prefix on any emotion
- swap the example set to cover an intensity modifier, two tone markers (whispering, in a hurry tone), two advanced emotions (regretful, hopeful), and an emphasis demonstration
- drop the heavy "tag every clause" guidance in favor of "tag every sentence, retag when the feeling shifts, don't stack conflicting ones; reach for the specific emotion over the broadest basic"
- add <emphasis>WORD</emphasis> as the fourth tag; converter wires it through convert_emphasis_to_fish() to fish's native [emphasis] WORD marker, and the tag is added to _FISHAUDIO_TAGS so strip_markup removes it from transcripts

presets:
- customer service: drop the dense per-moment expression map; keep the de-escalation, enunciation, and stay-in-lane bullets
- healthcare: tighten the emotion-mapping bullet; add a non-verbal-sounds bullet that hard-suppresses laughing/yawning/snoring/crowd sounds
- conversational: rewrite the sound bullet so sighing is last and gated ("ONLY when truly commiserating") instead of the obvious default; split the "pace with punctuation" bullet into a dedicated <break> bullet (with a problem -> reassurance example) and a streamlined punctuation/emphasis bullet; add a dedicated contractions bullet with concrete pairs; bump disfluency target from "zero to two per turn" to "one to two per turn" and add self-repairs ("I, I think") and "for sure"/"a little" to the texture list
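
The emphasis mapping above (`<emphasis>WORD</emphasis>` to `[emphasis] WORD`) can be sketched as a single substitution. This is an illustration only; the regex and the function name `emphasis_to_fish_sketch` are assumptions, not the body of convert_emphasis_to_fish().

```python
import re

# Tolerates attributes on the opener and whitespace before the closing ">".
_EMPHASIS_RE = re.compile(r"<emphasis\b[^>]*>(.*?)</emphasis\s*>", re.DOTALL)


def emphasis_to_fish_sketch(text: str) -> str:
    # Replace the wrapping tag with a leading bracket marker, keeping the word.
    return _EMPHASIS_RE.sub(lambda m: f"[emphasis] {m.group(1).strip()}", text)
```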

smaller models occasionally emit two malformations the original regex couldn't handle. both leaked raw xml through to the tts provider:

<expression value="X" empathetic/> -- extra trailing attribute
<expression value="X" >content</expression> -- wrapping form, ws before >

changes in convert_expression_tags:
- _EXPRESSION_RE / _SOUND_RE swap the trailing `\s*` between value and close for `[^>]*?` (lazy, any non-`>`), so trailing attributes get ignored
- the wrapping branch now captures inner content under `re.DOTALL` so the substitution emits `[X]content` instead of leaving the tags around it
- add _ORPHAN_CLOSE_RE that strips dangling `</expression>` / `</sound>` after conversion, which normalize_markup creates when it rewrites a wrapping opener to self-closing form

verified against the two real-world malformations plus the existing self-closing and well-formed wrapping cases. shared with inworld and elevenlabs v3 since they route through the same convert_expression_tags.
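
The three regex changes above can be sketched as follows. The patterns are a reconstruction from the description, not the ones in markup_utils.py, and `convert_tags_sketch` and `_tag_re` are hypothetical names.

```python
from __future__ import annotations

import re


def _tag_re(name: str) -> re.Pattern[str]:
    # value="...", then lazily skip any trailing attributes ([^>]*?), then
    # either a self-closing "/>" or a wrapping ">content</name>".
    return re.compile(
        rf'<{name}\s+value="([^"]*)"[^>]*?(?:/>|>(.*?)</{name}\s*>)',
        re.DOTALL,  # let wrapped content span newlines
    )


_EXPRESSION_RE = _tag_re("expression")
_SOUND_RE = _tag_re("sound")
# Closing tags left behind once their opener was converted as self-closing.
_ORPHAN_CLOSE_RE = re.compile(r"</(?:expression|sound)\s*>")


def convert_tags_sketch(text: str) -> str:
    def to_bracket(m: re.Match[str]) -> str:
        inner = m.group(2) or ""  # group 2 is None for the self-closing form
        return f"[{m.group(1)}]{inner}"

    text = _EXPRESSION_RE.sub(to_bracket, text)
    text = _SOUND_RE.sub(to_bracket, text)
    return _ORPHAN_CLOSE_RE.sub("", text)
```

In this sketch, both malformations and the orphan-close case reduce to the same `[X]content` output, so no raw xml is left for the provider.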

Fold the sensitivity/softening guidance from the healthcare preset into the customer service preset and merge HEALTHCARE into CUSTOMER_SERVICE in the registry so presets.HEALTHCARE falls back to the agnostic default for Fish Audio.

Rename across the Preset enum, the public presets.CASUAL constant, the per-provider registry, and the _<provider>_CASUAL preset bodies. Example agents updated to use presets.CASUAL.

fish audio: refine customer service preset, drop healthcare

…e/expressive-fish

# Conflicts:
#	examples/inference/agent.py
#	examples/survey/agent.py
#	livekit-agents/livekit/agents/tts/_provider_format.py
#	livekit-agents/livekit/agents/voice/presets.py

gpt-5.4-class models sat at the low end of the casual guidance (0 sounds, ~1 filler/turn). Make casual lean in: most turns should carry a non-verbal sound and 2-3 disfluencies (fillers, false starts, self-repairs, light stutters), since that texture is what sells a real casual voice.

gpt-5.4-class models collapse casual onto happy/curious expressions and chuckling every turn, which reads repetitive. Call those out as lazy defaults and push for the full expression range + sound variety (no chuckling on consecutive turns).

…-corrections) Models still produced clean, polished casual sentences. Frame casual as UNSCRIPTED in the persona line and rewrite the disfluency guideline to demand a real hesitation or self-correction every turn, calling polished first-try sentences the failure mode.

…400) Gemma 4 controls reasoning via thinking_level (only 'minimal'/'high' valid), not thinking_budget — the plugin routed it into the gemini-2.5 budget branch, so thinking_budget=0 returned 400 and thinking_level was rejected. Detect gemma-4 and route it through the thinking_level path, defaulting to 'minimal' (off). With thinking off Gemma 4 31B drops from ~20s to ~1.3s and stops leaking its planning into the response.
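
A sketch of the routing this commit describes (the commit itself is reverted further down). The function name, the plain-dict return shape, and the example model strings are assumptions for illustration; only the `thinking_level` / `thinking_budget` split, the two valid levels, and the 'minimal' default come from the message above.

```python
from typing import Any, Optional

# Per the commit message, these are the only levels Gemma 4 accepts.
GEMMA4_LEVELS = ("minimal", "high")


def thinking_params_sketch(
    model: str,
    *,
    thinking_budget: Optional[int] = None,
    thinking_level: Optional[str] = None,
) -> dict[str, Any]:
    if "gemma-4" in model:
        # Gemma 4: never send a budget; default to thinking off ("minimal").
        level = thinking_level or "minimal"
        if level not in GEMMA4_LEVELS:
            raise ValueError(f"unsupported thinking_level for gemma-4: {level!r}")
        return {"thinking_level": level}

    # Other models in this sketch keep the budget-based path.
    params: dict[str, Any] = {}
    if thinking_budget is not None:
        params["thinking_budget"] = thinking_budget
    return params
```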

…_budget 400)" This reverts commit 7236a51.

The tag reference enumerated 49 emotions + 7 example lines (~1k tokens) and was injected on every LLM call. Cut to ~17 representative emotions (the plain-English escape hatch covers the rest), tighter prose, and 3 examples — keeping the disfluency guidance, tone markers, sounds, and emphasis intact.

Tighten and trim both Fish presets: leaner shared guide, casual leans harder into unscripted disfluency (hesitations/false-starts/self-corrections) with a focused tag set, professional kept composed. Fix a few malformed example tags.

…e/expressive-fish

…isfluency/sounds

- Shared _FISHAUDIO_LLM_INSTRUCTIONS: add an explicit PUNCTUATION rule forbidding em/en dashes in spoken output (use commas, periods, or <break/>); applies to both registers.
- Casual: crank disfluencies to 4-5/turn (incl. real hesitations/self-corrections), make laughing/clear-throat frequent (>=1 sound most turns), add heavy few-shot examples, and de-dash the casual prose so it stops modeling em dashes.
- Customer-service (professional) left unchanged.

De-dash the three Fish Audio blocks (shared tag reference, customer-service, casual) so the prompt never models em dashes the LLM then mimics in speech. The no-dash rule itself now names the characters by description instead of printing them. Inworld/Cartesia/ElevenLabs untouched.
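
One way to keep the prompt blocks dash-free is a small check that could run in a test. This is a hypothetical helper, not something the PR adds; it refers to the characters by escape so the check itself never prints them.

```python
# Escapes keep the dash characters out of this source text as well.
FORBIDDEN = {"\u2014": "em dash", "\u2013": "en dash"}


def find_dashes(prompt: str) -> list[tuple[int, str]]:
    # Return (index, description) for every forbidden dash in the prompt.
    return [(i, FORBIDDEN[ch]) for i, ch in enumerate(prompt) if ch in FORBIDDEN]
```

An empty result means the block is safe to ship; a non-empty one points at the exact offsets to rewrite with commas or periods.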

CLAassistant · 2026-06-25T18:06:26Z

All committers have signed the CLA.

…ssions

casual preset: require an <expression> tag per sentence (same emotion to continue or a new one), push to change emotions quickly for the full range, and immediately mirror a shift to a sad topic instead of staying upbeat. adds a sad example.

markup: add convert_expression_to_fish, which prepends 'very' to each <expression> value for fish audio (e.g. <expression value="regretful"/> -> [very regretful]); sounds pass through unchanged, an already-'very' value isn't doubled, and other providers keep using convert_expression_tags.
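
The 'very' intensifier can be sketched like this. The regex and the name `expression_to_fish_sketch` are assumptions; the sketch only handles self-closing <expression> tags and leaves <sound> tags alone, which is one reading of "sounds pass through unchanged".

```python
import re

# Self-closing expression tags only; trailing attributes are tolerated.
_EXPRESSION_RE = re.compile(r'<expression\s+value="([^"]*)"[^>]*?/>')


def expression_to_fish_sketch(text: str) -> str:
    def intensify(m: re.Match) -> str:
        value = m.group(1).strip()
        # Do not double the prefix when the model already wrote "very ...".
        if not value.lower().startswith("very "):
            value = f"very {value}"
        return f"[{value}]"

    return _EXPRESSION_RE.sub(intensify, text)
```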

…rsers Resolved conflicts in favor of our tuned _FISHAUDIO_LLM_INSTRUCTIONS (9-emotion guide, no-em-dash rule, emphasis tag, per-sentence tagging) and our markup parsers (convert_expression_to_fish for the '[very ...]' intensifier, convert_emphasis_to_fish). Also removed fishaudio from the convert_expression_tags dispatch so the intensifier isn't pre-stripped, and deduped the fishaudio key the merge added to the presets registry.

- customer-service: add the missing space after 'fits the moment.' so it no longer glues to 'Keep a gentle...'.
- casual: add a newline after the language-switch bullet so it no longer merges into the 'sad topic' bullet.

… soft-tone example The expressive-fish merge left a second copy of _FISHAUDIO_CUSTOMER_SERVICE and _FISHAUDIO_CASUAL later in the file. By last-write-wins those stale copies shadowed our tuned presets (so Fish users got the old prompts). Removed the duplicates so the tuned presets at the top are the live ones. Also changed the off-list <expression value="in a soft tone"/> in the shared guide example to value="sad" (within the allowed emotion set).
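
The shadowing described above is ordinary Python behavior: a later module-level assignment to the same name silently replaces the earlier one. A hypothetical guard (not part of the PR) could flag such duplicates with the standard `ast` module:

```python
import ast
from collections import Counter


def duplicate_assignments(source: str) -> list[str]:
    """Return names assigned more than once at module top level."""
    counts: Counter = Counter()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    counts[target.id] += 1
        elif (
            isinstance(node, ast.AnnAssign)
            and isinstance(node.target, ast.Name)
            and node.value is not None
        ):
            counts[node.target.id] += 1
    return sorted(name for name, n in counts.items() if n > 1)


# Illustrative module text: the second PRESET wins at import time.
example = "PRESET = 'tuned'\nOTHER = 1\nPRESET = 'stale'\n"
namespace: dict = {}
exec(example, namespace)
```

Here `namespace["PRESET"]` ends up as the stale value, and `duplicate_assignments(example)` reports the name so the extra copy can be removed.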

the converter only maps <break time=...> to fish pause markers; a bare <break/> would leak as raw xml. point the instruction at the timed form.
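
A sketch of why the bare form leaks: a converter that only matches the timed form leaves `<break/>` untouched. The regexes, function names, and the `render` callback are assumptions; the actual fish pause marker is not shown in this PR text, so the caller supplies its own formatting here.

```python
import re
from typing import Callable

# Timed form only: <break time="500ms"/> or <break time="1.5s"/>.
_TIMED_BREAK_RE = re.compile(r'<break\s+time="(\d+(?:\.\d+)?)(ms|s)"\s*/>')
_BARE_BREAK_RE = re.compile(r"<break\s*/>")


def parse_break_seconds(amount: str, unit: str) -> float:
    # Normalise the duration to seconds.
    value = float(amount)
    return value / 1000.0 if unit == "ms" else value


def convert_timed_breaks(text: str, render: Callable[[float], str]) -> str:
    # Only timed breaks are rewritten; anything else is left as-is.
    return _TIMED_BREAK_RE.sub(
        lambda m: render(parse_break_seconds(m.group(1), m.group(2))), text
    )


def has_bare_break(text: str) -> bool:
    # True when an untimed <break/> would survive conversion as raw xml.
    return bool(_BARE_BREAK_RE.search(text))


def placeholder_render(seconds: float) -> str:
    # Hypothetical marker format, used only to make the example observable.
    return f"[pause:{seconds:g}s]"
```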

soft 'a couple' language drifted hard by model: near-zero on gpt-5.1, a flood on gemma. a hard min-1/max-3 count with at most one stutter lands both in the same casual band, and travels to other models better than vague quantifiers. examples retuned to match the quota (one disfluency each, sad case uses a gentle hesitation not a stutter).

…eview) theo: provider-specific helpers belong with the provider-format logic. move the three fish converters (the expression 'very' intensifier, break, and emphasis) plus their fish-only regexes into _provider_format. keep the generic helpers (convert_expression_tags, convert_break_to_ellipsis, strip_xml_tags, strip_bracket_tags) in markup_utils, which other modules and the tokenizer tests still import; the fish expression converter reuses the shared regexes.

tinalenguyen and others added 18 commits June 16, 2026 03:10

wip

e2271c2

presets: rename CONVERSATIONAL preset to CASUAL

2e95f74

Rename across the Preset enum, the public presets.CASUAL constant, the per-provider registry, and the _<provider>_CASUAL preset bodies. Example agents updated to use presets.CASUAL.

Merge pull request #1 from tinalenguyen/tina/fish-presets

c0fa809

fish audio: refine customer service preset, drop healthcare

Merge remote-tracking branch 'upstream/tina/expressive-mode' into cal…

b15cde2

…e/expressive-fish

# Conflicts:
#	examples/inference/agent.py
#	examples/survey/agent.py
#	livekit-agents/livekit/agents/tts/_provider_format.py
#	livekit-agents/livekit/agents/voice/presets.py

Revert "google plugin: support Gemma 4 thinking_level (fixes thinking…

9bb4162

…_budget 400)" This reverts commit 7236a51.

Merge remote-tracking branch 'upstream/tina/expressive-mode' into cal…

661aafb

…e/expressive-fish

Drop the fish console example from the PR branch

058a148

ruff format _provider_format.py to satisfy CI

fff0fa3

tinalenguyen force-pushed the tina/expressive-fish branch from e2271c2 to cf880cf on June 25, 2026 20:01

tinalenguyen requested a review from a team as a code owner June 25, 2026 20:01

theomonnom approved these changes Jun 25, 2026

cshape added 2 commits June 25, 2026 17:08

fish presets: fix two glued string literals flagged in PR review

fa4fca3

- customer-service: add the missing space after 'fits the moment.' so it no longer glues to 'Keep a gentle...'.
- casual: add a newline after the language-switch bullet so it no longer merges into the 'sad topic' bullet.

fish instructions: use <break time> not bare <break/> (PR review)

a28ecbd

the converter only maps <break time=...> to fish pause markers; a bare <break/> would leak as raw xml. point the instruction at the timed form.

theomonnom reviewed Jun 26, 2026

Comment thread livekit-agents/livekit/agents/tts/markup_utils.py Outdated

cshape force-pushed the cale/expressive-fish branch from cb72777 to c0d48eb on June 26, 2026 17:49

cshape wants to merge 26 commits into livekit:tina/expressive-fish from cshape:cale/expressive-fish

4 participants
