fish audio support for expressive mode + runtime update_expressive() #6232
Open
cshape wants to merge 26 commits into
base instructions:
- add a natural-voice preamble (spoken, not written; use contractions; expand
numbers/abbreviations; pacing via punctuation + the <break> tag; fillers and
self-repairs are part of how real speech sounds)
- document intensity modifiers ("slightly sad", "very excited", etc.) as a
free-form prefix on any emotion
- swap the example set to cover an intensity modifier, two tone markers
(whispering, in a hurry tone), two advanced emotions (regretful, hopeful),
and an emphasis demonstration
- drop the heavy "tag every clause" guidance in favor of "tag every sentence,
retag when the feeling shifts, don't stack conflicting ones; reach for the
specific emotion over the broadest basic"
- add <emphasis>WORD</emphasis> as the fourth tag; converter wires it through
convert_emphasis_to_fish() to fish's native [emphasis] WORD marker, and the
tag is added to _FISHAUDIO_TAGS so strip_markup removes it from transcripts
presets:
- customer service: drop the dense per-moment expression map; keep the
de-escalation, enunciation, and stay-in-lane bullets
- healthcare: tighten the emotion-mapping bullet; add a non-verbal-sounds
bullet that hard-suppresses laughing/yawning/snoring/crowd sounds
- conversational: rewrite the sound bullet so sighing is last and gated
("ONLY when truly commiserating") instead of the obvious default; split the
"pace with punctuation" bullet into a dedicated <break> bullet (with a
problem -> reassurance example) and a streamlined punctuation/emphasis
bullet; add a dedicated contractions bullet with concrete pairs; bump
disfluency target from "zero to two per turn" to "one to two per turn" and
add self-repairs ("I, I think") and "for sure"/"a little" to the texture
list
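The emphasis wiring described in the base instructions above can be pictured with a minimal sketch. The source only states that `<emphasis>WORD</emphasis>` becomes fish's `[emphasis] WORD` marker and that the tag is stripped from transcripts. The regex and both function names below are hypothetical stand-ins, not the PR's code.

```python
import re

# Hypothetical pattern: the PR's real regex is not shown in this thread.
_EMPHASIS_SKETCH_RE = re.compile(r"<emphasis>(.*?)</emphasis>", re.DOTALL)


def emphasis_to_fish_sketch(text: str) -> str:
    """Rewrite <emphasis>WORD</emphasis> as the '[emphasis] WORD' marker."""
    return _EMPHASIS_SKETCH_RE.sub(
        lambda m: f"[emphasis] {m.group(1).strip()}", text
    )


def strip_emphasis_sketch(text: str) -> str:
    """Drop the tags but keep the word, as a transcript would show it."""
    return _EMPHASIS_SKETCH_RE.sub(lambda m: m.group(1), text)


if __name__ == "__main__":
    line = "That is <emphasis>really</emphasis> good."
    print(emphasis_to_fish_sketch(line))
    print(strip_emphasis_sketch(line))
```

The two helpers share one pattern so the TTS path and the transcript path cannot drift apart on what counts as an emphasis tag.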
smaller models occasionally emit two malformations the original regex couldn't handle. both leaked raw xml through to the tts provider:
<expression value="X" empathetic/> -- extra trailing attribute
<expression value="X" >content</expression> -- wrapping form, ws before >
changes in convert_expression_tags:
- _EXPRESSION_RE / _SOUND_RE swap the trailing `\s*` between value and close for `[^>]*?` (lazy, any non-`>`), so trailing attributes get ignored
- the wrapping branch now captures inner content under `re.DOTALL` so the substitution emits `[X]content` instead of leaving the tags around it
- add _ORPHAN_CLOSE_RE that strips dangling `</expression>` / `</sound>` after conversion, which normalize_markup creates when it rewrites a wrapping opener to self-closing form
verified against the two real-world malformations plus the existing self-closing and well-formed wrapping cases. shared with inworld and elevenlabs v3 since they route through the same convert_expression_tags.
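A rough sketch of the tolerant matching this commit describes, assuming both `<expression>` and `<sound>` tags convert to a `[X]` bracket marker. The three patterns and `convert_tags_sketch` are illustrative reconstructions from the description, not the PR's actual `_EXPRESSION_RE`, `_SOUND_RE`, or `_ORPHAN_CLOSE_RE`.

```python
import re

# Self-closing form. The lazy [^>]*? skips any trailing attributes
# (e.g. a stray "empathetic") without ever crossing a '>'.
_SELF_CLOSING_SKETCH_RE = re.compile(
    r'<(?:expression|sound)\s+value="([^"]*)"[^>]*?/>'
)

# Wrapping form. DOTALL lets the inner content span newlines.
_WRAPPING_SKETCH_RE = re.compile(
    r'<(expression|sound)\s+value="([^"]*)"[^>]*?>(.*?)</\1\s*>',
    re.DOTALL,
)

# Closing tags left behind once their opener has been rewritten.
_ORPHAN_CLOSE_SKETCH_RE = re.compile(r"</(?:expression|sound)\s*>")


def convert_tags_sketch(text: str) -> str:
    # Self-closing tags go first so the wrapping pattern cannot mistake
    # '<expression value="X" foo/>' for a wrapping opener.
    text = _SELF_CLOSING_SKETCH_RE.sub(r"[\1]", text)
    text = _WRAPPING_SKETCH_RE.sub(r"[\2]\3", text)
    return _ORPHAN_CLOSE_SKETCH_RE.sub("", text)


if __name__ == "__main__":
    print(convert_tags_sketch('<expression value="sad" empathetic/> I am sorry.'))
    print(convert_tags_sketch('<expression value="hopeful" >It will work out.</expression>'))
```

Because `[^>]*?` can never consume a `>`, the self-closing pattern cannot run past the end of its own tag, which is what keeps the trailing-attribute tolerance safe.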
Fold the sensitivity/softening guidance from the healthcare preset into the customer service preset and merge HEALTHCARE into CUSTOMER_SERVICE in the registry so presets.HEALTHCARE falls back to the agnostic default for Fish Audio.
Rename across the Preset enum, the public presets.CASUAL constant, the per-provider registry, and the _<provider>_CASUAL preset bodies. Example agents updated to use presets.CASUAL.
fish audio: refine customer service preset, drop healthcare
…e/expressive-fish
# Conflicts:
# examples/inference/agent.py
# examples/survey/agent.py
# livekit-agents/livekit/agents/tts/_provider_format.py
# livekit-agents/livekit/agents/voice/presets.py
gpt-5.4-class models sat at the low end of the casual guidance (0 sounds, ~1 filler/turn). Make casual lean in: most turns should carry a non-verbal sound and 2-3 disfluencies (fillers, false starts, self-repairs, light stutters), since that texture is what sells a real casual voice.
gpt-5.4-class models collapse casual onto happy/curious expressions and chuckling every turn, which reads repetitive. Call those out as lazy defaults and push for the full expression range + sound variety (no chuckling on consecutive turns).
…-corrections) Models still produced clean, polished casual sentences. Frame casual as UNSCRIPTED in the persona line and rewrite the disfluency guideline to demand a real hesitation or self-correction every turn, calling polished first-try sentences the failure mode.
…400) Gemma 4 controls reasoning via thinking_level (only 'minimal'/'high' valid), not thinking_budget. The plugin routed it into the gemini-2.5 budget branch, so thinking_budget=0 returned 400 and thinking_level was rejected. Detect gemma-4 and route it through the thinking_level path, defaulting to 'minimal' (off). With thinking off, Gemma 4 31B drops from ~20s to ~1.3s and stops leaking its planning into the response.
…_budget 400)" This reverts commit 7236a51.
The tag reference enumerated 49 emotions + 7 example lines (~1k tokens) and was injected on every LLM call. Cut to ~17 representative emotions (the plain-English escape hatch covers the rest), tighter prose, and 3 examples — keeping the disfluency guidance, tone markers, sounds, and emphasis intact.
Tighten and trim both Fish presets: leaner shared guide, casual leans harder into unscripted disfluency (hesitations/false-starts/self-corrections) with a focused tag set, professional kept composed. Fix a few malformed example tags.
…e/expressive-fish
…isfluency/sounds
- Shared _FISHAUDIO_LLM_INSTRUCTIONS: add an explicit PUNCTUATION rule forbidding em/en dashes in spoken output (use commas, periods, or <break/>); applies to both registers.
- Casual: crank disfluencies to 4-5/turn (incl. real hesitations/self-corrections), make laughing/clear-throat frequent (>=1 sound most turns), add heavy few-shot examples, and de-dash the casual prose so it stops modeling em dashes.
- Customer-service (professional) left unchanged.
De-dash the three Fish Audio blocks (shared tag reference, customer-service, casual) so the prompt never models em dashes the LLM then mimics in speech. The no-dash rule itself now names the characters by description instead of printing them. Inworld/Cartesia/ElevenLabs untouched.
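One way to keep the prompt blocks dash-free over time is a small guard like the sketch below. It is an illustration only: `find_dashes` is a hypothetical helper, and nothing in this thread says the PR ships such a check. The characters are written as escapes so the guard itself never prints them.

```python
EM_DASH = "\u2014"  # named by escape so this file stays dash-free
EN_DASH = "\u2013"


def find_dashes(prompt: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every line holding an em or en dash."""
    return [
        (number, line)
        for number, line in enumerate(prompt.splitlines(), start=1)
        if EM_DASH in line or EN_DASH in line
    ]


if __name__ == "__main__":
    sample = "Use commas, periods, or a timed break.\nNever do this \u2014 ever."
    for number, line in find_dashes(sample):
        print(f"line {number}: {line!r}")
```

Run against each prompt constant in a unit test, a non-empty result would flag a regression before the LLM has a chance to mimic it.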
e2271c2 to
cf880cf
theomonnom
approved these changes
Jun 25, 2026
…ssions
casual preset: require an <expression> tag per sentence (same emotion to continue or a new one), push to change emotions quickly for the full range, and immediately mirror a shift to a sad topic instead of staying upbeat. adds a sad example.
markup: add convert_expression_to_fish, which prepends 'very' to each <expression> value for fish audio (e.g. <expression value="regretful"/> -> [very regretful]); sounds pass through unchanged, an already-'very' value isn't doubled, and other providers keep using convert_expression_tags.
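The 'very' intensifier described above could look roughly like this. It is a sketch built only from the behavior stated in the commit message: the pattern and the `_sketch` name are assumptions, and the real convert_expression_to_fish may differ.

```python
import re

# Hypothetical pattern for self-closing <expression> tags only, so that
# <sound> tags are left exactly as they were.
_EXPRESSION_SKETCH_RE = re.compile(r'<expression\s+value="([^"]*)"[^>]*?/>')


def convert_expression_to_fish_sketch(text: str) -> str:
    """Turn <expression value="X"/> into '[very X]' without doubling 'very'."""

    def _intensify(match: re.Match) -> str:
        value = match.group(1).strip()
        if not value.lower().startswith("very "):
            value = f"very {value}"
        return f"[{value}]"

    return _EXPRESSION_SKETCH_RE.sub(_intensify, text)


if __name__ == "__main__":
    print(convert_expression_to_fish_sketch('<expression value="regretful"/> I missed it.'))
    print(convert_expression_to_fish_sketch('<expression value="very excited"/> We won!'))
```

Keeping the intensifier in a fish-only converter matches the commit's note that other providers continue to use convert_expression_tags unchanged.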
…rsers Resolved conflicts in favor of our tuned _FISHAUDIO_LLM_INSTRUCTIONS (9-emotion guide, no-em-dash rule, emphasis tag, per-sentence tagging) and our markup parsers (convert_expression_to_fish for the '[very ...]' intensifier, convert_emphasis_to_fish). Also removed fishaudio from the convert_expression_tags dispatch so the intensifier isn't pre-stripped, and deduped the fishaudio key the merge added to the presets registry.
- customer-service: add the missing space after 'fits the moment.' so it no longer glues to 'Keep a gentle...'.
- casual: add a newline after the language-switch bullet so it no longer merges into the 'sad topic' bullet.
… soft-tone example The expressive-fish merge left a second copy of _FISHAUDIO_CUSTOMER_SERVICE and _FISHAUDIO_CASUAL later in the file. By last-write-wins those stale copies shadowed our tuned presets (so Fish users got the old prompts). Removed the duplicates so the tuned presets at the top are the live ones. Also changed the off-list <expression value="in a soft tone"/> in the shared guide example to value="sad" (within the allowed emotion set).
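The shadowing described above is ordinary Python behavior: a later module-level assignment to the same name replaces the earlier one. A small check along these lines could catch it. This is a sketch, `duplicate_assignments` is a hypothetical helper, and it only looks at plain top-level `name = value` statements.

```python
import ast
from collections import Counter


def duplicate_assignments(source: str) -> list[str]:
    """Return names assigned more than once at module top level."""
    counts: Counter = Counter()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    counts[target.id] += 1
    return sorted(name for name, seen in counts.items() if seen > 1)


if __name__ == "__main__":
    module_text = (
        '_FISHAUDIO_CASUAL = "tuned"\n'
        "_OTHER = 1\n"
        '_FISHAUDIO_CASUAL = "stale"\n'
    )
    print(duplicate_assignments(module_text))

    namespace: dict = {}
    exec(module_text, namespace)
    # The later copy is the one importers actually see.
    print(namespace["_FISHAUDIO_CASUAL"])
```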
the converter only maps <break time=...> to fish pause markers; a bare <break/> would leak as raw xml. point the instruction at the timed form.
theomonnom
reviewed
Jun 26, 2026
soft 'a couple' language drifted hard by model: near-zero on gpt-5.1, a flood on gemma. a hard min-1/max-3 count with at most one stutter lands both in the same casual band, and travels to other models better than vague quantifiers. examples retuned to match the quota (one disfluency each, sad case uses a gentle hesitation not a stutter).
…eview) theo: provider-specific helpers belong with the provider-format logic. move the three fish converters (the expression 'very' intensifier, break, and emphasis) plus their fish-only regexes into _provider_format. keep the generic helpers (convert_expression_tags, convert_break_to_ellipsis, strip_xml_tags, strip_bracket_tags) in markup_utils, which other modules and the tokenizer tests still import; the fish expression converter reuses the shared regexes.
cb72777 to
c0d48eb
brings cale/expressive-fish current with tina/expressive-mode and adds fish audio support plus a runtime expressive setter.
heads up on size: the file count is large only because expressive-fish is ~73 commits behind expressive-mode, so most of the diff is that mainline drift. the actual new work is the handful of files below.
framework (provider-agnostic):
fish provider:
test plan: