fish audio support for expressive mode + runtime update_expressive()#6232

Open
cshape wants to merge 26 commits into livekit:tina/expressive-fish from cshape:cale/expressive-fish

Conversation

@cshape cshape commented Jun 25, 2026

Contributor

brings cale/expressive-fish current with tina/expressive-mode and adds fish audio support plus a runtime expressive setter.

heads up on size: the file count is large only because expressive-fish is ~73 commits behind expressive-mode, so most of the diff is that mainline drift. the actual new work is the handful of files below.

framework (provider-agnostic):

  • voice/agent.py: add Agent.update_expressive() so an agent can change its expressive setting mid-session, not just at construction. it just assigns _expressive; AgentActivity already re-resolves expressive options per generation, so the change lands on the next turn. symmetric with the existing update_instructions / update_tools setters.
  • tts/markup_utils.py: harden the shared markup conversion. more permissive expression/sound regexes that tolerate the malformations smaller llms emit (extra trailing attrs, the wrapping form, orphan closing tags) so raw xml never reaches the provider. adds convert_emphasis_to_fish and break-duration parsing.
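
a minimal sketch of the setter behaviour, using stand-in classes rather than the real livekit types. the `Agent` and `AgentActivity` below are simplified placeholders, and `generate_reply` plus the string values are assumptions for illustration. the only parts taken from the description are that `update_expressive()` assigns `_expressive` and that the setting is re-read on each generation.

```python
from __future__ import annotations


class Agent:
    """Stand-in for the real Agent; only the expressive setting is modelled."""

    def __init__(self, expressive: str | None = None) -> None:
        self._expressive = expressive

    def update_expressive(self, expressive: str | None) -> None:
        # Plain assignment: nothing is recomputed at call time.
        self._expressive = expressive


class AgentActivity:
    """Stand-in for the activity that drives each generation."""

    def __init__(self, agent: Agent) -> None:
        self._agent = agent

    def generate_reply(self, text: str) -> str:
        # The setting is read again on every generation, so a mid-session
        # update shows up on the next turn without extra plumbing.
        mode = self._agent._expressive or "off"
        return f"[expressive={mode}] {text}"


agent = Agent(expressive="customer_service")
activity = AgentActivity(agent)

first = activity.generate_reply("hello")
agent.update_expressive("casual")
second = activity.generate_reply("hello again")

print(first)
print(second)
```

because the setter holds no derived state, a turn already in flight keeps whatever it resolved at its start, and only later turns see the new value.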

fish provider:

  • tts/_provider_format.py: register fishaudio across llm_instructions / convert / strip, plus the casual and customer-service preset bodies.
  • voice/presets.py: register the fishaudio entries so presets.CASUAL / CUSTOMER_SERVICE resolve for fish.
  • plugins/fishaudio/tts.py: Markup._provider_key() returns fishaudio only for s2 models; ChunkedStream / SynthesizeStream convert markup to fish brackets before the request.

test plan:

  • update_expressive(presets.CASUAL) mid-session, confirm the next reply uses it.
  • run an s2 fish session in expressive mode, confirm tags convert to brackets in audio and are stripped from the transcript.

tinalenguyen and others added 18 commits June 16, 2026 03:10

base instructions:
- add a natural-voice preamble (spoken, not written; use contractions; expand
  numbers/abbreviations; pacing via punctuation + the <break> tag; fillers and
  self-repairs are part of how real speech sounds)
- document intensity modifiers ("slightly sad", "very excited", etc.) as a
  free-form prefix on any emotion
- swap the example set to cover an intensity modifier, two tone markers
  (whispering, in a hurry tone), two advanced emotions (regretful, hopeful),
  and an emphasis demonstration
- drop the heavy "tag every clause" guidance in favor of "tag every sentence,
  retag when the feeling shifts, don't stack conflicting ones; reach for the
  specific emotion over the broadest basic"
- add <emphasis>WORD</emphasis> as the fourth tag; converter wires it through
  convert_emphasis_to_fish() to fish's native [emphasis] WORD marker, and the
  tag is added to _FISHAUDIO_TAGS so strip_markup removes it from transcripts
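
a small sketch of the emphasis path, assuming a simple non-nested tag. the regex and the `strip_emphasis` helper are hypothetical stand-ins (the commit routes transcript stripping through strip_markup and _FISHAUDIO_TAGS); only the `[emphasis] WORD` output shape and the `convert_emphasis_to_fish` name come from the notes above.

```python
import re

# Hypothetical pattern for illustration; the real one may differ.
_EMPHASIS_RE = re.compile(r"<emphasis>(.*?)</emphasis>", re.DOTALL)


def convert_emphasis_to_fish(text: str) -> str:
    """Rewrite <emphasis>WORD</emphasis> as an [emphasis] WORD marker."""
    return _EMPHASIS_RE.sub(lambda m: f"[emphasis] {m.group(1).strip()}", text)


def strip_emphasis(text: str) -> str:
    """Transcript path (stand-in): drop the tags, keep the spoken word."""
    return _EMPHASIS_RE.sub(lambda m: m.group(1), text)


source = "that is <emphasis>really</emphasis> good"
spoken = convert_emphasis_to_fish(source)
transcript = strip_emphasis(source)

print(spoken)
print(transcript)
```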

presets:
- customer service: drop the dense per-moment expression map; keep the
  de-escalation, enunciation, and stay-in-lane bullets
- healthcare: tighten the emotion-mapping bullet; add a non-verbal-sounds
  bullet that hard-suppresses laughing/yawning/snoring/crowd sounds
- conversational: rewrite the sound bullet so sighing is last and gated
  ("ONLY when truly commiserating") instead of the obvious default; split the
  "pace with punctuation" bullet into a dedicated <break> bullet (with a
  problem -> reassurance example) and a streamlined punctuation/emphasis
  bullet; add a dedicated contractions bullet with concrete pairs; bump
  disfluency target from "zero to two per turn" to "one to two per turn" and
  add self-repairs ("I, I think") and "for sure"/"a little" to the texture
  list

smaller models occasionally emit two malformations the original regex couldn't
handle. both leaked raw xml through to the tts provider:

  <expression value="X" empathetic/>      -- extra trailing attribute
  <expression value="X" >content</expression>   -- wrapping form, ws before >

changes in convert_expression_tags:

- _EXPRESSION_RE / _SOUND_RE swap the trailing `\s*` between value and close for
  `[^>]*?` (lazy, any non-`>`), so trailing attributes get ignored
- the wrapping branch now captures inner content under `re.DOTALL` so the
  substitution emits `[X]content` instead of leaving the tags around it
- add _ORPHAN_CLOSE_RE that strips dangling `</expression>` / `</sound>` after
  conversion, which normalize_markup creates when it rewrites a wrapping opener
  to self-closing form

verified against the two real-world malformations plus the existing
self-closing and well-formed wrapping cases. shared with inworld and
elevenlabs v3 since they route through the same convert_expression_tags.
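
a rough sketch of the three regex changes, not the actual markup_utils code. the pattern bodies and the `_SELF_CLOSING_RE` / `_WRAPPING_RE` names are assumptions; only the `[^>]*?` gap, the `re.DOTALL` wrapping capture, and the orphan-close cleanup come from the notes above.

```python
import re

# `[^>]*?` after the value skips any stray trailing attributes.
_SELF_CLOSING_RE = re.compile(
    r'<(?:expression|sound)\s+value="([^"]*)"[^>]*?/>'
)
# Wrapping form: capture the inner content, which may span lines (DOTALL).
_WRAPPING_RE = re.compile(
    r'<(expression|sound)\s+value="([^"]*)"[^>]*?>(.*?)</\1\s*>',
    re.DOTALL,
)
# Closers left behind once their opener has already been rewritten.
_ORPHAN_CLOSE_RE = re.compile(r"</(?:expression|sound)\s*>")


def convert_expression_tags(text: str) -> str:
    """Turn expression/sound tags into [value] brackets (sketch)."""
    text = _SELF_CLOSING_RE.sub(r"[\1]", text)
    text = _WRAPPING_RE.sub(r"[\2]\3", text)
    return _ORPHAN_CLOSE_RE.sub("", text)


cases = [
    '<expression value="sad" empathetic/> sorry',   # extra trailing attribute
    '<expression value="sad" >oh no</expression>',  # wrapping form, ws before >
    '<expression value="sad"/>oh no</expression>',  # orphan closer
    '<sound value="laughing"/> ok',                 # plain self-closing
]
results = [convert_expression_tags(c) for c in cases]
for r in results:
    print(r)
```

running the self-closing pass first means anything still carrying an opener afterwards must be the wrapping form, which keeps the two patterns from competing for the same tag.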
Fold the sensitivity/softening guidance from the healthcare preset into
the customer service preset and merge HEALTHCARE into CUSTOMER_SERVICE in
the registry so presets.HEALTHCARE falls back to the agnostic default for
Fish Audio.

Rename across the Preset enum, the public presets.CASUAL constant, the
per-provider registry, and the _<provider>_CASUAL preset bodies. Example
agents updated to use presets.CASUAL.

fish audio: refine customer service preset, drop healthcare
…e/expressive-fish

# Conflicts:
#	examples/inference/agent.py
#	examples/survey/agent.py
#	livekit-agents/livekit/agents/tts/_provider_format.py
#	livekit-agents/livekit/agents/voice/presets.py
gpt-5.4-class models sat at the low end of the casual guidance (0 sounds,
~1 filler/turn). Make casual lean in: most turns should carry a non-verbal
sound and 2-3 disfluencies (fillers, false starts, self-repairs, light
stutters), since that texture is what sells a real casual voice.

gpt-5.4-class models collapse casual onto happy/curious expressions and
chuckling every turn, which reads repetitive. Call those out as lazy
defaults and push for the full expression range + sound variety (no
chuckling on consecutive turns).

…-corrections)

Models still produced clean, polished casual sentences. Frame casual as
UNSCRIPTED in the persona line and rewrite the disfluency guideline to
demand a real hesitation or self-correction every turn, calling polished
first-try sentences the failure mode.

…400)

Gemma 4 controls reasoning via thinking_level (only 'minimal'/'high' valid),
not thinking_budget — the plugin routed it into the gemini-2.5 budget branch,
so thinking_budget=0 returned 400 and thinking_level was rejected. Detect
gemma-4 and route it through the thinking_level path, defaulting to 'minimal'
(off). With thinking off Gemma 4 31B drops from ~20s to ~1.3s and stops
leaking its planning into the response.
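
A minimal sketch of the routing described above, assuming the plugin builds its thinking options as a plain dict. The `build_thinking_config` name, the substring check on the model name, and the example model strings are assumptions for illustration; the valid levels and the 'minimal' default come from the commit message.

```python
from __future__ import annotations

# Per the commit message, Gemma 4 accepts only these two levels.
_GEMMA4_LEVELS = {"minimal", "high"}


def build_thinking_config(
    model: str,
    thinking_level: str | None = None,
    thinking_budget: int | None = None,
) -> dict[str, object]:
    """Pick the thinking option the model family understands (sketch)."""
    if "gemma-4" in model.lower():
        # Gemma 4: never send thinking_budget, default to 'minimal' (off).
        level = thinking_level or "minimal"
        if level not in _GEMMA4_LEVELS:
            raise ValueError(f"unsupported thinking_level for Gemma 4: {level!r}")
        return {"thinking_level": level}
    if thinking_budget is not None:
        # Budget-based models keep the existing branch.
        return {"thinking_budget": thinking_budget}
    return {}


gemma = build_thinking_config("gemma-4-31b-it", thinking_budget=0)
gemini = build_thinking_config("gemini-2.5-flash", thinking_budget=0)
print(gemma)
print(gemini)
```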
The tag reference enumerated 49 emotions + 7 example lines (~1k tokens) and
was injected on every LLM call. Cut to ~17 representative emotions (the
plain-English escape hatch covers the rest), tighter prose, and 3 examples —
keeping the disfluency guidance, tone markers, sounds, and emphasis intact.

Tighten and trim both Fish presets: leaner shared guide, casual leans harder
into unscripted disfluency (hesitations/false-starts/self-corrections) with a
focused tag set, professional kept composed. Fix a few malformed example tags.

…isfluency/sounds

- Shared _FISHAUDIO_LLM_INSTRUCTIONS: add an explicit PUNCTUATION rule forbidding
  em/en dashes in spoken output (use commas, periods, or <break/>); applies to both
  registers.
- Casual: crank disfluencies to 4-5/turn (incl. real hesitations/self-corrections),
  make laughing/clear-throat frequent (>=1 sound most turns), add heavy few-shot
  examples, and de-dash the casual prose so it stops modeling em dashes.
- Customer-service (professional) left unchanged.

De-dash the three Fish Audio blocks (shared tag reference, customer-service,
casual) so the prompt never models em dashes the LLM then mimics in speech.
The no-dash rule itself now names the characters by description instead of
printing them. Inworld/Cartesia/ElevenLabs untouched.
@CLAassistant

CLAassistant commented Jun 25, 2026


CLA assistant check
All committers have signed the CLA.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@tinalenguyen tinalenguyen force-pushed the tina/expressive-fish branch from e2271c2 to cf880cf Compare June 25, 2026 20:01
@tinalenguyen tinalenguyen requested a review from a team as a code owner June 25, 2026 20:01
…ssions

casual preset: require an <expression> tag per sentence (same emotion to continue
or a new one), push to change emotions quickly for the full range, and immediately
mirror a shift to a sad topic instead of staying upbeat. adds a sad example.

markup: add convert_expression_to_fish, which prepends 'very' to each <expression>
value for fish audio (e.g. <expression value="regretful"/> -> [very regretful]);
sounds pass through unchanged, an already-'very' value isn't doubled, and other
providers keep using convert_expression_tags.
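
a short sketch of the intensifier, assuming self-closing tags only. the regexes and the `_intensify` helper are hypothetical, and rendering sounds as plain brackets is an assumption about what "pass through unchanged" means here; the `[very regretful]` output and the no-doubling rule come from the commit message.

```python
import re

# Hypothetical patterns; the real converter reuses the shared regexes.
_EXPRESSION_RE = re.compile(r'<expression\s+value="([^"]*)"[^>]*?/>')
_SOUND_RE = re.compile(r'<sound\s+value="([^"]*)"[^>]*?/>')


def _intensify(match: re.Match) -> str:
    value = match.group(1).strip()
    # Do not double an intensifier the model already wrote.
    if not value.lower().startswith("very "):
        value = f"very {value}"
    return f"[{value}]"


def convert_expression_to_fish(text: str) -> str:
    """Prefix expression values with 'very'; leave sound values as-is."""
    text = _EXPRESSION_RE.sub(_intensify, text)
    return _SOUND_RE.sub(lambda m: f"[{m.group(1)}]", text)


plain = convert_expression_to_fish('<expression value="regretful"/> sorry')
already = convert_expression_to_fish('<expression value="very excited"/> wow')
sound = convert_expression_to_fish('<sound value="laughing"/> ok')
print(plain)
print(already)
print(sound)
```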
devin-ai-integration[bot]

This comment was marked as resolved.

cshape added 2 commits June 25, 2026 17:08
…rsers

Resolved conflicts in favor of our tuned _FISHAUDIO_LLM_INSTRUCTIONS (9-emotion
guide, no-em-dash rule, emphasis tag, per-sentence tagging) and our markup parsers
(convert_expression_to_fish for the '[very ...]' intensifier, convert_emphasis_to_fish).
Also removed fishaudio from the convert_expression_tags dispatch so the intensifier
isn't pre-stripped, and deduped the fishaudio key the merge added to the presets registry.

- customer-service: add the missing space after 'fits the moment.' so it no longer
  glues to 'Keep a gentle...'.
- casual: add a newline after the language-switch bullet so it no longer merges into
  the 'sad topic' bullet.
devin-ai-integration[bot]

This comment was marked as resolved.

… soft-tone example

The expressive-fish merge left a second copy of _FISHAUDIO_CUSTOMER_SERVICE and
_FISHAUDIO_CASUAL later in the file. By last-write-wins those stale copies shadowed
our tuned presets (so Fish users got the old prompts). Removed the duplicates so the
tuned presets at the top are the live ones.

Also changed the off-list <expression value="in a soft tone"/> in the shared guide
example to value="sad" (within the allowed emotion set).
devin-ai-integration[bot]

This comment was marked as resolved.

the converter only maps <break time=...> to fish pause markers; a bare
<break/> would leak as raw xml. point the instruction at the timed form.
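
a hedged sketch of why the timed form matters: a converter keyed on `time=...` never sees a bare `<break/>`. both regexes and helper names are hypothetical, and the fish pause-marker syntax is deliberately left out since it is not shown in this thread; the example only parses durations and flags the bare form.

```python
import re

# Hypothetical patterns for illustration only.
_TIMED_BREAK_RE = re.compile(r'<break\s+time="(\d+(?:\.\d+)?)(ms|s)"\s*/>')
_BARE_BREAK_RE = re.compile(r"<break\s*/>")


def break_seconds(text: str) -> list[float]:
    """Durations, in seconds, of every timed <break> in the text."""
    durations = []
    for amount, unit in _TIMED_BREAK_RE.findall(text):
        seconds = float(amount) / 1000 if unit == "ms" else float(amount)
        durations.append(seconds)
    return durations


def has_bare_break(text: str) -> bool:
    """True if a <break/> without a time attribute would slip through."""
    return _BARE_BREAK_RE.search(text) is not None


timed = 'wait <break time="500ms"/> ok <break time="1s"/> done'
bare = "hm <break/> ok"
print(break_seconds(timed), has_bare_break(timed))
print(break_seconds(bare), has_bare_break(bare))
```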
Comment thread livekit-agents/livekit/agents/tts/markup_utils.py Outdated
soft 'a couple' language drifted hard by model: near-zero on gpt-5.1, a
flood on gemma. a hard min-1/max-3 count with at most one stutter lands
both in the same casual band, and travels to other models better than
vague quantifiers. examples retuned to match the quota (one disfluency
each, sad case uses a gentle hesitation not a stutter).
devin-ai-integration[bot]

This comment was marked as resolved.

…eview)

theo: provider-specific helpers belong with the provider-format logic. move
the three fish converters (the expression 'very' intensifier, break, and
emphasis) plus their fish-only regexes into _provider_format. keep the generic
helpers (convert_expression_tags, convert_break_to_ellipsis, strip_xml_tags,
strip_bracket_tags) in markup_utils, which other modules and the tokenizer
tests still import; the fish expression converter reuses the shared regexes.
@cshape cshape force-pushed the cale/expressive-fish branch from cb72777 to c0d48eb Compare June 26, 2026 17:49
4 participants