fix(stream): decode SSE/hex audio for --stream and handle EPIPE #55

Open
moxi000 wants to merge 3 commits into MiniMax-AI:main from moxi000:fix/stream-raw-audio

moxi000 commented Apr 9, 2026

Summary

Fixes #54.

  • speech synthesize --stream and music generate --stream now write decoded raw audio bytes to stdout, matching the documentation and making the existing | mpv --no-terminal - example actually work. Previously they wrote the upstream SSE stream verbatim (data: {"data":{"audio":"<hex>"}} envelopes), so any audio player saw garbage and bailed with "Failed to recognize file format".
  • A new shared helper src/utils/audio-stream.ts (pipeAudioSseToStdout) parses SSE frames, JSON-decodes each data: line, hex-decodes data.audio, and writes the bytes to stdout. Buffers across chunk boundaries so partial events are handled correctly.
  • The same helper installs an EPIPE handler on process.stdout, so consumers closing the pipe early (... | head, a player that exits, or mpv not being installed) result in a clean exit(0) instead of an unhandled 'error' event and Node stack trace.
  • Both commands/speech/synthesize.ts and commands/music/generate.ts had the same bug; both now use the shared helper.
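
The decode step described above can be sketched as follows. This is a minimal illustration, not the PR's actual helper: the function name `decodeAudioPayload` is hypothetical, and the envelope shape is assumed from the `data: {"data":{"audio":"<hex>"}}` example in this description.

```typescript
// Envelope shape assumed from the PR description.
interface AudioEnvelope {
  data?: { audio?: string };
}

// Hypothetical helper: turn one JSON payload (the text after `data:`)
// into raw audio bytes. Returns null when the payload carries no audio.
function decodeAudioPayload(payload: string): Buffer | null {
  let parsed: AudioEnvelope | null;
  try {
    parsed = JSON.parse(payload) as AudioEnvelope | null;
  } catch {
    return null; // not JSON, for example a keepalive
  }
  const hex = parsed?.data?.audio;
  if (typeof hex !== 'string' || hex.length === 0) return null;
  // Two hex characters encode one byte of audio.
  return Buffer.from(hex, 'hex');
}

const bytes = decodeAudioPayload('{"data":{"audio":"494433"}}');
console.log(bytes?.toString('latin1')); // "ID3", the start of an ID3-tagged MP3
```

Writing the returned bytes to stdout, rather than the envelope text, is what lets a player recognize the format.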

Test plan

Built locally with bun run build and verified against the live API:

  • mmx speech synthesize --text "..." --stream > out.mp3 produces a valid MP3 (file reports ID3 v2.4 / MPEG ADTS layer III, 128 kbps, 32 kHz).
  • mmx speech synthesize --text "..." --stream | mpv --no-terminal - plays correctly end-to-end.
  • mmx speech synthesize --text "..." --stream | head -c 5 > /dev/null exits 0 with no EPIPE crash.
  • Music streaming path uses the same helper; would appreciate a maintainer sanity-check against a real music generate --stream call since I only smoke-tested speech.

Notes

  • The SSE parser tolerates [DONE] sentinels, comment/keepalive lines, and the optional single space after data: per the SSE spec.
  • No changes to flags, defaults, or non-stream code paths.
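
As a rough sketch of the line-level tolerance listed in these notes (the name `extractDataPayload` is hypothetical and the PR's parser may be structured differently):

```typescript
// Hypothetical helper: classify a single SSE line.
// Returns the payload for a usable `data:` line, otherwise null.
function extractDataPayload(line: string): string | null {
  if (line.startsWith(':')) return null;      // comment / keepalive line
  if (!line.startsWith('data:')) return null; // other fields such as event: or id:
  // Strip exactly one optional space after the colon, no more.
  const payload = line.slice(5).replace(/^ /, '');
  if (payload === '' || payload === '[DONE]') return null; // empty or sentinel
  return payload;
}

console.log(extractDataPayload('data: {"a":1}')); // {"a":1}
console.log(extractDataPayload('data:{"a":1}'));  // {"a":1}
console.log(extractDataPayload(': keepalive'));   // null
console.log(extractDataPayload('data: [DONE]'));  // null
```

Only one leading space is removed, so a payload that genuinely begins with whitespace keeps the rest of it.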

`speech synthesize --stream` and `music generate --stream` were writing
the upstream Server-Sent Events stream verbatim to stdout — JSON envelopes
containing hex-encoded audio chunks — so the documented `| mpv -` pipe
example produced "Failed to recognize file format". The decode logic
already exists for `--out`; the stream path just never used it.

This adds a shared `pipeAudioSseToStdout` helper that parses the SSE
frames, hex-decodes `data.audio`, and writes raw audio bytes to stdout,
buffering across chunk boundaries. Both stream commands now use it.

Also installs an EPIPE handler on stdout so consumers that close the
pipe early (`... | head`, a player exiting, mpv missing) cause a clean
exit instead of an unhandled `'error'` event crash with a Node stack
trace.

Fixes MiniMax-AI#54

Copilot AI review requested due to automatic review settings April 9, 2026 04:45

Copilot AI left a comment

Pull request overview

This PR fixes the --stream behavior for speech/music commands so they output decoded raw audio bytes to stdout (instead of the upstream SSE/JSON/hex envelopes) and adds stdout EPIPE handling to avoid crashes when downstream pipes close early.

Changes:

  • Added a shared streaming helper to parse SSE frames, JSON-decode data: payloads, hex-decode data.audio, and write raw bytes to stdout.
  • Updated speech synthesize --stream and music generate --stream to use the shared helper.
  • Added stdout EPIPE handling so early pipe-close exits cleanly.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/utils/audio-stream.ts | New helper for decoding SSE hex audio to raw stdout + stdout EPIPE handling. |
| src/commands/speech/synthesize.ts | Routes --stream output through the new audio streaming helper. |
| src/commands/music/generate.ts | Routes --stream output through the new audio streaming helper. |

Comment on lines +17 to +23
export function installStdoutEpipeHandler(): void {
  process.stdout.on('error', (err: NodeJS.ErrnoException) => {
    if (err && err.code === 'EPIPE') {
      process.exit(0);
    }
    throw err;
  });

Copilot AI Apr 9, 2026

installStdoutEpipeHandler is described as “one-shot”, but it uses process.stdout.on('error', ...) and is called from pipeAudioSseToStdout, so repeated uses will register multiple listeners (risking MaxListenersExceededWarning in tests or long-lived processes). Consider guarding with a module-level flag and/or using once instead of on.
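
One way the module-level guard could look is sketched below. The flag name `epipeHandlerInstalled` is an assumption; the handler body mirrors the excerpt above.

```typescript
// Assumed module-level flag: the listener is registered at most once per process.
let epipeHandlerInstalled = false;

function installStdoutEpipeHandler(): void {
  if (epipeHandlerInstalled) return; // later calls are no-ops
  epipeHandlerInstalled = true;
  process.stdout.on('error', (err: NodeJS.ErrnoException) => {
    if (err && err.code === 'EPIPE') {
      process.exit(0); // downstream closed the pipe: treat as a normal exit
    }
    throw err; // any other stdout error is still surfaced
  });
}

// Calling it repeatedly leaves a single listener behind.
installStdoutEpipeHandler();
installStdoutEpipeHandler();
```

A flag keeps the listener in place for the whole process lifetime, whereas `once` alone would remove it after the first error of any kind.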

Copilot uses AI. Check for mistakes.
Comment on lines +49 to +55
// SSE events are separated by blank lines.
let sep: number;
while ((sep = buffer.indexOf('\n\n')) >= 0) {
  const event = buffer.slice(0, sep);
  buffer = buffer.slice(sep + 2);
  writeEvent(event);
}

Copilot AI Apr 9, 2026

SSE framing here only looks for \n\n as the event separator. If the upstream stream uses CRLF (\r\n\r\n), events may never be split and writeEvent will receive strings containing \r, causing JSON.parse failures. Consider normalizing \r\n to \n before parsing and/or reusing the existing parseSSE helper in src/client/stream.ts to avoid duplicate (and more spec-complete) SSE parsing logic.
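
The normalization suggested here could be sketched like this. The function `splitSseEvents` is hypothetical; it operates on already-complete text and, unlike a strict SSE parser, also returns a trailing segment that lacks a terminating blank line.

```typescript
// Hypothetical helper: split SSE text into events, accepting LF, CRLF, or bare CR.
function splitSseEvents(text: string): string[] {
  // Collapse every line-ending style to "\n" before looking for blank lines.
  const normalized = text.replace(/\r\n|\r/g, '\n');
  // A run of two or more newlines marks an event boundary; drop empty segments.
  return normalized.split(/\n{2,}/).filter((event) => event.length > 0);
}

console.log(splitSseEvents('data: a\r\n\r\ndata: b\r\n\r\n')); // two events, no "\r" left
console.log(splitSseEvents('data: a\n\ndata: b\n\n'));         // same two events
```

In a streaming parser the same normalization has to be applied with care at chunk boundaries, since a `\r\n` pair can be split across two reads.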

Comment on lines +68 to +80
function writeEvent(event: string): void {
  for (const rawLine of event.split('\n')) {
    if (!rawLine.startsWith('data:')) continue;
    // Per SSE spec, an optional single space after `data:` should be stripped.
    const payload = rawLine.slice(5).replace(/^ /, '');
    if (!payload || payload === '[DONE]') continue;

    let parsed: { data?: { audio?: string } };
    try {
      parsed = JSON.parse(payload);
    } catch {
      // Non-JSON keepalive or comment — skip.
      continue;

Copilot AI Apr 9, 2026

writeEvent parses each data: line independently. Per SSE spec, a single event can contain multiple data: lines that must be concatenated with \n; parsing line-by-line can drop valid payloads (e.g., pretty-printed JSON) and also fails if lines end with \r. Using the shared parseSSE generator (which already concatenates data lines) or trimming rawLine/payload would make this more robust.
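
To illustrate the concatenation rule, here is a small sketch. The name `eventData` is hypothetical; the existing `parseSSE` generator mentioned above is not shown in this thread and may differ.

```typescript
// Hypothetical helper: join all `data:` lines of one event with "\n"
// so that the result can be parsed as a single JSON document.
function eventData(event: string): string {
  const parts: string[] = [];
  for (const rawLine of event.split('\n')) {
    // Tolerate CRLF input by dropping a trailing "\r".
    const line = rawLine.endsWith('\r') ? rawLine.slice(0, -1) : rawLine;
    if (!line.startsWith('data:')) continue;
    parts.push(line.slice(5).replace(/^ /, ''));
  }
  return parts.join('\n');
}

// Pretty-printed JSON spread over three data: lines.
const pretty = 'data: {\ndata:   "data": { "audio": "00ff" }\ndata: }';
console.log(JSON.parse(eventData(pretty)).data.audio); // "00ff"
```

Parsing each of those three lines on its own would fail `JSON.parse` every time and silently drop the audio.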

Comment on lines +83 to +86
const hex = parsed?.data?.audio;
if (typeof hex === 'string' && hex.length > 0) {
  process.stdout.write(Buffer.from(hex, 'hex'));
}

Copilot AI Apr 9, 2026

process.stdout.write(Buffer.from(hex, 'hex')) ignores backpressure. For long streams this can buffer large amounts in memory if stdout can’t keep up. Consider making the write path async and awaiting drain when stdout.write() returns false (similar to how file download handles backpressure elsewhere in the codebase).
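
A backpressure-aware write along these lines might look as follows. This is a sketch under assumptions: `writeWithBackpressure` and `demo` are hypothetical names, and a `PassThrough` with a tiny `highWaterMark` stands in for a slow stdout.

```typescript
import { once } from 'node:events';
import { PassThrough, Writable } from 'node:stream';

// Write one chunk; if the stream reports a full buffer, wait for 'drain'.
async function writeWithBackpressure(out: Writable, chunk: Uint8Array): Promise<void> {
  if (!out.write(chunk)) {
    await once(out, 'drain');
  }
}

// Usage: every write of 8 bytes exceeds the 4-byte highWaterMark,
// so each call has to wait for the sink to drain before continuing.
async function demo(): Promise<number> {
  const sink = new PassThrough({ highWaterMark: 4 });
  let received = 0;
  sink.on('data', (chunk: Buffer) => {
    received += chunk.length;
  });
  for (let i = 0; i < 3; i++) {
    await writeWithBackpressure(sink, Buffer.alloc(8, i));
  }
  sink.end();
  await once(sink, 'end');
  return received; // total bytes seen by the consumer
}
```

`events.once` also rejects if the stream emits `'error'` while waiting, so a closed pipe does not leave the promise pending forever.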

Comment on lines +102 to +108
try {
  await pipeAudioSseToStdout(res.body);
} catch (err) {
  if (err instanceof Error && err.message === 'No response body') {
    throw new CLIError('No response body', ExitCode.GENERAL);
  }
  throw err;

Copilot AI Apr 9, 2026

This error handling relies on matching err.message === 'No response body' from the helper. Using error-message strings for control flow is brittle; consider exporting a dedicated error class/type (e.g., NoResponseBodyError) from audio-stream.ts, or have pipeAudioSseToStdout throw CLIError directly so callers don’t need to pattern-match messages.
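
A dedicated error type could be sketched as below. `NoResponseBodyError` is the name proposed in this comment; `requireBody` is a hypothetical stand-in for the guard inside the helper.

```typescript
class NoResponseBodyError extends Error {
  constructor() {
    super('No response body');
    this.name = 'NoResponseBodyError';
    // Keeps `instanceof` working even when the code is compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical guard: throw the typed error when there is nothing to read.
function requireBody<T>(body: T | null | undefined): T {
  if (body == null) throw new NoResponseBodyError();
  return body;
}

// Callers branch on the type, not on the message text.
try {
  requireBody(null);
} catch (err) {
  if (err instanceof NoResponseBodyError) {
    console.log('would be mapped to a CLIError here');
  } else {
    throw err;
  }
}
```

With a type check, rewording the message later cannot silently break the caller's control flow.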

Comment on lines 100 to +104
  if (flags.stream) {
    const res = await request(config, { url, method: 'POST', body, stream: true });
-   const reader = res.body?.getReader();
-   if (!reader) throw new CLIError('No response body', ExitCode.GENERAL);
-   while (true) {
-     const { done, value } = await reader.read();
-     if (done) break;
-     process.stdout.write(value);
+   try {
+     await pipeAudioSseToStdout(res.body);
+   } catch (err) {

Copilot AI Apr 9, 2026

The --stream path now decodes SSE->JSON->hex->bytes, but there’s no automated test coverage for this behavior. Consider adding a test that mocks an SSE response (the test suite already has sseResponse in test/helpers/mock-server.ts) and asserts that the command writes the expected raw bytes to stdout and stops on [DONE].

Comment on lines +153 to 160
try {
  await pipeAudioSseToStdout(res.body);
} catch (err) {
  if (err instanceof Error && err.message === 'No response body') {
    throw new CLIError('No response body', ExitCode.GENERAL);
  }
  throw err;
}

Copilot AI Apr 9, 2026

This error handling relies on matching err.message === 'No response body' from the helper. Using error-message strings for control flow is brittle; consider exporting a dedicated error class/type (e.g., NoResponseBodyError) from audio-stream.ts, or have pipeAudioSseToStdout throw CLIError directly so callers don’t need to pattern-match messages.

Comment on lines 151 to +155
  if (flags.stream) {
    const res = await request(config, { url, method: 'POST', body, stream: true });
-   const reader = res.body?.getReader();
-   if (!reader) throw new CLIError('No response body', ExitCode.GENERAL);
-   while (true) {
-     const { done, value } = await reader.read();
-     if (done) break;
-     process.stdout.write(value);
+   try {
+     await pipeAudioSseToStdout(res.body);
+   } catch (err) {

Copilot AI Apr 9, 2026

The --stream path now decodes SSE->JSON->hex->bytes, but there’s no automated test coverage for this behavior. Consider adding a test that serves an SSE response via the existing mock server helper and verifies the generated bytes written to stdout for music generate --stream (including chunk-boundary buffering).
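
A chunk-boundary check of the kind suggested here could start from a buffer like the one below. It is a self-contained sketch: `SseEventBuffer` is hypothetical, it does not use the repository's mock server helper, and it assumes LF-only framing.

```typescript
// Hypothetical buffer: accumulates decoded text and returns only complete
// events, i.e. those terminated by a blank line.
class SseEventBuffer {
  private buffer = '';
  private readonly decoder = new TextDecoder();

  push(chunk: Uint8Array): string[] {
    // stream: true keeps a multi-byte character split across chunks intact.
    this.buffer += this.decoder.decode(chunk, { stream: true });
    const events: string[] = [];
    let sep: number;
    while ((sep = this.buffer.indexOf('\n\n')) >= 0) {
      events.push(this.buffer.slice(0, sep));
      this.buffer = this.buffer.slice(sep + 2);
    }
    return events;
  }
}

// One event deliberately split in the middle of a JSON key.
const encoder = new TextEncoder();
const sse = new SseEventBuffer();
const firstBatch = sse.push(encoder.encode('data: {"data":{"aud'));
const secondBatch = sse.push(encoder.encode('io":"00ff"}}\n\ndata: [DONE]\n\n'));
console.log(firstBatch.length);  // 0: nothing is complete yet
console.log(secondBatch.length); // 2: the audio event and the [DONE] sentinel
```

A test for either command would then assert that the bytes written to stdout equal the hex-decoded audio of the reassembled event.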

The MiniMax streaming TTS endpoint emits N incremental SSE events
followed by a final summary event that re-sends the *entire* audio
along with extra_info / metadata (this is what --out saves). Naively
concatenating all events produced an MP3 with the full file appended
after the streamed frames, leaving broken framing in the middle —
ffprobe accepted the file but mpv reported 'mp3float: Header missing /
Error decoding audio'.

Discriminate the summary event by the presence of top-level
'extra_info' (every event carries 'trace_id', so it can't be used).
Verified end-to-end: piping --stream into mpv now plays cleanly with
no decoder warnings, and the resulting bytes contain only the
incremental frames.

moxi000 commented Apr 9, 2026

Pushed a follow-up fix (84920bc) after reproducing a remaining decode error reported by mpv:

[ffmpeg/audio] mp3float: Header missing
Error decoding audio.

Root cause: the streaming endpoint emits N incremental SSE events followed by a terminal "summary" event that re-sends the entire audio plus metadata (this is what --out saves). The first version of this PR concatenated all events, so the complete file was appended after the incremental frames — ffprobe accepted the result but mpv's decoder choked on the broken framing where the two segments meet.

Captured raw SSE structure for mmx speech synthesize --text "Stream me" --stream:

ev1 topKeys=['base_resp','data','trace_id']                dataKeys=['audio','ced','status'] audioLen=9306
ev2 topKeys=['base_resp','data','trace_id']                dataKeys=['audio','ced','status'] audioLen=24192
ev3 topKeys=['base_resp','data','trace_id']                dataKeys=['audio','ced','status'] audioLen=4608
ev4 topKeys=['base_resp','data','trace_id']                dataKeys=['audio','ced','status'] audioLen=0
ev5 topKeys=['base_resp','data','extra_info','trace_id']   dataKeys=['audio','ced','status'] audioLen=40296   ← summary

Note that trace_id is on every event, so it can't be used as the discriminator — only extra_info is unique to the summary event. The fix skips events with top-level extra_info.
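
The discriminator can be sketched like this. The event shape is taken from the capture above, while `collectIncrementalAudio` is a hypothetical name rather than the helper's real function.

```typescript
// Event shape assumed from the captured SSE structure above.
interface StreamEvent {
  data?: { audio?: string };
  extra_info?: unknown;
  trace_id?: string;
}

// Concatenate only the incremental audio, skipping the terminal summary event.
function collectIncrementalAudio(events: StreamEvent[]): Buffer {
  const parts: Buffer[] = [];
  for (const ev of events) {
    if ('extra_info' in ev) continue; // summary event re-sends the whole file
    const hex = ev.data?.audio;
    if (typeof hex === 'string' && hex.length > 0) {
      parts.push(Buffer.from(hex, 'hex'));
    }
  }
  return Buffer.concat(parts);
}

const audio = collectIncrementalAudio([
  { trace_id: 't', data: { audio: '0102' } },
  { trace_id: 't', data: { audio: '' } },                         // empty chunk
  { trace_id: 't', extra_info: {}, data: { audio: '01020304' } }, // summary
]);
console.log(audio.toString('hex')); // "0102"
```

Checking for the presence of the key, not its value, matches the observation that only the summary event carries `extra_info`.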

Re-verified after rebuild:

  • mmx speech synthesize --text "Stream me" --stream | mpv - plays cleanly, no Header missing / Error decoding audio
  • mpv on the captured .mp3 file: no decoder warnings
  • ... | head -c 5 still exits 0 (EPIPE handling intact)

Worth confirming with maintainers whether the same envelope shape applies to music generate --stream (the helper is shared, so the same fix applies if so).

- Reuse the existing parseSSE generator from src/client/stream.ts
  instead of hand-rolling SSE framing. parseSSE already handles
  multi-line data fields, comments, and trailing buffer flush
  correctly, eliminating duplicated and less spec-complete parsing.
- Make installStdoutEpipeHandler idempotent via a module-level flag,
  so repeated calls don't register multiple listeners and trigger
  MaxListenersExceededWarning in tests or long-lived processes.
- Export a dedicated NoResponseBodyError class instead of relying on
  err.message string matching for control flow in callers.
- Honor stdout backpressure: await 'drain' when write() returns false,
  so long streams don't accumulate large buffers in memory.
- pipeAudioSseToStdout now takes the Response directly (parseSSE owns
  the body reader), simplifying both call sites.

moxi000 commented Apr 9, 2026

Pushed 070699a addressing the Copilot review:

Skipped for this round:

Re-verified end-to-end after rebuild:

  • mmx speech synthesize --text "Stream me" --stream | mpv - → plays cleanly, no decoder warnings
  • mpv on the captured .mp3 file → no warnings
  • mmx speech synthesize ... --stream | head -c 5 > /dev/null → exit 0, no EPIPE crash

ryanlee666 (Collaborator)

Thanks for the commit. Please resolve the conflicts and commit again.

Development

Successfully merging this pull request may close these issues.

speech synthesize --stream outputs raw SSE/JSON instead of decoded audio bytes
