feat: add video SEF/subject-ref, image seed/size, speech clone/design, music-2.5+ #66

Closed
raylanlin wants to merge 0 commits into MiniMax-AI:main from raylanlin:main

raylanlin commented Apr 9, 2026

Summary

This PR adds new features to the video, image, speech, and music generation commands, fixes seven bugs, and expands the test suite.

18 files changed, +1197/-143 lines, 11 commits, 109 tests pass


New Features

🎬 Video Generation — src/commands/video/generate.ts (+54 lines)

--last-frame <path-or-url> — SEF (Start-End Frame) Interpolation

Generates a video that smoothly transitions between a start frame (from prompt) and an end frame (provided image).

mmx video generate \
  --prompt "A flower blooming in spring garden" \
  --last-frame ./end-frame.jpg
  • Automatically switches model to hailuo-02 (required for SEF mode)
  • Accepts local file path or URL
  • Note: Explicit --model hailuo-02 is still required if your plan doesn't include this model

--subject-image <path-or-url> — Subject-to-Video (S2V)

Keeps a character/object consistent throughout the generated video.

mmx video generate \
  --prompt "walking through a neon-lit cyberpunk city" \
  --subject-image ./character.png
  • Automatically switches model to s2v-01 (required for S2V mode)
  • File uploads to /v1/files/upload automatically
  • Note: Explicit --model s2v-01 is still required if your plan doesn't include this model

Model Override Priority

Explicit --model flag now takes priority over automatic model switching. If you specify --model hailuo-01 with --last-frame, it will try to use hailuo-01 (and fail if the API doesn't support it), rather than silently switching.
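
The priority rule above can be sketched as a small pure function. This is only an illustration under stated assumptions: the `VideoFlags` shape, the `resolveVideoModel` name, and the `defaultModel` parameter are hypothetical and not taken from generate.ts, and the order in which `--last-frame` and `--subject-image` are checked when both are present is a guess.

```typescript
// Hypothetical flag shape; the real option parsing in generate.ts may differ.
interface VideoFlags {
  model?: string;
  lastFrame?: string;
  subjectImage?: string;
}

// defaultModel is a parameter because the PR does not state the CLI's default model.
function resolveVideoModel(flags: VideoFlags, defaultModel: string): string {
  // An explicit --model always wins, even if the API later rejects the combination.
  if (flags.model !== undefined) return flags.model;
  // SEF mode: auto-switch to hailuo-02.
  if (flags.lastFrame !== undefined) return "hailuo-02";
  // S2V mode: auto-switch to s2v-01.
  if (flags.subjectImage !== undefined) return "s2v-01";
  // No mode-specific flag: keep the default.
  return defaultModel;
}
```

Keeping the decision in one function makes the "explicit flag first" rule easy to unit-test without calling the API.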


🖼️ Image Generation — src/commands/image/generate.ts (+34 lines)

--seed <n> — Reproducible Generation

mmx image generate --prompt "A sunset" --seed 42 --out a.png
mmx image generate --prompt "A sunset" --seed 42 --out b.png
# a.png and b.png are identical (MD5 match)

--width <px> / --height <px> — Custom Dimensions

mmx image generate --prompt "Wide banner" --width 2048 --height 512
  • Range: [512, 2048], must be multiple of 8
  • Overrides --aspect-ratio when both are set
  • Only effective for image-01 model
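
A minimal sketch of the dimension check described above, assuming the range is inclusive at both ends. The function name, constants, and error wording are illustrative and not copied from the PR's code.

```typescript
const MIN_DIMENSION = 512;
const MAX_DIMENSION = 2048;
const DIMENSION_STEP = 8;

// Returns an error message for an invalid value, or null when the value is acceptable.
function validateDimension(name: string, value: number): string | null {
  if (!Number.isInteger(value)) {
    return `${name} must be an integer`;
  }
  if (value < MIN_DIMENSION || value > MAX_DIMENSION) {
    return `${name} must be in [${MIN_DIMENSION}, ${MAX_DIMENSION}]`;
  }
  if (value % DIMENSION_STEP !== 0) {
    return `${name} must be a multiple of ${DIMENSION_STEP}`;
  }
  return null;
}
```

With these rules, the `--width 2048 --height 512` example passes, while a value such as 513 or 2056 is rejected before any request is sent.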

--prompt-optimizer — AI Prompt Enhancement

mmx image generate --prompt "A cat" --prompt-optimizer
# Prompt is sent through LLM enhancement before generation

--aigc-watermark — AI Content Watermark

mmx image generate --prompt "Logo design" --aigc-watermark
# Adds standard AI generation watermark per Chinese regulations

🗣️ TTS — New Commands

speech clone — Voice Cloning (src/commands/speech/clone.ts, 110 lines)

Clone a voice from an audio sample.

mmx speech clone --audio ./my-voice.mp3 --name "My Voice"
# → voice_id returned for use with speech synthesize

mmx speech clone --audio ./sample.wav --name "Carol" --quiet
# Output: vc_xxxxxxxx
  • Uploads audio to /v1/files/upload first, then calls voice_clone API
  • Supports mp3, wav, ogg formats
  • Returns voice_id for use with mmx speech synthesize --voice <voice_id>
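
Since only mp3, wav, and ogg are listed as supported, a client-side pre-check before the upload step could look like the sketch below. Whether clone.ts performs such a check is not stated in the PR; the `isSupportedAudio` helper is hypothetical and only inspects the file extension.

```typescript
// Formats listed as supported in the PR description.
const SUPPORTED_AUDIO_EXTENSIONS = ["mp3", "wav", "ogg"];

// Hypothetical helper: checks the extension only, not the actual file contents.
function isSupportedAudio(path: string): boolean {
  const dot = path.lastIndexOf(".");
  // No dot at all, or a trailing dot, means there is no extension to check.
  if (dot === -1 || dot === path.length - 1) return false;
  const extension = path.slice(dot + 1).toLowerCase();
  return SUPPORTED_AUDIO_EXTENSIONS.indexOf(extension) !== -1;
}
```

Failing early on an unsupported extension would avoid uploading a file that the voice_clone call is likely to reject anyway.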

speech design — Voice Design (src/commands/speech/design.ts, 70 lines)

Create a voice from a text description.

mmx speech design --prompt "Warm female voice, slightly raspy, suitable for audiobook narration"
# → voice_id returned for use with speech synthesize

mmx speech design --prompt "Deep male narrator voice" --gender male
  • Calls voice_design API with text description
  • Optional --gender hint (male/female)
  • Returns voice_id for use with mmx speech synthesize --voice <voice_id>

🎵 Music Generation — src/commands/music/generate.ts (+127 lines)

music-2.5+ Model with Native Instrumental Support

# Instrumental music — no lyrics needed
mmx music generate --prompt "Cinematic orchestral, building tension" --instrumental

# With lyrics (music-2.5+)
mmx music generate \
  --prompt "Indie folk, melancholic" \
  --lyrics "[Verse]
Rain on the window pane
[Chorus]
I'm waiting for the sun to come"

--lyrics-optimizer — AI-Generated Lyrics

mmx music generate --prompt "Upbeat pop song about summer" --lyrics-optimizer --out summer.mp3
# Lyrics auto-generated from prompt, then used for music generation

--output-format url — Direct Download URL

mmx music generate --prompt "Lo-fi hip hop" --instrumental --output-format url
# → URL returned (24h expiry — download promptly!)

Expanded Lyric Tags (14 Total)

| Tag | Usage |
|---|---|
| [Intro] | Song opening |
| [Verse] | Main narrative section |
| [Pre Chorus] | Build-up to chorus |
| [Chorus] | Hook/main melody |
| [Interlude] | Instrumental break |
| [Bridge] | Contrasting section |
| [Outro] | Song ending |
| [Post Chorus] | After chorus |
| [Transition] | Section bridge |
| [Break] | Rhythmic pause |
| [Hook] | Catchy repeated phrase |
| [Build Up] | Tension building |
| [Inst] | Instrumental section |
| [Solo] | Solo performance |

⚠️ Tags must be clean — no descriptions inside brackets. [Verse: piano] will be sung as lyrics.
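
The "clean tags" rule lends itself to a quick pre-flight check on the lyrics text. The sketch below is not part of the PR: `findInvalidTags` is a hypothetical helper, and it assumes tags are matched exactly, including capitalization and spacing, as they appear in the 14-tag list.

```typescript
// The 14 tags listed in the PR description.
const ALLOWED_TAGS = [
  "Intro", "Verse", "Pre Chorus", "Chorus", "Interlude", "Bridge", "Outro",
  "Post Chorus", "Transition", "Break", "Hook", "Build Up", "Inst", "Solo",
];

// Returns every bracketed token whose content is not exactly one of the allowed tags.
function findInvalidTags(lyrics: string): string[] {
  const invalid: string[] = [];
  // Match "[...]" on a single line; the capture group holds the text inside the brackets.
  const pattern = /\[([^\]\n]*)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(lyrics)) !== null) {
    const inner = match[1] ?? "";
    if (ALLOWED_TAGS.indexOf(inner) === -1) {
      invalid.push(match[0]);
    }
  }
  return invalid;
}
```

For example, `findInvalidTags("[Verse: piano]\nRain on the window pane")` reports `[Verse: piano]`, which would otherwise be sung as lyrics.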


Bug Fixes

| # | File | Bug | Fix |
|---|---|---|---|
| 1 | src/client/endpoints.ts | File upload endpoint /v1/files returned 404 | Changed to /v1/files/upload |
| 2 | src/output/audio.ts | extra_info field names didn't match API response | audio_length → music_duration, audio_size → music_size, audio_sample_rate → music_sample_rate |
| 3 | src/registry.ts | Compile wrapper pointed to old dist/minimax.mjs | Updated to dist/mmx.mjs |
| 4 | src/commands/music/generate.ts | --instrumental with "无歌词" ("no lyrics") still sent lyrics field | Set lyrics = undefined when using "无歌词" |
| 5 | src/commands/video/generate.ts | Explicit --model was overwritten by auto-switch | Explicit flag now takes priority |
| 6 | src/commands/music/generate.ts | URL output went to stderr | Changed to console.log (stdout) |
| 7 | src/commands/music/generate.ts | Auto-truncation of prompt/lyrics hid API errors | Removed truncation, let API handle validation |
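
Fix 4 relies on a property of JSON serialization: a key whose value is `undefined` is omitted by `JSON.stringify`. The sketch below illustrates that behavior, assuming the request body is serialized this way. The `MusicBodySketch` shape and the `buildMusicBody` helper are hypothetical and much smaller than the real request.

```typescript
// Hypothetical, reduced request shape used only for this illustration.
interface MusicBodySketch {
  prompt: string;
  lyrics: string | undefined;
}

function buildMusicBody(
  prompt: string,
  lyrics: string | undefined,
  instrumental: boolean,
): MusicBodySketch {
  // For instrumental requests, drop any lyrics value, including the "无歌词" placeholder.
  return { prompt, lyrics: instrumental ? undefined : lyrics };
}

// JSON.stringify skips keys whose value is undefined, so no lyrics field is sent.
const instrumentalBody = JSON.stringify(buildMusicBody("Lo-fi hip hop", "无歌词", true));
```

Here `instrumentalBody` contains only the `prompt` key, which matches the intent of the fix.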

Tests

Coverage: 109 pass / 0 fail across 25 test files

| Test File | Tests | Status |
|---|---|---|
| test/commands/image/generate.test.ts | 17 (expanded from 5) | pass |
| test/commands/video/generate.test.ts | 10 (expanded from 2) | pass |
| test/commands/speech/clone.test.ts | 7 (new) | pass |
| test/commands/speech/design.test.ts | 7 (new) | pass |
| Other existing tests | 68 | pass |

Test Highlights

  • Seed reproducibility: same seed + prompt → identical MD5 hash
  • Dimension validation: rejects non-multiples of 8, out-of-range values
  • Mutual exclusivity: --width + --aspect-ratio → warning
  • Model auto-switching: --last-frame → hailuo-02, --subject-image → s2v-01
  • Explicit model override: --model hailuo-01 --last-frame → uses hailuo-01 (not auto-switched)
  • Voice clone flow: file upload → voice_clone API call
  • Voice design flow: text description → voice_design API call

Documentation

skill/SKILL.md (+160 lines)

  • Full parameter tables for all new features
  • Usage examples with real-world scenarios
  • Piping patterns for agent workflows
  • Lyrics structure tag conventions with warnings
  • Model compatibility matrix

README.md (+18 lines)

  • Updated feature list
  • New command examples for all modules
  • Quick start section with common workflows

--help (all commands)

  • music: 14 lyric tags list, "no descriptions in tags" warning, 3500 char limit
  • image: seed reproducibility note, width/height range [512,2048], 8-multiple requirement
  • video: SEF mode explanation, subject-image model requirements
  • speech synthesize: model characteristics, speed/volume/pitch ranges, supported formats
  • text chat: temperature range 0.0-1.0 (default 0.7), top-p default 0.95, OpenAI compatibility for tools
  • search query: natural language support, pipeline examples

API Reference

All features verified against official MiniMax API documentation.


Commits

| Commit | Type | Description |
|---|---|---|
| 281c6d2 | skill | Lyrics structure tag conventions and album song example |
| 4a04cc4 | feat | music-2.5+ support, lyrics-optimizer, output-format url |
| 3800064 | fix | Remove auto-truncation, let API handle length errors |
| fb7ddd0 | fix | Clean up music-2.5+ instrumental, remove duplicate warnings |
| 41d762b | fix | URL output to stdout, extra_info field names, compile wrapper |
| 7e7fb50 | feat | Video SEF/subject-ref, image seed/size/optimizer, speech clone/design |
| 847c9b4 | fix | File upload endpoint /v1/files → /v1/files/upload |
| 1e2462c | test | Comprehensive tests + video model override fix |
| f928de2 | docs | SKILL.md, README.md, --help sync for all features |
| 233ff94 | docs | Improve --help for music/image/video with detailed descriptions |
| e98eb11 | docs | Improve --help for text chat/search/speech synthesize |
