Run SenseVoiceSmall on the llama.cpp / ggml stack — CPU, edge, a single binary, no Python at runtime. Like whisper.cpp, but for SenseVoice.
SenseVoiceSmall normally runs on PyTorch / ONNX / libtorch. This runtime ports it to ggml + GGUF so it can run CPU-only, offline, embedded in a C/C++ app, with quantized weights. Use it on laptops / phones / edge boxes where there is no GPU and no Python. (For high-QPS GPU serving, the PyTorch/vLLM path is still the way.)
SenseVoiceSmall = SAN-M encoder (70 layers) + CTC head — no LLM, no autoregression. The whole pipeline runs in C++:
audio.wav (16k mono)
│ kaldi 80-mel fbank + LFR (C++)
▼
features [T, 560]
│ prepend 4 query tokens [lang, event, emotion, itn]
▼
[4 + T, 560]
│ SAN-M encoder (ggml) ── sensevoice-small.gguf
▼
encoder out [4+T, 512]
│ CTC head (Linear 512→25055) → greedy CTC (argmax, dedup, drop blank)
▼
token ids
│ SentencePiece detok (detok.py)
▼
<|zh|><|NEUTRAL|><|Speech|><|woitn|> transcription...
The SAN-M encoder is the same architecture as Fun-ASR-Nano's, so the ggml forward is shared between the two runtimes.
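The two non-neural steps in the diagram above, LFR stacking and greedy CTC decoding, can be sketched in a few lines of Python. This is an illustration only, not the runtime's C++ code. The window size m=7 follows from 560 = 7 × 80; the stride n=6, the edge padding scheme, and blank_id=0 are assumptions that the text above does not state. The function names are hypothetical.

```python
import math


def apply_lfr(frames, m=7, n=6):
    """Low frame rate: stack m consecutive frames, advance by n frames.

    frames: list of T feature vectors (lists of floats).
    m=7 is implied by 560 = 7 * 80; n=6 and the padding are assumptions.
    """
    if not frames:
        return []
    # Left-pad with copies of the first frame so the first window is centred on it.
    left = (m - 1) // 2
    padded = [frames[0]] * left + list(frames)
    out = []
    for i in range(math.ceil(len(frames) / n)):
        window = padded[i * n : i * n + m]
        # Right-pad the final window by repeating the last frame.
        window += [frames[-1]] * (m - len(window))
        out.append([x for frame in window for x in frame])
    return out


def greedy_ctc(logits, blank_id=0):
    """Per-frame argmax, collapse repeats, drop blanks (blank_id is assumed)."""
    ids = []
    prev = None
    for frame in logits:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != prev and best != blank_id:
            ids.append(best)
        prev = best
    return ids


# 100 frames of 80-dim fbank -> ceil(100 / 6) = 17 frames of 560 dims.
fbank = [[float(t)] * 80 for t in range(100)]
lfr = apply_lfr(fbank)
print(len(lfr), len(lfr[0]))  # 17 560


def one_hot(i, size=6):
    return [1.0 if j == i else 0.0 for j in range(size)]


# Per-frame argmax 0 3 3 0 3 5 5: the blank between the two 3s keeps both.
toy = [one_hot(i) for i in (0, 3, 3, 0, 3, 5, 5)]
print(greedy_ctc(toy))  # [3, 3, 5]
```

A repeated token survives only when a blank (or another token) separates the repeats, which is why the toy example keeps both 3s.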
1. Build:
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cp -r /path/to/runtime/llama.cpp/funasr-sensevoice examples/
echo 'add_subdirectory(funasr-sensevoice)' >> examples/CMakeLists.txt
cmake -B build -DGGML_NATIVE=ON -DLLAMA_CURL=OFF
cmake --build build -j --target llama-funasr-sensevoice
2. Convert weights (needs the checkpoint, e.g. FunAudioLLM/SenseVoiceSmall):
python runtime/llama.cpp/export_sensevoice_gguf.py \
--model_pt <model>/model.pt --mvn <model>/am.mvn \
--out sensevoice-small.gguf # f32, ~936 MB
python runtime/llama.cpp/export_sensevoice_gguf.py --wtype f16 \
--model_pt <model>/model.pt --mvn <model>/am.mvn \
--out sensevoice-small-f16.gguf # half size
3. Transcribe:
build/bin/llama-funasr-sensevoice -m sensevoice-small.gguf -a audio.wav # prints transcription text
# --keep-tags keeps the <|lang|>/<|emotion|>/<|event|> tags; --ids prints raw CTC ids
Expected output:
我想问我在滨海新区有房我一直没有照顾孩子...你觉得这是正常的想法吗
With --keep-tags, the leading <|...|> tags are the predicted language / emotion / event / ITN.
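When the tags are kept, a downstream script may want them separated from the text. A minimal sketch, assuming the output line has the shape shown in the pipeline diagram (a run of <|...|> tags followed by the transcription); split_tags is a hypothetical helper, not part of the runtime:

```python
import re

# One tag looks like <|zh|> or <|NEUTRAL|>.
TAG = re.compile(r"<\|([^|]*)\|>")


def split_tags(line):
    """Return (tags, text) for a line such as '<|zh|><|NEUTRAL|> hello'."""
    tags = []
    pos = 0
    while True:
        m = TAG.match(line, pos)  # only consume tags at the start of the line
        if m is None:
            break
        tags.append(m.group(1))
        pos = m.end()
    return tags, line[pos:].strip()


tags, text = split_tags("<|zh|><|NEUTRAL|><|Speech|><|woitn|> transcription...")
print(tags)  # ['zh', 'NEUTRAL', 'Speech', 'woitn']
print(text)  # transcription...
```

Only leading tags are consumed, so any <|...|> sequence that appears later in the text is left in place.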
- CTC token ids (C++) vs PyTorch: identical (108/108 on a benchmark clip).
- Detokenized text: matches the FunASR AutoModel output exactly.
- Encoder validated against PyTorch (shared with Fun-ASR-Nano runtime): cosine 1.0.
- Encode time ≈ 1.3 s on CPU for a 44 s clip.
- No CMVN at inference. SenseVoice inference() feeds the raw log-mel fbank to the encoder; it does not apply am.mvn. Applying CMVN makes the model predict <|nospeech|>. (The export script reads am.mvn for completeness but the runtime does not use it.)
- Query tokens (4) are prepended from embed.weight, default indices [language=auto(0), event=1, emotion=2, textnorm=woitn(15)]. Change them for a fixed language or to enable ITN (withitn=14).
- WAV input assumes 16 kHz mono PCM16.
- LayerNorm eps = 1e-5; FSMN = exact f32 shift-accumulate; fbank matches torchaudio.
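Because the runtime assumes 16 kHz mono PCM16, it can help to check a file before passing it to the binary. A minimal sketch using only Python's standard wave module; check_wav is a hypothetical helper, not part of the runtime:

```python
import os
import tempfile
import wave


def check_wav(path):
    """Return a list of problems; empty if the file is 16 kHz mono PCM16."""
    problems = []
    # wave.open raises wave.Error for non-PCM files (e.g. float WAV).
    with wave.open(path, "rb") as w:
        if w.getframerate() != 16000:
            problems.append(f"sample rate is {w.getframerate()} Hz, expected 16000")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected 1 (mono)")
        if w.getsampwidth() != 2:
            problems.append(f"{8 * w.getsampwidth()}-bit samples, expected 16-bit")
    return problems


# Demo: write one second of silence in the expected format, then check it.
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
print(check_wav(path))  # []
```

Files that fail the check can be converted beforehand with a tool such as ffmpeg (for example with -ar 16000 -ac 1 -c:a pcm_s16le).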
funasr-sensevoice/ ggml runtime: WAV → CTC token ids
export_sensevoice_gguf.py export encoder + CTC head + query embeddings to GGUF
detok.py SentencePiece id → text (bpe model ships with the checkpoint)
- Not yet implemented: built-in SentencePiece detok (to drop the Python step), arbitrary WAV formats, encoder Q8 quantization, timestamps.