- Apr 29, 2026: 🚀 We release the technical report of Step-Audio-R1.5 (ArXiv; PDF).
- Apr 29, 2026: 📦 We open-source three in-house benchmarks from Step-Audio-R1.5 under `benchmarks/Step-Audio-R1.5/`: `step_caption`, `step_spqa`, and `step_dialogue_understanding`.
- Jan 14, 2026: 🚀 We release the inference code and model weights of Step-Audio-R1.1 (HuggingFace; ModelScope).
- Nov 27, 2025: 🎉 We release the inference code and model weights of Step-Audio-R1 (HuggingFace; ModelScope).
- Nov 27, 2025: 🎮 We release the HF Space Playground.
- Nov 19, 2025: 🎉 We release the Demo Page.
- Nov 19, 2025: 👋 We release the technical report of Step-Audio-R1.
- Inference Code (vLLM)
- Online demo (Gradio)
- Model Checkpoints
We release three standalone evaluation benchmarks from Step-Audio-R1.5 under benchmarks/Step-Audio-R1.5/:
- `step_caption`
- `step_spqa`
- `step_dialogue_understanding`

Please see `benchmarks/Step-Audio-R1.5/README.md` and the benchmark-local READMEs for dataset details.
- Technical report release (ArXiv; PDF)
- Open-source benchmark release (`benchmarks/Step-Audio-R1.5/`)
- Inference code for Step-Audio-R1.5
Step-Audio-R1.5 is our latest audio reasoning model, described in the technical report and mirrored in this repository as Step-Audio-R1.5.pdf.
While recent reasoning-oriented audio models benefit from extended Chain-of-Thought on objective benchmarks, they are often optimized with verifiable reward signals that compress rich auditory interaction into isolated labels. Step-Audio-R1.5 is designed to move beyond that limitation: instead of only maximizing benchmark correctness, it aims to preserve prosodic naturalness, emotional continuity, and immersive long-turn spoken interaction.
Step-Audio-R1.5 marks a shift from purely verifiable-reward-style optimization toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning.
This transition is motivated by a simple observation: strong objective scores alone do not guarantee a natural conversational experience. By incorporating preference-driven alignment for spoken interaction, Step-Audio-R1.5 maintains robust analytical reasoning while substantially improving the overall interactive feel of long-form audio dialogue.
The figure below compares the average score across all eight speech-to-text benchmarks. Step-Audio-R1.5 substantially improves over Step-Audio-R1 and remains highly competitive with other mainstream large models.
Step-Audio-R1.1 (Realtime) is a major upgrade to Step-Audio-R1, designed for interactive spoken dialogue with both real-time responsiveness and strong reasoning capability.
Unlike conventional streaming speech models that trade intelligence for latency, R1.1 enables thinking while speaking, achieving high intelligence without sacrificing speed.
Based on the research Mind-Paced Speaking, the Realtime variant adopts a Dual-Brain Architecture:
- A Formulation Brain responsible for high-level reasoning
- An Articulation Brain dedicated to speech generation
This decoupling allows the model to perform Chain-of-Thought reasoning during speech output, maintaining ultra-low latency while handling complex tasks in real time.
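To make the decoupling concrete, here is a minimal, runnable sketch of how a dual-brain hand-off could be organized. The function names (`formulation_brain`, `articulation_brain`) and the queue-based coupling are illustrative assumptions, not the released Step-Audio-R1.1 implementation.

```python
import asyncio

# Hypothetical sketch of the dual-brain decoupling. Names and the
# queue-based hand-off are illustrative assumptions only.

async def formulation_brain(question: str, plan_q: asyncio.Queue) -> None:
    """Produces high-level reasoning steps (the slow 'thinking' path)."""
    for step in ("parse the question", "recall facts", "draft the answer"):
        await asyncio.sleep(0.3)      # stands in for chain-of-thought decoding
        await plan_q.put(step)
    await plan_q.put(None)            # signal that reasoning is finished

async def articulation_brain(plan_q: asyncio.Queue) -> None:
    """Streams speech without waiting for the reasoning to finish."""
    print("[speech] Let me think about that...")   # immediate, low latency
    while (plan := await plan_q.get()) is not None:
        print(f"[speech] ...speaking content conditioned on: {plan}")

async def main() -> None:
    plan_q: asyncio.Queue = asyncio.Queue()
    # Run both brains concurrently: thinking while speaking.
    await asyncio.gather(
        formulation_brain("What causes tides?", plan_q),
        articulation_brain(plan_q),
    )

asyncio.run(main())
```

The key property the sketch illustrates: the articulation path begins producing output immediately, and later output is conditioned on whatever reasoning has arrived so far, so latency stays low without capping deliberation.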
To address the inverted scaling issue, where reasoning over transcripts can degrade performance, Step-Audio-R1.1 grounds its reasoning directly in acoustic representations rather than text alone.
Through iterative self-distillation, extended deliberation becomes a strength instead of a liability. This enables effective test-time compute scaling and leads to state-of-the-art performance, including top-ranking results on the AA benchmark.
Step-Audio-R1 is the first audio language model to successfully unlock test-time compute scaling. It decisively solves the "inverted scaling" anomaly plaguing existing models, where performance paradoxically degrades with longer reasoning chains.
We identify the root cause of this failure as Textual Surrogate Reasoning: conventional models, due to text-based initialization, rely on linguistic abstractions (analyzing transcripts) rather than genuine acoustic properties. To resolve this modality mismatch, we introduce Modality-Grounded Reasoning Distillation (MGRD), an iterative training framework that shifts the model's reasoning focus from textual surrogates to acoustic analysis.
This new approach allows us to create Step-Audio-R1, which:
- Is the first audio reasoning model that successfully benefits from test-time compute scaling.
- Surpasses Gemini 2.5 Pro and is comparable to Gemini 3 across comprehensive audio benchmarks.
- Transforms extended deliberation from a liability into a powerful asset for audio intelligence.
Step-Audio-R1 builds on the architecture of our previous Step-Audio 2 and consists of three main components (a minimal sketch follows the list):
- Audio Encoder: We use the pre-trained Qwen2 audio encoder. It operates at a 25 Hz frame rate and is frozen during training.
- Audio Adaptor: A simple adaptor (identical to Step-Audio 2) connects the encoder to the LLM and downsamples the feature frame rate to 12.5 Hz.
- LLM Decoder: We use Qwen2.5 32B as the core reasoning component. It directly takes the latent audio features from the adaptor to generate a purely textual output (first the reasoning, then the final reply).
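As a rough illustration of the data flow through these components, the sketch below shows only the adaptor's 2x temporal downsampling (25 Hz to 12.5 Hz) and the tensor shapes involved. The layer choice and hidden sizes are assumptions for clarity, not the released configuration.

```python
import torch
import torch.nn as nn

# Illustrative data-flow sketch only: the adaptor layer choice and the
# hidden sizes below are assumptions, not the released configuration.

class AudioAdaptor(nn.Module):
    """Maps encoder features into the LLM embedding space and halves the
    frame rate (25 Hz -> 12.5 Hz), here via a stride-2 temporal conv."""
    def __init__(self, enc_dim: int = 1280, llm_dim: int = 5120):
        super().__init__()
        self.downsample = nn.Conv1d(enc_dim, llm_dim, kernel_size=2, stride=2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames @ 25 Hz, enc_dim)
        return self.downsample(feats.transpose(1, 2)).transpose(1, 2)

# 10 s of audio at the encoder's 25 Hz frame rate -> 250 feature frames.
enc_out = torch.randn(1, 250, 1280)   # stands in for the frozen encoder
llm_in = AudioAdaptor()(enc_out)
print(llm_in.shape)                   # torch.Size([1, 125, 5120]) @ 12.5 Hz
# llm_in is then consumed directly by the Qwen2.5 32B decoder, which emits
# text: first the reasoning, then the final reply.
```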
The key innovation is our training method, Modality-Grounded Reasoning Distillation (MGRD). This process iteratively refines the model's thoughts, progressively strengthening their connection to the underlying audio features until they evolve into "native audio thinking."
This ensures the model's reasoning is not merely about the transcribed text but is deeply grounded in the acoustic nuances of the audio itself.
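In spirit, one MGRD round generates chain-of-thought traces, keeps only those that are correct and cite acoustic evidence, and fine-tunes on the survivors. The sketch below is a heavily simplified rendering of that loop; every name in it (`Trace`, `finetune`, the grounding check) is a hypothetical placeholder, not the released training code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Heavily simplified sketch of an iterative MGRD-style loop. All names are
# hypothetical placeholders used for illustration.

@dataclass
class Trace:
    reasoning: str         # the chain-of-thought produced by the model
    answer: str            # the final answer extracted from the trace
    cites_acoustics: bool  # does the reasoning reference acoustic evidence
                           # (prosody, timbre, events) vs. a transcript?

def mgrd(model: Callable, dataset: List[Tuple[object, str, str]],
         finetune: Callable, num_iterations: int = 3):
    """Each round keeps only correct, acoustically grounded traces and
    fine-tunes on them, shifting reasoning away from textual surrogates."""
    for _ in range(num_iterations):
        kept = []
        for audio, question, answer in dataset:
            trace: Trace = model(audio, question)  # decode a reasoning trace
            if trace.cites_acoustics and trace.answer == answer:
                kept.append((audio, question, trace.reasoning))
        model = finetune(model, kept)  # distill on the filtered traces
    return model
```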
- GPU: NVIDIA GPUs with CUDA support (tested on 4×L40S/H100/H800/H20).
- Operating System: Linux.
- Python: >= 3.10.0.
First, you need to download the Step-Audio-R1 model weights.
Method A · Git LFS

```bash
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-R1
```

Method B · Hugging Face CLI

```bash
hf download stepfun-ai/Step-Audio-R1 --local-dir ./Step-Audio-R1
```

We provide two ways to serve the model: Docker (recommended) or compiling the customized vLLM backend.
A customized vLLM image is required.
- Pull the image:

```bash
docker pull stepfun2025/vllm:step-audio-2-v20250909
```

- Start the service, assuming the model is downloaded to the `Step-Audio-R1` folder in the current directory:

```bash
docker run --rm -ti --gpus all \
  -v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
  -p 9999:9999 \
  stepfun2025/vllm:step-audio-2-v20250909 \
  -- vllm serve /Step-Audio-R1 \
  --served-model-name Step-Audio-R1 \
  --port 9999 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --tensor-parallel-size 4 \
  --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
  --enable-log-requests \
  --interleave-mm-strings \
  --trust-remote-code
```
After the service starts, it will listen on localhost:9999.
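Once the server is listening, you can sanity-check it with any OpenAI-compatible client. Below is a minimal text-only smoke test; for audio inputs, see the example script later in this README. The `api_key` value is a placeholder (vLLM does not check it unless authentication is configured).

```python
# Minimal text-only smoke test of the served endpoint. vLLM exposes an
# OpenAI-compatible API; the api_key value here is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Step-Audio-R1",  # must match --served-model-name
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```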
Step-Audio-R1 requires a customized vLLM backend.
- Download the source code:

```bash
git clone https://github.com/stepfun-ai/vllm.git
cd vllm
```

- Prepare the environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

- Install and compile: vLLM contains both C++ and Python code. We mainly modified the Python code, so the C++ part can use the pre-compiled version to speed up the process.

```bash
# Use pre-compiled C++ extensions (recommended)
VLLM_USE_PRECOMPILED=1 pip install -e .
```

- Switch branch: after compilation, switch to the branch that supports Step-Audio.

```bash
git checkout step-audio-2-mini
```

- Start the service:

```bash
# Ensure you are in the vllm directory and the virtual environment is activated
source .venv/bin/activate
python3 -m vllm.entrypoints.openai.api_server \
  --model ../Step-Audio-R1 \
  --served-model-name Step-Audio-R1 \
  --port 9999 \
  --host 0.0.0.0 \
  --max-model-len 65536 \
  --max-num-seqs 128 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-log-requests \
  --interleave-mm-strings \
  --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}'
```
After the service starts, it will listen on localhost:9999.
Get the example code and run it:
```bash
# Clone the repository containing example scripts
git clone https://github.com/stepfun-ai/Step-Audio-R1.git r1-scripts
cd r1-scripts

# Run the example
python examples-vllm_r1.py
```

If you find our paper and code useful in your research, please consider giving a star ⭐ and a citation 📝 :)
```bibtex
@article{tian2025step,
  title={Step-Audio-R1 Technical Report},
  author={Tian, Fei and Zhang, Xiangyu Tony and Zhang, Yuxin and Zhang, Haoyang and Li, Yuxin and Liu, Daijiao and Deng, Yayue and Wu, Donghang and Chen, Jun and Zhao, Liang and others},
  journal={arXiv preprint arXiv:2511.15848},
  year={2025}
}
```