StepFun
StepAudio
StepAudio covers StepFun's audio models. The English docs list StepAudio 2.5 TTS as a contextual text-to-speech model with natural-language control, emotional arcs and zero-shot voice cloning from about 3 seconds of reference audio. They also list step-tts-2 and step-tts-mini for TTS, stepaudio-2.5-asr for streaming and near-realtime transcription, and stepaudio-2-asr-pro, a 32B ASR Pro model.
At a glance
- Overview
- StepFun's speech model family for contextual TTS, voice cloning, streaming ASR and near-realtime transcription.
- Best fit
- Teams evaluating Chinese speech APIs for expressive TTS, voice cloning, dubbing, customer service, NPC dialogue and transcription.
- Trust
- 2/2 sources verified, recently checked · 2026-05-17
- Coverage
- 100/100
Editorial verdict
Best for
Teams evaluating Chinese speech APIs for expressive TTS, voice cloning, dubbing, customer service, NPC dialogue and transcription.
Avoid if
You need a fully validated international audio workflow and cannot first test signup, consent handling and rate limits yourself.
Why it matters
StepAudio is a distinct capability line and should be visible in the AI Audio category, not hidden under the generic StepFun profile.
Pricing
- stepaudio-2.5-tts: $0.85 / 10,000 characters
- step-tts-2: $0.40 / 10,000 characters
- ASR: $0.022 / hour
- Voice cloning: $1.50 / voice
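To turn the listed prices into a budget figure, a rough estimate can be sketched as below. The rates are copied from the pricing above. The function names are illustrative, and the sketch assumes billing is strictly proportional to usage with no rounding, minimum charge or free tier, which the source does not confirm.

```python
# Rates (USD) copied from the pricing listed above; check StepFun's current price list.
TTS_PRICE_PER_10K_CHARS = {
    "stepaudio-2.5-tts": 0.85,
    "step-tts-2": 0.40,
}
ASR_PRICE_PER_HOUR = 0.022
VOICE_CLONE_PRICE_PER_VOICE = 1.50


def tts_cost(model: str, characters: int) -> float:
    """Estimated TTS cost in USD, assuming strictly proportional billing."""
    return TTS_PRICE_PER_10K_CHARS[model] * characters / 10_000


def asr_cost(audio_hours: float) -> float:
    """Estimated ASR cost in USD for a number of audio hours."""
    return ASR_PRICE_PER_HOUR * audio_hours


def estimate_total(model: str, characters: int, audio_hours: float, cloned_voices: int) -> float:
    """Sum of TTS, ASR and one-off voice cloning costs (hypothetical helper)."""
    return (
        tts_cost(model, characters)
        + asr_cost(audio_hours)
        + VOICE_CLONE_PRICE_PER_VOICE * cloned_voices
    )


if __name__ == "__main__":
    # Example workload: 1M characters of TTS, 100 hours of ASR, 2 cloned voices.
    print(round(estimate_total("stepaudio-2.5-tts", 1_000_000, 100, 2), 2))
```

At these rates, one million characters on stepaudio-2.5-tts comes to about $85, versus about $40 on step-tts-2.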
Payment
Open Platform balance, Step Plan quota for supported audio models
Commercial use
Commercial use should follow StepFun's audio API terms and any voice cloning consent requirements.
Privacy
Voice cloning and transcription can involve biometric or sensitive audio; consent, retention and data-processing terms need review.
Use-case fit
Expressive dubbing
Strong. Use contextual TTS for audiobooks, short drama dubbing, ad narration and emotional storytelling.
Streaming transcription
Strong. Use stepaudio-2.5-asr for captions, voice input, meeting transcription and backend batch processing.
Game NPC voice
Medium. The audio docs explicitly mention game NPCs as an audio-driven experience use case.
Global user checklist
Pros
- Contextual TTS supports global and inline natural-language control
- Zero-shot voice cloning is documented
- Streaming ASR supports HTTP + SSE incremental output
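For the HTTP + SSE point, a client consumes incremental results by reading Server-Sent Events from the response body. The sketch below parses only the generic SSE wire format (`data:` lines, with a blank line ending each event). It does not call StepFun's API: the endpoint, request format and the structure of each payload are not described in the source and would have to come from the StepFun docs.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a stream of text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive; ignore it.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event not terminated by a blank line is dropped, as in the SSE specification.


if __name__ == "__main__":
    # Made-up sample stream; real payloads would follow StepFun's documented format.
    sample = ["data: partial text\n", "\n", ": keep-alive\n", "data: final text\n", "\n"]
    for payload in iter_sse_data(sample):
        print(payload)
```

In practice the `lines` argument would be the decoded line iterator of a streaming HTTP response, so partial transcripts can be shown as they arrive.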
Cons
- TTS calls are limited to 1000 characters per request
- Voice cloning requires strict consent review before production
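The 1000-character limit means longer scripts have to be split before synthesis. A minimal sketch of one way to do that follows, preferring sentence boundaries and falling back to a hard cut. It assumes the limit counts Unicode code points as Python's `len()` does, which the source does not specify, and `split_for_tts` is an illustrative helper, not part of any StepFun SDK.

```python
import re
from typing import List

MAX_TTS_CHARS = 1000  # per-request limit stated above


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    # Split after Latin or CJK sentence-ending punctuation, keeping it with its sentence.
    pieces = [p for p in re.split(r"(?<=[.!?。！？])", text) if p]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        # A single sentence longer than the limit is cut at fixed offsets.
        while len(piece) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:limit])
            piece = piece[limit:]
        # Start a new chunk when adding this piece would exceed the limit.
        if len(current) + len(piece) > limit:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "This is one sentence of narration. " * 100
    parts = split_for_tts(script)
    print(len(parts), max(len(p) for p in parts))
```

Each chunk is then sent as its own TTS request and the resulting audio is concatenated in order. Because the chunks preserve all characters, joining them reproduces the original text exactly.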
Decision paths
minimax-audio
qwen-audio
zhipu-glm-audio
seeduplex-audio
Sources
docs · en · verified 2026-05-17
Documents StepAudio 2.5 TTS, step-tts models and ASR models.