StepFun's speech model family for contextual TTS, voice cloning, streaming ASR and near-realtime transcription.

StepAudio English UI and API

English UI: full · API: available

Name: StepAudio
Price: stepaudio-2.5-tts $0.85 / 10,000 characters; step-tts-2 $0.40 / 10,000 characters; ASR $0.022 / hour; voice cloning $1.50 / voice
Availability: Limited availability
Rating: 4.1 (3 reviews)

StepAudio covers StepFun's audio models. The English docs list StepAudio 2.5 TTS as a contextual text-to-speech model with natural-language control, emotional arcs and zero-shot voice cloning from about 3 seconds of reference audio. They also list step-tts-2 and step-tts-mini for TTS, stepaudio-2.5-asr for streaming and near-realtime transcription, and stepaudio-2-asr-pro, a 32B ASR Pro model.

Partially available · Full English UI · Public API · Paid · Trusted

At a glance

Best fit: Teams evaluating Chinese speech APIs for expressive TTS, voice cloning, dubbing, customer service, NPC dialogue and transcription.
Trust: 2/2 sources verified, recently checked · 2026-05-17
Coverage: 100/100 · backfill: freshness

Editorial verdict

Best for

Teams evaluating Chinese speech APIs for expressive TTS, voice cloning, dubbing, customer service, NPC dialogue and transcription.

Avoid if

Avoid it if you need an international audio workflow that is already fully validated and you cannot test signup, consent handling and rate limits yourself first.

Why it matters

StepAudio is a distinct capability line and should be visible in the AI Audio category, not hidden under the generic StepFun profile.

Pricing

- stepaudio-2.5-tts: $0.85 per 10,000 characters
- step-tts-2: $0.40 per 10,000 characters
- ASR: $0.022 per hour
- Voice cloning: $1.50 per voice
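
For budgeting, the listed rates can be turned into a rough estimate. The sketch below is illustrative only: the function name `estimate_cost` is hypothetical, and it assumes billing is strictly linear in characters, hours and voices, with no minimum charge, rounding rule or free quota. None of those details are confirmed by this profile, so check the pricing page before relying on the numbers.

```python
from decimal import Decimal

# Rates copied from the pricing listed above (USD).
TTS_RATE_PER_10K_CHARS = {
    "stepaudio-2.5-tts": Decimal("0.85"),
    "step-tts-2": Decimal("0.40"),
}
ASR_RATE_PER_HOUR = Decimal("0.022")
CLONE_RATE_PER_VOICE = Decimal("1.50")


def estimate_cost(tts_model, tts_chars, asr_hours=0, cloned_voices=0):
    """Return an estimated USD cost as a Decimal.

    Assumption: usage is billed linearly with no rounding or minimums.
    """
    tts = TTS_RATE_PER_10K_CHARS[tts_model] * Decimal(tts_chars) / Decimal(10_000)
    # Convert via str() so float hours such as 2.5 stay exact in Decimal.
    asr = ASR_RATE_PER_HOUR * Decimal(str(asr_hours))
    clone = CLONE_RATE_PER_VOICE * Decimal(cloned_voices)
    return tts + asr + clone


if __name__ == "__main__":
    # 50,000 characters on step-tts-2, 10 hours of ASR, 2 cloned voices.
    print(estimate_cost("step-tts-2", 50_000, asr_hours=10, cloned_voices=2))
```

Decimal is used instead of float so that small per-unit rates do not pick up binary rounding noise when summed.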

Payment

Open Platform balance, Step Plan quota for supported audio models

Commercial use

Commercial use should follow StepFun's audio API terms and any voice cloning consent requirements.

Privacy

Voice cloning and transcription can involve biometric or sensitive audio; consent, retention and data-processing terms need review.

Use-case fit

Expressive dubbing

Strong

Use contextual TTS for audiobooks, short drama dubbing, ad narration and emotional storytelling.

Streaming transcription

Strong

Use stepaudio-2.5-asr for captions, voice input, meeting transcription and backend batch processing.

Game NPC voice

Medium

The audio docs explicitly name game NPCs as a use case for audio-driven experiences.

Global user checklist

Registration · Partial · Audio APIs require a platform account and an API key.

English UI · Confirmed · English audio model and API docs are available.

API and docs · Confirmed · Docs cover TTS, streaming TTS, ASR, voice cloning and voice listing APIs.

Commercial use · Partial · Voice cloning consent and production data terms need policy review.

Pros

- Contextual TTS supports global and inline natural-language control
- Zero-shot voice cloning is documented
- Streaming ASR supports HTTP + SSE incremental output
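
The HTTP + SSE output noted in the last item can be read with any Server-Sent Events parser. Here is a minimal sketch, assuming the stream follows the standard SSE wire format (`data:` lines, with a blank line ending each event). The function name `iter_sse_data` is hypothetical, and the shape of each payload is not described in this profile, so payloads are returned as raw strings for the caller to decode.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete SSE event.

    `lines` is any iterable of text lines, for example the decoded
    lines of a streaming HTTP response body.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive; ignore it.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event left unterminated when the stream ends is dropped.


if __name__ == "__main__":
    sample = ["data: hel\n", "\n", ": ping\n", "data: hello\n", "\n"]
    for payload in iter_sse_data(sample):
        print(payload)
```

Keeping the parser separate from the HTTP client makes it easy to test against recorded streams before wiring it to a live endpoint.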

Cons

- TTS calls are limited to 1000 characters per request
- Voice cloning requires strict consent review before production
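
The 1000-character limit per TTS request means longer scripts have to be split client-side. This sketch shows one way to do that, breaking at sentence boundaries where possible. The function name `split_for_tts` is hypothetical, and it assumes the limit is counted in Unicode characters as measured by Python's `len()`; how the service actually counts characters is not confirmed here.

```python
import re

MAX_CHARS = 1000  # per-request limit listed in the Cons above

# A "piece" is a run of text ending in sentence punctuation (Latin or CJK)
# plus any trailing whitespace, or whatever text is left at the end.
_PIECE = re.compile(r".+?(?:[.!?。！？]+\s*|\Z)", flags=re.S)


def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars characters.

    Chunks break after sentences when possible. Joining the chunks
    with "" reproduces the original text exactly.
    """
    chunks, current = [], ""
    for piece in _PIECE.findall(text):
        # A single sentence longer than the limit is cut hard at the limit.
        while len(piece) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_chars])
            piece = piece[max_chars:]
        # Start a new chunk when this piece would overflow the current one.
        if len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "This is one sentence of narration. " * 100
    for i, chunk in enumerate(split_for_tts(script)):
        print(i, len(chunk))
```

Each chunk would then be sent as its own TTS request and the resulting audio concatenated in order. Splitting at sentence ends rather than at fixed offsets avoids cutting a word or phrase in the middle of synthesis.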

Decision paths

minimax-audio

qwen-audio

zhipu-glm-audio

seeduplex-audio

Sources

StepFun audio models

docs · en · verified 2026-05-17

Documents StepAudio 2.5 TTS, step-tts models and ASR models.

StepFun pricing and rate limits

pricing · en · verified 2026-05-17

Documents speech pricing.