Chinese AI tools for speech and audio

Chinese ASR, TTS, voice, music and realtime audio products for voice agents, dubbing, localization and media workflows.

Quick answers

At a glance

What it covers: Chinese ASR, TTS, voice, music and realtime audio products for voice agents, dubbing, localization and media workflows.
Matched tools: 24.
How to read this page: Prioritize products with explicit ASR, TTS, voice cloning, music generation, realtime audio or speech-model evidence.

24 matched tools

Kuaishou

Kling AI

4.7

Kling has an English-facing site and a broader creative and API platform beyond text-to-video generation.

Best fit · Creators, studios and growth teams that want a globally accessible Chinese creative studio for video, image, sound, effects and API-backed generation.

Coverage · 100/100

Globally available · Full English UI · Trusted · Public API · Freemium

Payment: Card / Credits
Checked: May 19
Sources: High confidence
From: Free tier; paid credits vary by region

Alibaba Cloud

Qwen

4.6

Qwen Cloud makes Qwen available for global evaluation through an English marketplace, docs, pricing and compatible API paths.

Best fit · Developers evaluating Qwen3.6, Qwen Cloud APIs, coding agents and multimodal Chinese model coverage from an English international platform.

Coverage · 100/100

Globally available · Partial English UI · Trusted · Public API · Freemium

Payment: Free web access / Qwen Cloud billing
Checked: May 20
Sources: High confidence
From: Free tier, pay-as-you-go API usage and Token Plan subscriptions vary by model

Alibaba Cloud

Qwen Audio / CosyVoice

4.2

Qwen Cloud has enough official audio evidence to warrant a separate audio-category profile.

Best fit · Teams evaluating Chinese speech synthesis, voice cloning, ASR and realtime speech APIs through an English platform.

Coverage · 100/100

Globally available · Full English UI · Trusted · Public API · Freemium

Payment: Qwen Cloud billing / Token Plan where supported
Checked: May 17
Sources: High confidence
From: Free tier and pay-as-you-go speech API billing vary by model

Alibaba Cloud

Qwen Cloud Token Plan

4.0

Token Plan is a distinct commercial route for Qwen Cloud and affects how overseas developers actually consume the models.

Best fit · Developers who want a subscription-style Qwen Cloud route for coding tools and compatible agents.

Coverage · 96/100 · backfill: sources

Globally available · Full English UI · Partial evidence · Public API · Paid

Payment: Token Plan subscription / Qwen Cloud billing
Checked: May 17
Sources: Medium confidence
From: Standard, Pro and Max Token Plan tiers; plan prices and quotas should be checked live

Zhipu AI

Z.ai BigModel / GLM

4.4

Z.ai now has an English product surface, while BigModel remains the API evidence base for the full GLM product line.

Best fit · Developers comparing Chinese multimodal model APIs, agent services and OpenAI-compatible migration paths.

Coverage · 100/100

Partially available · Partial English UI · Trusted · Public API · Freemium

Payment: Free trial tokens / Platform billing
Checked: May 17
Sources: High confidence
From: 20 million free tokens are promoted on the English site; model pricing varies by API

Zhipu AI

GLM Audio

3.9

Audio is now a documented GLM capability family and should be visible in the audio category.

Best fit · Developers evaluating Chinese speech, voice clone, ASR and realtime multimodal APIs.

Coverage · 100/100

Partially available · Partial English UI · Trusted · Public API · Freemium

Payment: Free model where available / Platform billing
Checked: May 17
Sources: High confidence
From: Usage-based audio API pricing varies by model

Ant Group

Ant Ling

4.2

Ant Ling now has enough English-facing product, docs, pricing and integration evidence to be tracked alongside Qwen, DeepSeek, Kimi and GLM.

Best fit · Developers evaluating Chinese model APIs for long context, coding agents, reasoning models and OpenAI/Anthropic-compatible migration.

Coverage · 100/100

Partially available · Full English UI · Trusted · Public API · Freemium

Payment: Free daily quota / Pay-as-you-go API billing
Checked: May 17
Sources: High confidence
From: 500,000 free tokens daily per account; Ling-2.6-flash starts at ¥0.60 input / ¥1.80 output per 1M tokens; Ling-2.6-1T starts at ¥2.00 input / ¥16.00 output per 1M tokens

Ant Group

Ming

4.0

Ming is the multimodal branch of Ant Ling and deserves separate tracking from text-only Ling and reasoning-focused Ring.

Best fit · Teams tracking open Chinese full-modal models across image-text understanding, video analysis, speech synthesis and image generation.

Coverage · 100/100 · backfill: pricing

Partially available · Full English UI · Trusted · Limited API · Unknown

Payment: Ant Ling API billing where available / Open-source model access where available
Checked: May 17
Sources: High confidence
From: Ming pricing and API availability should be verified from current Ant Ling console and model docs

MiniMax

MiniMax API Platform

4.4

The international docs show MiniMax as a full multimodal API platform rather than only a Hailuo video product.

Best fit · Developers who want one China-origin platform for coding models, speech, video, image, music and multimodal agent tooling.

Coverage · 100/100

Globally available · Full English UI · Trusted · Public API · Freemium

Payment: Token Plan / Credits
Checked: May 17
Sources: High confidence
From: Token Plan, Credits and pay-as-you-go API billing vary by modality and model

MiniMax

MiniMax Audio / Speech

4.3

MiniMax Audio deserves a separate profile because the official API docs cover a mature speech product line beyond general model chat.

Best fit · Teams evaluating Chinese speech synthesis, voice cloning and multilingual audio generation APIs.

Coverage · 100/100

Globally available · Full English UI · Trusted · Public API · Freemium

Payment: Audio Subscription / Token Plan
Checked: May 17
Sources: High confidence
From: Audio Subscription, Token Plan quotas, Credits and pay-as-you-go billing vary by model

Meituan LongCat

LongCat-AudioDiT

4.2

LongCat-AudioDiT belongs in AI Audio because it is a direct text-to-speech and voice-cloning model with released code and weights, not a generic research paper.

Best fit · Researchers and speech teams evaluating open-source TTS, waveform-latent diffusion and zero-shot voice cloning.

Coverage · 100/100 · backfill: freshness

Globally available · Full English UI · Trusted · Limited API · Free

Payment: GitHub repository / Model weights download
Checked: May 18
Sources: High confidence
From: Open-source MIT repository and released model weights; inference runs locally or through a Hugging Face-compatible workflow

MiniMax

MiniMax Music

4.2

MiniMax Music is a distinct international product line in the official docs and should not be hidden inside a generic API profile.

Best fit · Creators and developers evaluating Chinese music generation APIs for songs, covers and app soundtracks.

Coverage · 100/100

Globally available · Full English UI · Trusted · Public API · Freemium

Payment: Token Plan / Credits
Checked: May 17
Sources: High confidence
From: Token Plan music quotas, Credits and pay-as-you-go billing vary by model

MiniMax

Talkie

4.0

Talkie belongs in the MiniMax international product map because the official site lists it as a product, but it is a consumer app rather than an API surface.

Best fit · Users comparing MiniMax's consumer character and companion AI distribution.

Coverage · 100/100

Partially available · Full English UI · Trusted · No API · Freemium

Payment: App billing / Free access where available
Checked: May 17
Sources: High confidence
From: Consumer app pricing may vary by region

Baidu AI Cloud

ERNIE / Baidu Qianfan

4.1

Baidu now has English model communications through the ERNIE Blog, while Qianfan remains the main platform when enterprise platform support, agent orchestration and China-cloud deployment matter.

Best fit · Enterprises and developers already evaluating Baidu Cloud, China-local deployment, agent platforms or ERNIE multimodal models.

Coverage · 100/100

Partially available · Partial English UI · Trusted · Public API · Paid

Payment: Baidu Cloud billing
Checked: May 17
Sources: High confidence
From: Baidu Cloud usage-based pricing

ByteDance / Volcano Engine

ByteDance Seed / Doubao Ark

4.2

ByteDance Seed is now a broad model portfolio rather than a single Doubao API entry, so it should be tracked as a foundation-model and model-platform family.

Best fit · Developers and teams comparing ByteDance's English-facing Seed model roadmap with commercial Doubao/Ark API access.

Coverage · 100/100

Partially available · Full English UI · Trusted · Public API · Paid

Payment: Volcano Engine billing
Checked: May 17
Sources: High confidence
From: Volcano Engine usage-based pricing

ByteDance / Volcano Engine

Seedance 2.0

4.4

Seedance 2.0 is ByteDance Seed's named video model and provides a direct way to track video capability, instead of tracking it only through Jimeng or the generic Doubao/Ark entry.

Best fit · Creators and developers comparing Chinese video models with multimodal input, audio-video generation and API access.

Coverage · 100/100

Partially available · Full English UI · Trusted · Public API · Paid

Payment: BytePlus billing / Volcano Engine billing
Checked: May 17
Sources: High confidence
From: API and Try Now access are linked from the official page; pricing should be checked in BytePlus or Volcano Engine

ByteDance / Volcano Engine

Seeduplex

3.9

Seeduplex gives ByteDance Seed a distinct voice-interaction profile beyond text, image and video models.

Best fit · Teams tracking Chinese full-duplex speech models, realtime voice agents and multimodal interaction research.

Coverage · 100/100 · backfill: pricing

Partially available · Full English UI · Trusted · Limited API · Unknown

Payment: BytePlus billing / Volcano Engine billing
Checked: May 17
Sources: High confidence
From: Voice model access and pricing should be verified through BytePlus or Volcano Engine

Skywork AI

Skywork

4.2

Skywork should be tracked as a workspace platform because the public surfaces are organized around task-specific agents and output formats, not one generic chat flow.

Best fit · Knowledge workers who want one cloud workspace for research, writing, slides, sheets, websites and short-form media output.

Coverage · 100/100 · backfill: access signals

Globally available · Full English UI · Trusted · Unknown · Freemium

Payment: Free plan / Paid tiers
Checked: May 22
Sources: High confidence
From: Free plan available; paid tiers vary by product

iFlytek

SparkDesk / iFlytek Spark

4.0

Spark is strategically important for speech and vertical applications, while overseas consumer readiness remains limited.

Best fit · Teams evaluating Chinese speech AI, education, healthcare or voice-heavy assistant workflows.

Coverage · 98/100 · backfill: sources

Partially available · Partial English UI · Partial evidence · Public API · Freemium

Payment: Alipay / WeChat Pay
Checked: May 14
Sources: Medium confidence
From: Spark Lite is free; paid token pricing varies by model

StepFun

StepFun / Step

4.3

StepFun is important because it combines multimodal model depth, open-source releases and device commercialization, but overseas usability still needs hands-on checks.

Best fit · Teams evaluating Chinese multimodal models, open-source agent models, video/audio generation or device-side AI partnerships.

Coverage · 100/100

Partially available · Full English UI · Trusted · Public API · Paid

Payment: Platform billing / Step Plan
Checked: May 29
Sources: High confidence
From: Usage-based API and Step Plan subscription paths

StepFun

StepFun Open Platform

4.2

The English platform makes StepFun more actionable for overseas developers than a company-only profile.

Best fit · Developers comparing Chinese model APIs for text, reasoning, tool calling, multimodal generation and OpenAI-compatible migration.

Coverage · 100/100

Partially available · Full English UI · Trusted · Public API · Paid

Payment: Account balance / Free credit first
Checked: May 29
Sources: High confidence
From: Reasoning models start at $0.10 input cache miss / $0.02 cache hit / $0.30 output per 1M tokens; image editing is $0.003 per image
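
To compare these starting rates against a real workload, a minimal Python sketch follows. It assumes that cached input tokens are billed at the cache-hit rate and all other input tokens at the cache-miss rate, which this listing does not confirm. The function name `estimate_request_cost` and its parameters are hypothetical, and the rates are the "starts at" figures quoted above, so live pricing may be higher.

```python
# Starting rates copied from the listing above (USD); check live pricing before use.
PER_MILLION = 1_000_000
INPUT_CACHE_MISS_USD = 0.10   # per 1M input tokens not served from cache
INPUT_CACHE_HIT_USD = 0.02    # per 1M input tokens served from cache
OUTPUT_USD = 0.30             # per 1M output tokens
IMAGE_EDIT_USD = 0.003        # per edited image


def estimate_request_cost(input_tokens, cached_tokens, output_tokens, image_edits=0):
    """Hypothetical helper: estimate USD cost for a reasoning-model workload.

    Assumption: cached_tokens is the part of input_tokens billed at the
    cache-hit rate; the remainder is billed at the cache-miss rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    miss_tokens = input_tokens - cached_tokens
    token_cost = (
        miss_tokens * INPUT_CACHE_MISS_USD
        + cached_tokens * INPUT_CACHE_HIT_USD
        + output_tokens * OUTPUT_USD
    ) / PER_MILLION
    return round(token_cost + image_edits * IMAGE_EDIT_USD, 6)


if __name__ == "__main__":
    # 2M input tokens (1.5M cached), 500k output tokens, no image edits.
    print(estimate_request_cost(2_000_000, 1_500_000, 500_000))
```

Under these assumptions the example workload costs about $0.23, most of it from output tokens.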

StepFun

StepAudio

4.1

StepAudio is a distinct capability line and should be visible in the AI Audio category, not hidden under the generic StepFun profile.

Best fit · Teams evaluating Chinese speech APIs for expressive TTS, voice cloning, dubbing, customer service, NPC dialogue and transcription.

Coverage · 100/100 · backfill: freshness

Partially available · Full English UI · Trusted · Public API · Paid

Payment: Open Platform balance / Step Plan quota for supported audio models
Checked: May 17
Sources: High confidence
From: stepaudio-2.5-tts $0.85 / 10,000 characters; step-tts-2 $0.40 / 10,000 characters; ASR $0.022 / hour; voice cloning $1.50 / voice
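
As a rough budgeting aid, the listed rates can be combined into one estimate. The Python sketch below is illustrative only: the helper name `estimate_audio_cost` and its parameters are hypothetical, the rates are copied from the line above, and it assumes simple linear billing with no minimums, rounding rules or Step Plan quota applied.

```python
# Rates copied from the listing above (USD); verify against live pricing.
TTS_USD_PER_CHARACTER = {
    "stepaudio-2.5-tts": 0.85 / 10_000,
    "step-tts-2": 0.40 / 10_000,
}
ASR_USD_PER_HOUR = 0.022
VOICE_CLONE_USD_PER_VOICE = 1.50


def estimate_audio_cost(tts_model, tts_characters, asr_hours=0.0, cloned_voices=0):
    """Hypothetical helper: estimate USD cost for a mixed audio workload.

    Assumption: every item is billed linearly at the listed rate.
    """
    tts_cost = TTS_USD_PER_CHARACTER[tts_model] * tts_characters
    asr_cost = ASR_USD_PER_HOUR * asr_hours
    cloning_cost = VOICE_CLONE_USD_PER_VOICE * cloned_voices
    return round(tts_cost + asr_cost + cloning_cost, 4)


if __name__ == "__main__":
    # 500,000 characters of TTS, 10 hours of transcription, 2 cloned voices.
    print(estimate_audio_cost("stepaudio-2.5-tts", 500_000, asr_hours=10, cloned_voices=2))
```

For this example workload the estimate is about $45.72, of which $42.50 is speech synthesis. The same text on step-tts-2 would cost $20.00 for synthesis.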

Xiaomi MiMo

Xiaomi MiMo

4.5

MiMo matters because Xiaomi is exposing an English-facing model platform across web, API, AI Studio and open-source channels.

Best fit · Developers comparing Chinese agent models, multimodal model families, speech models, English web demos and open-source deployment options.

Coverage · 100/100

Globally available · Full English UI · Trusted · Public API · Freemium

Payment: Card / Web Demo
Checked: May 17
Sources: High confidence
From: MiMo-V2-Flash blog lists $0.1 input / $0.3 output per 1M tokens; V2.5 model pricing should be checked inside the API platform

Xiaomi MiMo

MiMo Speech Models

4.0

MiMo now has enough English-facing speech signals to deserve a separate audio profile.

Best fit · Teams watching Xiaomi's speech stack for ASR, TTS and voice-agent experiments.

Coverage · 100/100 · backfill: pricing

Partially available · Full English UI · Trusted · Limited API · Unknown

Payment: API Platform billing / AI Studio
Checked: May 17
Sources: High confidence
From: Speech-model pricing not publicly visible on the English homepage; verify inside MiMo API Platform

Why these tools qualify

All 24 matched tools qualify on the same basis: each profile includes speech, audio, voice or music generation evidence.

Kling AI, Qwen, Qwen Audio / CosyVoice, Qwen Cloud Token Plan, Z.ai BigModel / GLM, GLM Audio, Ant Ling, Ming, MiniMax API Platform, MiniMax Audio / Speech, LongCat-AudioDiT, MiniMax Music, Talkie, ERNIE / Baidu Qianfan, ByteDance Seed / Doubao Ark, Seedance 2.0, Seeduplex, Skywork, SparkDesk / iFlytek Spark, StepFun / Step, StepFun Open Platform, StepAudio, Xiaomi MiMo, MiMo Speech Models.
