Chinese AI tools for speech and audio

Chinese ASR, TTS, voice, music and realtime audio products for voice agents, dubbing, localization and media workflows.

At a glance

What it covers · Chinese ASR, TTS, voice, music and realtime audio products for voice agents, dubbing, localization and media workflows.

Matched tools · 24 tools currently match this use case.

How to read this page · Prioritize products with explicit ASR, TTS, voice cloning, music generation, realtime audio or speech-model evidence.
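
The prioritization rule above can be applied as a simple filter over the listing. The following is a minimal sketch, not part of the original page: the `Tool` record, the `AUDIO_CAPABILITIES` tag set and the `prioritize` helper are hypothetical names, and the capability tags on the three sample entries are an assumption read off each card's "Best fit" line. Only the tool names and ratings are taken from the cards on this page.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Capability keywords taken from the page's prioritization rule.
AUDIO_CAPABILITIES = {
    "asr", "tts", "voice cloning",
    "music generation", "realtime audio", "speech model",
}


@dataclass
class Tool:
    """Hypothetical record for one card in the listing."""
    name: str
    rating: float
    capabilities: Set[str] = field(default_factory=set)


def prioritize(tools: List[Tool]) -> List[Tool]:
    """Keep tools with at least one explicit audio capability.

    Tools are ordered by number of matching capabilities (descending),
    with the card rating as a tie-break (descending).
    """
    scored = [(len(t.capabilities & AUDIO_CAPABILITIES), t) for t in tools]
    explicit = [(n, t) for n, t in scored if n > 0]
    explicit.sort(key=lambda pair: (-pair[0], -pair[1].rating))
    return [t for _, t in explicit]


# Sample entries; capability tags are assumptions based on the "Best fit" text.
tools = [
    Tool("Qwen Audio / CosyVoice", 4.2,
         {"tts", "voice cloning", "asr", "realtime audio"}),
    Tool("MiniMax Music", 4.2, {"music generation"}),
    Tool("Qwen Cloud Token Plan", 4.0, set()),
]

ranked = [t.name for t in prioritize(tools)]
print(ranked)
```

In this sketch an entry with no explicit audio capability, such as a pricing plan, drops out of the ranked list even though it appears among the matched tools.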

24 matched tools

Kuaishou

Kling AI

4.7

Kling has an English-facing site and a broader creative and API platform beyond text-to-video generation.

Best fit · Creators, studios and growth teams that want a globally accessible Chinese creative studio for video, image, sound, effects and API-backed generation.

Coverage · 100/100

Globally available · Full English UI · Trusted

Alibaba Cloud

Qwen

4.6

Qwen Cloud makes Qwen available for global evaluation through an English marketplace, docs, pricing and compatible API paths.

Best fit · Developers evaluating Qwen3.6, Qwen Cloud APIs, coding agents and multimodal Chinese model coverage from an English international platform.

Coverage · 100/100

Globally available · Partial English UI · Trusted

Alibaba Cloud

Qwen Audio / CosyVoice

4.2

Qwen Cloud has enough official audio evidence to warrant a separate audio-category profile.

Best fit · Teams evaluating Chinese speech synthesis, voice cloning, ASR and realtime speech APIs through an English platform.

Coverage · 100/100

Globally available · Full English UI · Trusted

Alibaba Cloud

Qwen Cloud Token Plan

4.0

Token Plan is a distinct commercial route for Qwen Cloud and affects how overseas developers actually consume the models.

Best fit · Developers who want a subscription-style Qwen Cloud route for coding tools and compatible agents.

Coverage · 96/100 · backfill: sources

Globally available · Full English UI · Partial evidence

Zhipu AI

Z.ai BigModel / GLM

4.4

Z.ai now has an English product surface, while BigModel remains the API evidence base for the full GLM product line.

Best fit · Developers comparing Chinese multimodal model APIs, agent services and OpenAI-compatible migration paths.

Coverage · 100/100

Partially available · Partial English UI · Trusted

Zhipu AI

GLM Audio

3.9

Audio is now a documented GLM capability family and should be visible in the audio category.

Best fit · Developers evaluating Chinese speech, voice clone, ASR and realtime multimodal APIs.

Coverage · 100/100

Partially available · Partial English UI · Trusted

Ant Group

Ant Ling

4.2

Ant Ling now has enough English-facing product, docs, pricing and integration evidence to be tracked alongside Qwen, DeepSeek, Kimi and GLM.

Best fit · Developers evaluating Chinese model APIs for long context, coding agents, reasoning models and OpenAI/Anthropic-compatible migration.

Coverage · 100/100

Partially available · Full English UI · Trusted

Ant Group

Ming

4.0

Ming is the multimodal branch of Ant Ling and deserves separate tracking from text-only Ling and reasoning-focused Ring.

Best fit · Teams tracking open Chinese full-modal models across image-text understanding, video analysis, speech synthesis and image generation.

Coverage · 100/100 · backfill: pricing

Partially available · Full English UI · Trusted

MiniMax

MiniMax API Platform

4.4

The international docs show MiniMax as a full multimodal API platform rather than only a Hailuo video product.

Best fit · Developers who want one China-origin platform for coding models, speech, video, image, music and multimodal agent tooling.

Coverage · 100/100

Globally available · Full English UI · Trusted

MiniMax

MiniMax Audio / Speech

4.3

MiniMax Audio deserves a separate profile because the official API docs cover a mature speech product line beyond general model chat.

Best fit · Teams evaluating Chinese speech synthesis, voice cloning and multilingual audio generation APIs.

Coverage · 100/100

Globally available · Full English UI · Trusted

Meituan LongCat

LongCat-AudioDiT

4.2

LongCat-AudioDiT belongs in AI Audio because it is a direct-text-to-speech and voice-cloning model with released code and weights, not a generic research paper.

Best fit · Researchers and speech teams evaluating open-source TTS, waveform-latent diffusion and zero-shot voice cloning.

Coverage · 100/100 · backfill: freshness

Globally available · Full English UI · Trusted

MiniMax

MiniMax Music

4.2

MiniMax Music is a distinct international product line in the official docs and should not be hidden inside a generic API profile.

Best fit · Creators and developers evaluating Chinese music generation APIs for songs, covers and app soundtracks.

Coverage · 100/100

Globally available · Full English UI · Trusted

MiniMax

Talkie

4.0

Talkie belongs in the MiniMax international product map because the official site lists it as a product, but it is a consumer app rather than an API surface.

Best fit · Users comparing MiniMax's consumer character and companion AI distribution.

Coverage · 100/100

Partially available · Full English UI · Trusted

Baidu AI Cloud

ERNIE / Baidu Qianfan

4.1

Baidu now has English model communications through the ERNIE Blog, while Qianfan remains the main platform when enterprise platform, agent orchestration and China-cloud deployment matter.

Best fit · Enterprises and developers already evaluating Baidu Cloud, China-local deployment, agent platforms or ERNIE multimodal models.

Coverage · 100/100

Partially available · Partial English UI · Trusted

ByteDance / Volcano Engine

ByteDance Seed / Doubao Ark

4.2

ByteDance Seed is now a broad model portfolio rather than a single Doubao API entry, so it should be tracked as a foundation-model and model-platform family.

Best fit · Developers and teams comparing ByteDance's English-facing Seed model roadmap with commercial Doubao/Ark API access.

Coverage · 100/100

Partially available · Full English UI · Trusted

ByteDance / Volcano Engine

Seedance 2.0

4.4

Seedance 2.0 is ByteDance Seed's named video model and provides a direct way to track video capability instead of only through Jimeng or generic Doubao/Ark.

Best fit · Creators and developers comparing Chinese video models with multimodal input, audio-video generation and API access.

Coverage · 100/100

Partially available · Full English UI · Trusted

ByteDance / Volcano Engine

Seeduplex

3.9

Seeduplex gives ByteDance Seed a distinct voice-interaction profile beyond text, image and video models.

Best fit · Teams tracking Chinese full-duplex speech models, realtime voice agents and multimodal interaction research.

Coverage · 100/100 · backfill: pricing

Partially available · Full English UI · Trusted

Skywork AI

Skywork

4.2

Skywork should be tracked as a workspace platform because the public surfaces are organized around task-specific agents and output formats, not one generic chat flow.

Best fit · Knowledge workers who want one cloud workspace for research, writing, slides, sheets, websites and short-form media output.

Coverage · 100/100 · backfill: access signals

Globally available · Full English UI · Trusted

iFlytek

SparkDesk / iFlytek Spark

4.0

Spark is strategically important for speech and vertical applications, while overseas consumer readiness remains limited.

Best fit · Teams evaluating Chinese speech AI, education, healthcare or voice-heavy assistant workflows.

Coverage · 98/100 · backfill: sources

Partially available · Partial English UI · Partial evidence

StepFun

StepFun / Step

4.3

StepFun is important because it combines multimodal model depth, open-source releases and device commercialization, but overseas usability still needs hands-on checks.

Best fit · Teams evaluating Chinese multimodal models, open-source agent models, video/audio generation or device-side AI partnerships.

Coverage · 100/100

Partially available · Full English UI · Trusted

StepFun

StepFun Open Platform

4.2

The English platform makes StepFun more actionable for overseas developers than a company-only profile.

Best fit · Developers comparing Chinese model APIs for text, reasoning, tool calling, multimodal generation and OpenAI-compatible migration.

Coverage · 100/100

Partially available · Full English UI · Trusted

StepFun

StepAudio

4.1

StepAudio is a distinct capability line and should be visible in the AI Audio category, not hidden under the generic StepFun profile.

Best fit · Teams evaluating Chinese speech APIs for expressive TTS, voice cloning, dubbing, customer service, NPC dialogue and transcription.

Coverage · 100/100 · backfill: freshness

Partially available · Full English UI · Trusted

Xiaomi MiMo

Xiaomi MiMo

4.5

MiMo matters because Xiaomi is exposing an English-facing model platform across web, API, AI Studio and open-source channels.

Best fit · Developers comparing Chinese agent models, multimodal model families, speech models, English web demos and open-source deployment options.

Coverage · 100/100

Globally available · Full English UI · Trusted

Xiaomi MiMo

MiMo Speech Models

4.0

MiMo now has enough English-facing speech signals to deserve a separate audio profile.

Best fit · Teams watching Xiaomi's speech stack for ASR, TTS and voice-agent experiments.

Coverage · 100/100 · backfill: pricing

Partially available · Full English UI · Trusted

Why these tools qualify

Each of the following tools is relevant because its profile includes speech, audio, voice or music generation evidence:

Kling AI
Qwen
Qwen Audio / CosyVoice
Qwen Cloud Token Plan
Z.ai BigModel / GLM
GLM Audio
Ant Ling
Ming
MiniMax API Platform
MiniMax Audio / Speech
LongCat-AudioDiT
MiniMax Music
Talkie
ERNIE / Baidu Qianfan
ByteDance Seed / Doubao Ark
Seedance 2.0
Seeduplex
Skywork
SparkDesk / iFlytek Spark
StepFun / Step
StepFun Open Platform
StepAudio
Xiaomi MiMo
MiMo Speech Models
