Chinese ASR, TTS, voice, music and realtime audio products for voice agents, dubbing, localization and media workflows.
Quick answers
At a glance
Matched tools
24 tools currently match this use case.
How to read this page
Prioritize products with explicit ASR, TTS, voice cloning, music generation, realtime audio or speech-model evidence.
24 matched tools
Kuaishou
Kling AI
4.7
Kling has an English-facing site and a broader creative and API platform beyond text-to-video generation.
Best fit · Creators, studios and growth teams that want a globally accessible Chinese creative studio for video, image, sound, effects and API-backed generation.
Coverage · 100/100
Globally available · Full English UI · Trusted · Public API · Freemium
LongCat-AudioDiT belongs in AI Audio because it is a direct-text-to-speech and voice-cloning model with released code and weights, not a generic research paper.
Best fit · Researchers and speech teams evaluating open-source TTS, waveform-latent diffusion and zero-shot voice cloning.
Coverage · 100/100 · backfill: freshness
Globally available · Full English UI · Trusted · Limited API · Free
Payment · GitHub repository / Model weights download
Checked · May 18
Sources · High confidence
From · Open-source MIT repository and released model weights; inference runs locally or through a Hugging Face-compatible workflow
Talkie belongs in the MiniMax international product map because the official site lists it as a product, but it is a consumer app rather than an API surface.
Best fit · Users comparing MiniMax's consumer character and companion AI distribution.
Coverage · 100/100
Partially available · Full English UI · Trusted · No API · Freemium
Baidu now publishes English model communications through the ERNIE Blog, while Qianfan remains the main platform when the enterprise platform, agent orchestration and China-cloud deployment matter.
Best fit · Enterprises and developers already evaluating Baidu Cloud, China-local deployment, agent platforms or ERNIE multimodal models.
Coverage · 100/100
Partially available · Partial English UI · Trusted · Public API · Paid
ByteDance Seed is now a broad model portfolio rather than a single Doubao API entry, so it should be tracked as a foundation-model and model-platform family.
Best fit · Developers and teams comparing ByteDance's English-facing Seed model roadmap with commercial Doubao/Ark API access.
Coverage · 100/100
Partially available · Full English UI · Trusted · Public API · Paid
Seedance 2.0 is ByteDance Seed's named video model and provides a direct way to track video capability, instead of tracking it only through Jimeng or generic Doubao/Ark.
Best fit · Creators and developers comparing Chinese video models with multimodal input, audio-video generation and API access.
Coverage · 100/100
Partially available · Full English UI · Trusted · Public API · Paid
Payment · BytePlus billing / Volcano Engine billing
Checked · May 17
Sources · High confidence
From · API and Try Now access are linked from the official page; pricing should be checked in BytePlus or Volcano Engine
Skywork should be tracked as a workspace platform because the public surfaces are organized around task-specific agents and output formats, not one generic chat flow.
Best fit · Knowledge workers who want one cloud workspace for research, writing, slides, sheets, websites and short-form media output.
Coverage · 100/100 · backfill: access signals
Globally available · Full English UI · Trusted · Unknown · Freemium
StepFun is important because it combines multimodal model depth, open-source releases and device commercialization, but overseas usability still needs hands-on checks.
Best fit · Teams evaluating Chinese multimodal models, open-source agent models, video/audio generation or device-side AI partnerships.
Coverage · 100/100
Partially available · Full English UI · Trusted · Public API · Paid