Meituan LongCat
LongCat-AudioDiT
LongCat-AudioDiT is an open-weight, diffusion-based text-to-speech model from the Meituan LongCat team. The repository and paper describe a non-autoregressive TTS system that operates directly in waveform latent space instead of on mel-spectrograms, which reduces pipeline complexity and compounding errors. Its inference path combines a waveform VAE with a diffusion backbone, corrects a training-inference mismatch, and uses adaptive projection guidance rather than traditional classifier-free guidance. The repository releases code and model weights for research, provides a Hugging Face-compatible implementation and inference scripts, and reports state-of-the-art zero-shot voice cloning results on the Seed benchmark. The largest model, LongCat-AudioDiT-3.5B, is reported to improve speaker similarity on Seed-ZH and Seed-Hard over the previous Seed-TTS baseline.
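To make the guidance distinction concrete, here is a minimal sketch of the general idea behind projection-style guidance compared with classifier-free guidance. It is not taken from the LongCat-AudioDiT repository: the function names `cfg` and `apg`, the `eta` parameter, and the use of plain Python lists as stand-ins for model predictions are all assumptions for illustration, and the released model may add further steps such as momentum or norm rescaling.

```python
def cfg(cond, uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def apg(cond, uncond, scale, eta=0.0):
    """Projection-style guidance (illustrative sketch, hypothetical API).

    The guidance direction (cond - uncond) is split into a component
    parallel to the conditional prediction and a component orthogonal
    to it. The parallel part is scaled by eta before the update is
    applied; eta=1.0 recovers plain classifier-free guidance.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    norm_sq = sum(c * c for c in cond)
    if norm_sq == 0.0:
        # No direction to project onto; treat everything as orthogonal.
        parallel = [0.0] * len(diff)
    else:
        coeff = sum(d * c for d, c in zip(diff, cond)) / norm_sq
        parallel = [coeff * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    update = [o + eta * p for o, p in zip(orthogonal, parallel)]
    return [c + (scale - 1.0) * u for c, u in zip(cond, update)]


# Toy 2-D "predictions" standing in for latent-space model outputs.
cond = [3.0, 4.0]
uncond = [1.0, 1.0]
guided_cfg = cfg(cond, uncond, scale=2.0)
guided_apg = apg(cond, uncond, scale=2.0, eta=0.0)
```

With `eta=0.0` the update moves only orthogonally to the conditional prediction, so the guided output does not simply grow along the direction the prediction already points in. Whether LongCat-AudioDiT uses this exact formulation should be checked against the paper and repository.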
Quick answers
At a glance
- Overview
- Meituan LongCat's open-weight diffusion TTS model that operates directly in waveform latent space for high-fidelity voice cloning.
- Best fit
- Researchers and speech teams evaluating open-source TTS, waveform-latent diffusion and zero-shot voice cloning.
- Trust
- 2/2 sources verified, recently checked · 2026-05-18
- Coverage
- 100/100
Editorial verdict
Best for
Researchers and speech teams evaluating open-source TTS, waveform-latent diffusion and zero-shot voice cloning.
Avoid if
You need a turnkey production voice platform; runtime, rights and deployment constraints must be validated first.
Why it matters
LongCat-AudioDiT belongs in AI Audio because it is a direct text-to-speech and voice-cloning model with released code and weights, not a generic research paper.
Pricing
Open-source MIT repository and released model weights; inference runs locally or through a Hugging Face-compatible workflow
Payment
GitHub repository, Model weights download, Local inference, Hugging Face-compatible workflow
Commercial use
MIT covers the repo, but voice cloning rights, model weights and generated-audio use still need explicit review.
Privacy
Prompt audio handling, retained voice samples and generated audio storage should be reviewed before production use.
Use-case fit
Zero-shot voice cloning
Strong: Evaluate when you need prompt audio plus text to reproduce speaker style and voice similarity.
Research TTS benchmarking
Strong: Useful for comparing against Seed-TTS, CosyVoice, Qwen3-TTS and MiniMax speech baselines.
Open-source speech stack experiments
Medium: The model and scripts can be used to study waveform-latent diffusion inference and guidance methods.
Global user checklist
Pros
- Direct waveform-latent diffusion TTS pipeline simplifies generation
- Reports SOTA zero-shot voice cloning on the Seed benchmark
- Code, model weights and inference scripts are public
Cons
- It is a research model, not a hosted speech API
- Production use still depends on local GPU capacity, inference tuning and license review
- Voice cloning and generated audio use still need consent review
Decision paths
minimax-audio
stepaudio
qwen-audio
zhipu-glm-audio
Sources
official · en · verified 2026-05-18
Confirms repository name, MIT license, model description, Seed benchmark results, code and weights release, and inference usage.
docs · en · verified 2026-05-18
Source paper for the waveform-latent diffusion, APG guidance and evaluation claims.