Meituan LongCat

LongCat-AudioDiT

LongCat-AudioDiT is an open-weight, diffusion-based text-to-speech model from the Meituan LongCat team. The repository and paper describe a non-autoregressive TTS system that operates directly in waveform latent space rather than on mel-spectrograms, which reduces pipeline complexity and compounding errors. Its inference path combines a waveform VAE with a diffusion backbone, corrects a training-inference mismatch, and uses adaptive projection guidance rather than traditional classifier-free guidance. The repository releases code and model weights for research, provides a Hugging Face-compatible implementation and inference scripts, and reports state-of-the-art zero-shot voice cloning results on the Seed benchmark. The largest model, LongCat-AudioDiT-3.5B, is reported to improve speaker similarity on Seed-ZH and Seed-Hard over the previous Seed-TTS baseline.

Globally available · Full English UI · Limited API · Free · Trusted

At a glance

Overview
Meituan LongCat's open-weight diffusion TTS model that operates directly in waveform latent space for high-fidelity voice cloning.
Best fit
Researchers and speech teams evaluating open-source TTS, waveform-latent diffusion and zero-shot voice cloning.
Trust
2/2 sources verified, recently checked · 2026-05-18
Coverage
100/100 · backfill: freshness

Editorial verdict

Best for

Researchers and speech teams evaluating open-source TTS, waveform-latent diffusion and zero-shot voice cloning.

Avoid if

You need a turnkey production voice platform; runtime, rights and deployment constraints have to be validated first.

Why it matters

LongCat-AudioDiT belongs in AI Audio because it is a text-to-speech and voice-cloning model with released code and weights, not just a research paper.

Pricing

Open-source MIT repository and released model weights; inference runs locally or through a Hugging Face-compatible workflow

Payment

GitHub repository, Model weights download, Local inference, Hugging Face-compatible workflow

Commercial use

MIT covers the repo, but voice cloning rights, model weights and generated-audio use still need explicit review.

Privacy

Prompt audio handling, retained voice samples and generated audio storage should be reviewed before production use.

Use-case fit

Zero-shot voice cloning

Strong

Evaluate when you need prompt audio plus text to reproduce speaker style and voice similarity.

Research TTS benchmarking

Strong

Useful for comparing against Seed-TTS, CosyVoice, Qwen3-TTS and MiniMax speech baselines.

Open-source speech stack experiments

Medium

The model and scripts can be used to study waveform-latent diffusion inference and guidance methods.
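For readers studying the guidance methods mentioned above, the following is a minimal sketch of how adaptive projection guidance (APG) differs from classifier-free guidance (CFG), based on the general formulation in the guidance literature. It is not taken from the LongCat-AudioDiT code: the function names, the plain-list vectors and the `eta` parameter are illustrative assumptions, and the rescaling and momentum terms that some APG variants add are omitted. The paper and repository remain the authority on the exact variant the model uses.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cfg(pred_cond, pred_uncond, scale):
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    return [u + scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]


def apg(pred_cond, pred_uncond, scale, eta=0.0):
    """Projection-based guidance (illustrative sketch, not the repository's code).

    The guidance direction is split into a component parallel to the
    conditional prediction and a component orthogonal to it. The parallel
    component is scaled by eta (assumed parameter); eta=1 recovers CFG.
    """
    # Same guidance direction that CFG uses.
    diff = [c - u for c, u in zip(pred_cond, pred_uncond)]

    # Project the direction onto the conditional prediction.
    norm_sq = dot(pred_cond, pred_cond)
    coef = dot(diff, pred_cond) / norm_sq if norm_sq > 0 else 0.0
    parallel = [coef * c for c in pred_cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]

    # Keep the orthogonal part, down-weight the parallel part.
    update = [o + eta * p for o, p in zip(orthogonal, parallel)]
    return [c + (scale - 1.0) * g for c, g in zip(pred_cond, update)]


if __name__ == "__main__":
    cond, uncond = [1.0, 0.0], [0.0, 1.0]
    print("CFG:", cfg(cond, uncond, 3.0))
    print("APG (eta=0):", apg(cond, uncond, 3.0, eta=0.0))
    print("APG (eta=1):", apg(cond, uncond, 3.0, eta=1.0))
```

With `eta=1` the sketch reduces to CFG, which makes it easy to compare the two schemes at the same guidance scale when experimenting.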

Global user checklist

Registration · Confirmed · GitHub repository and released weights are public.
English UI · Confirmed · The repository README and paper materials are English-facing.
API and docs · Partial · Inference scripts and HF-compatible usage are documented, but there is no hosted API.
Commercial use · Review · MIT covers the repo, but model weights, voice rights and deployment usage still need separate review.
Data and privacy terms · Review · Voice cloning and prompt audio handling need consent and retention review.

Pros

  • Direct waveform-latent diffusion TTS pipeline simplifies generation
  • Reports SOTA zero-shot voice cloning on the Seed benchmark
  • Code, model weights and inference scripts are public

Cons

  • It is a research model, not a hosted speech API
  • Production use still depends on local GPU capacity, inference tuning and license review
  • Voice cloning and generated audio use still need consent review

Decision paths

minimax-audio

stepaudio

qwen-audio

zhipu-glm-audio

Sources

LongCat-AudioDiT GitHub repository

official · en · verified 2026-05-18

Confirms repository name, MIT license, model description, Seed benchmark results, code and weights release, and inference usage.

LongCat-AudioDiT paper PDF

docs · en · verified 2026-05-18

Source paper for waveform-latent diffusion, adaptive projection guidance (APG) and evaluation claims.
