Models (Text-to-Audio)
Music & Sound Generation - Best Open Source models (MusicGen, Stable Audio, etc.)
Text-to-Audio • Updated • 30.9k • 1.43k
Note: **Stable Audio Open** – Open-source latent diffusion model for text-to-audio and audio-to-audio generation, trained on a permissively licensed dataset. Generates up to 47 s of 44.1 kHz stereo audio.
📄 **Paper:** [Stable Audio Open](https://arxiv.org/abs/2407.14358) (arXiv 2024)
📊 **Benchmark:** FD=17.24, KL=2.58, CLAP=0.49 on AudioCaps; competitive on text-to-music tasks
💻 **GitHub:** https://github.com/Stability-AI/stable-audio-tools
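A minimal sketch of driving Stable Audio Open through diffusers' `StableAudioPipeline`. The Hub repo id `stabilityai/stable-audio-open-1.0` is assumed (the listing above does not name it), and the duration-clamping helper is illustrative; the 47 s ceiling comes from the model card above.

```python
MAX_SECONDS = 47.0  # Stable Audio Open generates at most 47 s per clip


def clamp_duration(seconds: float, max_seconds: float = MAX_SECONDS) -> float:
    """Clamp a requested duration to the model's supported range."""
    return min(max(seconds, 0.0), max_seconds)


def generate(prompt: str, seconds: float = 10.0):
    # Heavy imports stay inside the function: this downloads several GB of
    # weights on first use and realistically needs a GPU.
    import torch
    from diffusers import StableAudioPipeline  # assumed repo id below

    pipe = StableAudioPipeline.from_pretrained(
        "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
    ).to("cuda")
    out = pipe(
        prompt,
        num_inference_steps=100,
        audio_end_in_s=clamp_duration(seconds),
    )
    return out.audios[0]  # stereo waveform at 44.1 kHz


if __name__ == "__main__":
    waveform = generate("128 bpm techno drum loop", seconds=60)  # clamped to 47 s
```

Clamping up front avoids a round-trip to the pipeline with a duration the model cannot honor.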
facebook/musicgen-large
Text-to-Audio • Updated • 24.3k • 525
Note: **MusicGen Large** – Autoregressive transformer (3.3B) conditioned on text and/or melody for high-quality music generation. Supports mono/stereo output up to 30 s at 32 kHz. Part of Meta's AudioCraft suite.
📄 **Paper:** [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) (NeurIPS 2023)
📊 **Benchmark:** MusicCaps FAD=3.8, KLD=0.94 (large model achieves best quality among MusicGen variants)
💻 **GitHub:** https://github.com/facebookresearch/audiocraft
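MusicGen is also available natively in transformers; a minimal sketch follows. Because the model is autoregressive over EnCodec tokens at a 50 Hz frame rate, clip length is set via the token budget rather than a seconds parameter, so the 30 s cap above corresponds to 1500 tokens. The helper and prompt are illustrative.

```python
# MusicGen's EnCodec codec runs at 50 Hz, i.e. 50 audio tokens per second
# of music, so the token budget determines the generated clip length.
FRAME_RATE = 50


def seconds_to_tokens(seconds: float, frame_rate: int = FRAME_RATE) -> int:
    """Translate a target clip duration into a max_new_tokens budget."""
    return int(seconds * frame_rate)


def generate(prompt: str, seconds: float = 10.0):
    # Heavy imports stay local: the large checkpoint is multiple GB.
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")
    inputs = processor(text=[prompt], padding=True, return_tensors="pt")
    audio = model.generate(**inputs, max_new_tokens=seconds_to_tokens(seconds))
    sr = model.config.audio_encoder.sampling_rate  # 32 kHz
    return audio[0].cpu().numpy(), sr


if __name__ == "__main__":
    wav, sr = generate("upbeat lo-fi hip hop with warm piano", seconds=8)
```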
ACE-Step/ACE-Step-v1-3.5B
Text-to-Audio • Updated • 722
Note: **ACE-Step** – Flow-matching music generation foundation model (3.5B) supporting text-to-music, lyrics-to-song, and controllable generation with fast inference. Developed by ACE Studio.
📄 **Paper:** [ACE-Step: A Step Towards Music Generation Foundation Model](https://arxiv.org/abs/2505.00233) (arXiv 2025)
📊 **Benchmark:** MusicCaps FAD=1.92, KL=1.18, CLAP=0.32 – SOTA on music generation at release (2025)
💻 **GitHub:** https://github.com/ace-step/ACE-Step
facebook/musicgen-melody
Text-to-Audio • 2B • Updated • 4.96k • 251
Note: **MusicGen Melody** – Melody-conditioned music generation model (1.5B) that generates music matching a reference melody or chroma features alongside a text prompt. Enables highly controllable music creation.
📄 **Paper:** [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) (NeurIPS 2023)
📊 **Benchmark:** MusicCaps FAD=4.93 (melody-conditioned); best controllability for melody-guided generation among all open models
💻 **GitHub:** https://github.com/facebookresearch/audiocraft
declare-lab/TangoFlux
Text-to-Audio • Updated • 347 • 105
Note: **TangoFlux** – Efficient text-to-audio generation built on a FLUX-style flow-matching transformer with CLAP-ranked preference optimization. Generates up to 30 s of high-quality audio with fast inference.
📄 **Paper:** [TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization](https://arxiv.org/abs/2412.21037) (arXiv 2024)
📊 **Benchmark:** AudioCaps FD=7.8, KL=1.22, CLAP=0.52 – best CLAP alignment score among open models at release
cvssp/audioldm2-large
Updated • 230k • 18
Note: **AudioLDM 2** – Latent diffusion model for text-to-audio/music generation via self-supervised pretraining (1.5B params, large variant).
📄 **Paper:** [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) (TASLP 2024)
📊 **Benchmark:** AudioCaps FD=5.83, KL=1.38, CLAP=0.43 (SOTA on text-to-audio at release)
💻 **GitHub:** https://github.com/haoheliu/audioldm2
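AudioLDM 2 has first-class diffusers support via `AudioLDM2Pipeline`; a minimal sketch, noting that the model decodes to 16 kHz mono audio. The sample-count helper and prompt are illustrative.

```python
SAMPLE_RATE = 16_000  # AudioLDM 2's vocoder outputs 16 kHz mono audio


def expected_samples(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Expected length of the returned waveform, for sizing downstream buffers."""
    return int(seconds * sample_rate)


def generate(prompt: str, seconds: float = 10.0):
    # Heavy imports stay local: weights are downloaded from the Hub on
    # first use and inference benefits from a GPU.
    import torch
    from diffusers import AudioLDM2Pipeline

    pipe = AudioLDM2Pipeline.from_pretrained(
        "cvssp/audioldm2-large", torch_dtype=torch.float16
    ).to("cuda")
    out = pipe(
        prompt,
        num_inference_steps=200,
        audio_length_in_s=seconds,
    )
    return out.audios[0]  # numpy array of length expected_samples(seconds)


if __name__ == "__main__":
    audio = generate("a hammer hitting a wooden surface", seconds=10.0)
```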
declare-lab/tango2
Text-to-Audio • Updated • 31 • 18
Note: **Tango 2** – Text-to-audio diffusion model aligned with Direct Preference Optimization (DPO) on Audio-Alpaca pairwise preference data, achieving better prompt adherence.
📄 **Paper:** [Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization](https://arxiv.org/abs/2404.09956) (arXiv 2024)
📊 **Benchmark:** AudioCaps FD=12.05, KL=1.31, CLAP=0.47, OVL=3.97, REL=4.00 (outperforms Tango on all metrics)
💻 **GitHub:** https://github.com/declare-lab/ta
facebook/audiogen-medium
Updated • 26.1k • 141
Note: **AudioGen** – Autoregressive transformer language model (1.5B) for text-to-general-audio (sound effects) generation using EnCodec discrete tokens. Part of Meta's AudioCraft suite.
📄 **Paper:** [AudioGen: Textually Guided Audio Generation](https://arxiv.org/abs/2209.15352) (ICLR 2023)
📊 **Benchmark:** AudioCaps FD=16.8, KL=2.06, FAD=2.86 (competitive at release on general audio)
💻 **GitHub:** https://github.com/facebookresearch/audiocraft
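A minimal sketch of batched sound-effect generation with the AudioCraft library's `AudioGen` class, following the calls in the AudioCraft README. The shape helper assumes AudioGen's 16 kHz EnCodec sample rate (per the paper); the prompts are illustrative.

```python
SAMPLE_RATE = 16_000  # AudioGen's EnCodec tokenizer operates on 16 kHz audio


def batch_samples(num_prompts: int, seconds: float, sample_rate: int = SAMPLE_RATE):
    """Expected (batch, samples) shape of the generated waveform batch."""
    return (num_prompts, int(seconds * sample_rate))


def generate_sfx(prompts, seconds: float = 5.0):
    # Heavy import stays local: the 1.5B checkpoint is fetched from the Hub
    # on first use and a GPU is strongly recommended.
    from audiocraft.models import AudioGen

    model = AudioGen.get_pretrained("facebook/audiogen-medium")
    model.set_generation_params(duration=seconds)  # seconds per clip
    # One clip per prompt: tensor of shape (batch, channels, samples).
    return model.generate(prompts)


if __name__ == "__main__":
    wavs = generate_sfx(["dog barking", "sirens of an emergency vehicle"])
```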
OpenSound/EzAudio
Text-to-Audio • Updated • 7
Note: **EzAudio** – Efficient text-to-audio generation using an optimized Diffusion Transformer (DiT-based) architecture with CFG rescaling and synthetic-caption pretraining for high-quality sound effects. (Tencent AI Lab & Johns Hopkins)
📄 **Paper:** [EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://arxiv.org/abs/2409.10819) (arXiv 2024)
📊 **Benchmark:** AudioCaps FD=15.96, KL=1.34, CLAP=0.49 – outperforms AudioLDM2 on CLAP score with fewer parameters