---
license: apache-2.0
language:
- en
tags:
- audio-generation
- text-to-audio
- text-to-speech
- text-to-music
- sound-effects
- diffusion
library_name: transformers
pipeline_tag: text-to-audio
---

# Dasheng-AudioGen

[arXiv](https://arxiv.org/abs/2605.27838) |
[GitHub](https://github.com/xiaomi-research/dasheng-audiogen) |
[Hugging Face Model](https://huggingface.co/mispeech/Dasheng-AudioGen) |
[Hugging Face Space](https://huggingface.co/spaces/mispeech/Dasheng-AudioGen) |
[Demo Page](https://nieeim.github.io/Dasheng-AudioGen-Web/)
<!-- [](https://colab.research.google.com/#fileId=https://huggingface.co/mispeech/Dasheng-AudioGen/resolve/main/notebook.ipynb) -->

[**English**](./README.md) | [**中文**](./README_zh.md)

**Dasheng-AudioGen** is a unified audio generation model that can jointly synthesize **intelligible speech, music, sound effects, and environmental acoustics** from text descriptions.

<p align="center">
  <video
    src="https://github.com/user-attachments/assets/497f5688-8731-4830-8ee7-b9cf4234d900"
    controls
    autoplay
    muted
    loop
    playsinline
    width="85%">
  </video>
</p>

## Models

| Model | HuggingFace | Text Encoder | Language |
|-------|-------------|--------------|:--------:|
| Dasheng-AudioGen | [mispeech/Dasheng-AudioGen](https://huggingface.co/mispeech/Dasheng-AudioGen) | `google/flan-t5-large` | English |
| Dasheng-AudioGen-Multilingual | [mispeech/Dasheng-AudioGen-Multilingual](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | `google/mt5-large` | Multilingual |

## Installation

```bash
pip install torch torchaudio "transformers<5" einops
```

> Tested with Python 3.10, torch 2.8.0+cu128, and transformers 4.57. Not compatible with transformers 5.x.

## Prompt Format

Dasheng-AudioGen uses structured tags to describe the different aspects of an audio scene. A valid prompt **must start with the `<|caption|>` tag**, which provides the overall scene description. All other tags are optional and can be included as needed.

| Tag | Description | Required |
|-----|-------------|:--------:|
| `<\|caption\|>` | Overall audio scene description | Yes |
| `<\|speech\|>` | Speaker identity and speaking style | No |
| `<\|asr\|>` | Spoken transcript / dialogue | No |
| `<\|sfx\|>` | Sound effects | No |
| `<\|music\|>` | Background music | No |
| `<\|env\|>` | Environmental ambience | No |

**Rules:**

- The prompt must begin with `<|caption|>`; prompts without it are rejected.
- Only include tags that are relevant; omit tags with no content (e.g., skip `<|music|>` if there is no music).

> **Multilingual note:** When using the multilingual model, write all descriptive tags (`caption`, `speech`, `sfx`, `music`, `env`) in **English**. Only the `<|asr|>` field (the actual speech content to synthesize) uses the target language.
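
The tag rules above can be sketched as a small standalone helper. This is an illustration only, not the model's own `compose_prompt` implementation: the name `build_tagged_prompt` is hypothetical, and the tag ordering is an assumption taken from the pre-formatted example string in this README.

```python
# Hypothetical helper; it mirrors the documented tag format, not the model's internal code.
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_tagged_prompt(caption, **aspects):
    """Join non-empty aspects into a single tagged prompt string."""
    if not caption or not caption.strip():
        raise ValueError("caption is required: the prompt must begin with <|caption|>")
    unknown = set(aspects) - set(TAG_ORDER[1:])
    if unknown:
        raise ValueError(f"unknown aspect(s): {sorted(unknown)}")
    fields = {"caption": caption, **aspects}
    parts = []
    for tag in TAG_ORDER:
        text = fields.get(tag)
        if text and text.strip():  # omit tags with no content
            parts.append(f"<|{tag}|> {text.strip()}")
    return " ".join(parts)


prompt = build_tagged_prompt(
    caption="Thunder rolling in the distance.",
    env="Stormy night ambience.",
    music="",  # empty, so the <|music|> tag is skipped
)
print(prompt)
```

Because the caption is always emitted first, any string this helper returns satisfies the "must begin with `<|caption|>`" rule by construction.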

## Quick Start

### Usage 1: Aspect-wise Composition

Pass each aspect to `compose_prompt` as a keyword argument. The `caption` field is required; all other fields are optional.

```python
import torchaudio
from transformers import AutoModel

# trust_remote_code=True loads the custom model code shipped in the repository.
model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen", trust_remote_code=True).cuda()

prompt = model.compose_prompt(
    caption="A gritty detective narrating over the sound of heavy rain and a melancholic solo jazz saxophone.",
    speech="gritty deep male voice",
    music="melancholic solo saxophone",
    env="distant urban ambience",
    sfx="heavy rain hitting pavement",
    asr="The city never sleeps, but it sure knows how to cry.",
)

audio = model.generate(prompt)
# Save the result with a 16 kHz sample rate.
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Usage 2: Pre-formatted Prompt String

Pass a complete tagged string via the `prompt` parameter of `compose_prompt`. The string must start with `<|caption|>`.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen", trust_remote_code=True).cuda()

prompt = model.compose_prompt(
    prompt="<|caption|> A gritty detective narrating over the sound of heavy rain and a melancholic solo jazz saxophone. <|speech|> gritty deep male voice <|asr|> The city never sleeps, but it sure knows how to cry. <|sfx|> heavy rain hitting pavement <|music|> melancholic solo saxophone <|env|> distant urban ambience"
)

audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
```
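
Before passing a hand-written string, it can help to check that it follows the tag rules. The sketch below splits a tagged string into its fields with a regular expression and raises an error if the string does not start with `<|caption|>` or uses a tag that is not in the table. `parse_tagged_prompt` is a hypothetical helper written for illustration; the model's own validation may differ.

```python
import re

KNOWN_TAGS = {"caption", "speech", "asr", "sfx", "music", "env"}
TAG_PATTERN = re.compile(r"<\|(\w+)\|>")


def parse_tagged_prompt(prompt):
    """Split a tagged prompt into a {tag: text} dict, checking the documented rules."""
    text = prompt.strip()
    if not text.startswith("<|caption|>"):
        raise ValueError("prompt must start with <|caption|>")
    # re.split with one capture group yields ['', tag1, text1, tag2, text2, ...]
    pieces = TAG_PATTERN.split(text)
    fields = {}
    for tag, value in zip(pieces[1::2], pieces[2::2]):
        if tag not in KNOWN_TAGS:
            raise ValueError(f"unknown tag: <|{tag}|>")
        fields[tag] = value.strip()
    return fields


fields = parse_tagged_prompt(
    "<|caption|> A dog barking in a park <|sfx|> loud dog bark"
)
print(fields)
```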

### Batch Inference

Pass a list of prompts to `generate` to synthesize several clips in one call.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen", trust_remote_code=True).cuda()

prompts = [
    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
]

audios = model.generate(prompts)
for i, audio in enumerate(audios):
    # torchaudio.save expects (channels, samples), so add a channel dimension.
    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)
```

### Generation Parameters

`generate` accepts the sampling parameters below; the values shown are the defaults.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen", trust_remote_code=True).cuda()

prompt = model.compose_prompt(caption="A dog barking in a park")
audio = model.generate(
    prompts=prompt,
    num_steps=25,             # number of denoising steps (default: 25)
    guidance_scale=5.0,       # classifier-free guidance scale (default: 5.0)
    sway_sampling_coef=-1.0,  # sway sampling coefficient (default: -1.0, 0 for linear)
)
torchaudio.save("output.wav", audio.cpu(), 16000)
```
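
For some intuition about `sway_sampling_coef`, the sketch below assumes the coefficient follows the sway sampling formula introduced with F5-TTS, `t = u + s * (cos(pi/2 * u) - 1 + u)`, applied to uniformly spaced steps `u` in [0, 1]. This is an assumption: the README does not state the formula, and `sway_timesteps` is a hypothetical helper, not part of the model's API. Under that assumption, a coefficient of 0 leaves the schedule linear, while a negative coefficient places more steps near `t = 0`.

```python
import math


def sway_timesteps(num_steps, sway_sampling_coef):
    """Return num_steps + 1 time points in [0, 1] under the assumed sway formula."""
    points = []
    for i in range(num_steps + 1):
        u = i / num_steps  # uniform grid point
        t = u + sway_sampling_coef * (math.cos(math.pi / 2 * u) - 1 + u)
        points.append(t)
    return points


linear = sway_timesteps(4, 0.0)   # coefficient 0 keeps the uniform grid
swayed = sway_timesteps(4, -1.0)  # the default coefficient listed above
print([round(t, 3) for t in linear])
print([round(t, 3) for t in swayed])
```

With a coefficient of -1 the formula reduces to `t = 1 - cos(pi/2 * u)`, so both endpoints stay fixed and every interior point moves toward 0.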

## Acknowledgments

Dasheng-AudioGen was developed with contributions from **XIAOMI LLM PLUS** and **SJTU X-LANCE**.

## Citation

```bibtex
@article{mei2026dashengaudiogen,
  title = {Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text},
  author = {Jiahao Mei and Heinrich Dinkel and Yadong Niu and Xingwei Sun and Gang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan and Mengyue Wu},
  journal = {arXiv preprint arXiv:2605.27838},
  year = {2026}
}
```

## License

This project is released under the [Apache License 2.0](LICENSE).