{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dasheng-AudioGen \u2014 Notebook Demo\n", "\n", "This notebook walks through the audio-generation usage shown in the [README](./README.md). A CUDA-capable GPU is required.\n", "\n", "Each example takes a text description and produces an audio waveform that is saved to disk and played back inline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install torch torchaudio \"transformers<5\" einops" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Usage\n", "\n", "Load the model and generate audio from a single text prompt." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torchaudio\n", "from transformers import AutoModel\n", "from IPython.display import Audio\n", "\n", "model = AutoModel.from_pretrained(\"mispeech/Dasheng-AudioGen\", trust_remote_code=True).cuda()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audio = model.generate(\"A dog barking in a park\")\n", "torchaudio.save(\"output.wav\", audio.cpu(), 16000)\n", "Audio(\"output.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aspect-wise Prompt\n", "\n", "Use `compose_prompt` to describe different audio aspects separately." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompt = model.compose_prompt(\n", " caption=\"A gritty detective narrating over the sound of heavy rain and a melancholic solo jazz saxophone.\",\n", " speech=\"gritty deep male voice\",\n", " music=\"melancholic solo saxophone\",\n", " env=\"distant urban ambience\",\n", " sfx=\"heavy rain hitting pavement\",\n", " asr=\"The city never sleeps, but it sure knows how to cry.\",\n", ")\n", "audio = model.generate(prompt)\n", "torchaudio.save(\"output_detective.wav\", audio.cpu(), 16000)\n", "Audio(\"output_detective.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also pass a pre-formatted string that already contains aspect tags such as `<|caption|>`, `<|sfx|>`, and `<|env|>`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audio = model.generate(\n", " \"<|caption|> A helicopter passing overhead. <|sfx|> Rhythmic helicopter blade sounds. <|env|> Open sky ambience.\"\n", ")\n", "torchaudio.save(\"output_helicopter.wav\", audio.cpu(), 16000)\n", "Audio(\"output_helicopter.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batch Inference\n", "\n", "Pass a list of prompts to generate multiple audio clips in a single call." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompts = [\n", " model.compose_prompt(caption=\"A cat meowing softly.\", sfx=\"Soft cat meow.\"),\n", " model.compose_prompt(caption=\"Thunder rolling in the distance.\", env=\"Stormy night ambience.\"),\n", " model.compose_prompt(caption=\"A piano playing a gentle melody.\", music=\"Soft piano ballad.\"),\n", "]\n", "audios = model.generate(prompts)\n", "\n", "for i, audio in enumerate(audios):\n", " torchaudio.save(f\"output_{i}.wav\", audio.unsqueeze(0).cpu(), 16000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Audio(\"output_0.wav\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Audio(\"output_1.wav\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Audio(\"output_2.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generation Parameters\n", "\n", "Tune the denoising steps, classifier-free guidance scale, and sway sampling coefficient." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audio = model.generate(\n", " prompts=\"A dog barking in a park\",\n", " num_steps=25, # number of denoising steps (default: 25)\n", " guidance_scale=5.0, # classifier-free guidance scale (default: 5.0)\n", " sway_sampling_coef=-1.0, # sway sampling coefficient (default: -1.0, 0 for linear)\n", ")\n", "torchaudio.save(\"output_tuned.wav\", audio.cpu(), 16000)\n", "Audio(\"output_tuned.wav\")" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 5 }