{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dasheng-AudioGen \u2014 Notebook Demo\n",
    "\n",
    "This notebook walks through the audio-generation usage shown in the [README](./README.md). A CUDA-capable GPU is required.\n",
    "\n",
    "Each example takes a text description and produces an audio waveform that is saved to disk and played back inline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install torch torchaudio \"transformers<5\" einops"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage\n",
    "\n",
    "Load the model and generate audio from a single text prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchaudio\n",
    "from transformers import AutoModel\n",
    "from IPython.display import Audio\n",
    "\n",
    "model = AutoModel.from_pretrained(\"mispeech/Dasheng-AudioGen\", trust_remote_code=True).cuda()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio = model.generate(\"A dog barking in a park\")\n",
    "torchaudio.save(\"output.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aspect-wise Prompt\n",
    "\n",
    "Use `compose_prompt` to describe different audio aspects separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = model.compose_prompt(\n",
    "    caption=\"A gritty detective narrating over the sound of heavy rain and a melancholic solo jazz saxophone.\",\n",
    "    speech=\"gritty deep male voice\",\n",
    "    music=\"melancholic solo saxophone\",\n",
    "    env=\"distant urban ambience\",\n",
    "    sfx=\"heavy rain hitting pavement\",\n",
    "    asr=\"The city never sleeps, but it sure knows how to cry.\",\n",
    ")\n",
    "audio = model.generate(prompt)\n",
    "torchaudio.save(\"output_detective.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_detective.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also pass a pre-formatted string with tags directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio = model.generate(\n",
    "    \"<|caption|> A helicopter passing overhead. <|sfx|> Rhythmic helicopter blade sounds. <|env|> Open sky ambience.\"\n",
    ")\n",
    "torchaudio.save(\"output_helicopter.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_helicopter.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batch Inference\n",
    "\n",
    "Pass a list of prompts to generate multiple audios in a single call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    model.compose_prompt(caption=\"A cat meowing softly.\", sfx=\"Soft cat meow.\"),\n",
    "    model.compose_prompt(caption=\"Thunder rolling in the distance.\", env=\"Stormy night ambience.\"),\n",
    "    model.compose_prompt(caption=\"A piano playing a gentle melody.\", music=\"Soft piano ballad.\"),\n",
    "]\n",
    "audios = model.generate(prompts)\n",
    "\n",
    "for i, audio in enumerate(audios):\n",
    "    torchaudio.save(f\"output_{i}.wav\", audio.unsqueeze(0).cpu(), 16000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Audio(\"output_0.wav\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Audio(\"output_1.wav\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Audio(\"output_2.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generation Parameters\n",
    "\n",
    "Tune the denoising steps, classifier-free guidance scale, and sway sampling coefficient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio = model.generate(\n",
    "    prompts=\"A dog barking in a park\",\n",
    "    num_steps=25,              # number of denoising steps (default: 25)\n",
    "    guidance_scale=5.0,        # classifier-free guidance scale (default: 5.0)\n",
    "    sway_sampling_coef=-1.0,   # sway sampling coefficient (default: -1.0, 0 for linear)\n",
    ")\n",
    "torchaudio.save(\"output_tuned.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_tuned.wav\")"
   ]
  }
 ],
"metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}