{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dasheng-AudioGen \u2014 Notebook Demo\n",
    "\n",
    "This notebook walks through the audio-generation usage shown in the [README](./README.md). A CUDA-capable GPU is required.\n",
    "\n",
    "Each example takes a text description and produces an audio waveform that is saved to disk and played back inline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install torch torchaudio \"transformers<5\" einops"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage\n",
    "\n",
    "Load the model and generate audio from a single text prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchaudio\n",
    "from transformers import AutoModel\n",
    "from IPython.display import Audio\n",
    "\n",
    "model = AutoModel.from_pretrained(\"mispeech/Dasheng-AudioGen\", trust_remote_code=True).cuda()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio = model.generate(\"A dog barking in a park\")\n",
    "torchaudio.save(\"output.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aspect-wise Prompt\n",
    "\n",
    "Use `compose_prompt` to describe different audio aspects separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = model.compose_prompt(\n",
    "    caption=\"A gritty detective narrating over the sound of heavy rain and a melancholic solo jazz saxophone.\",\n",
    "    speech=\"gritty deep male voice\",\n",
    "    music=\"melancholic solo saxophone\",\n",
    "    env=\"distant urban ambience\",\n",
    "    sfx=\"heavy rain hitting pavement\",\n",
    "    asr=\"The city never sleeps, but it sure knows how to cry.\",\n",
    ")\n",
    "audio = model.generate(prompt)\n",
    "torchaudio.save(\"output_detective.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_detective.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also pass a pre-formatted string with tags directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio = model.generate(\n",
    "    \"<|caption|> A helicopter passing overhead. <|sfx|> Rhythmic helicopter blade sounds. <|env|> Open sky ambience.\"\n",
    ")\n",
    "torchaudio.save(\"output_helicopter.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_helicopter.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batch Inference\n",
    "\n",
    "Pass a list of prompts to generate multiple audios in a single call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    model.compose_prompt(caption=\"A cat meowing softly.\", sfx=\"Soft cat meow.\"),\n",
    "    model.compose_prompt(caption=\"Thunder rolling in the distance.\", env=\"Stormy night ambience.\"),\n",
    "    model.compose_prompt(caption=\"A piano playing a gentle melody.\", music=\"Soft piano ballad.\"),\n",
    "]\n",
    "audios = model.generate(prompts)\n",
    "\n",
    "for i, audio in enumerate(audios):\n",
    "    torchaudio.save(f\"output_{i}.wav\", audio.unsqueeze(0).cpu(), 16000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Audio(\"output_0.wav\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Audio(\"output_1.wav\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Audio(\"output_2.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generation Parameters\n",
    "\n",
    "Tune the number of denoising steps, the classifier-free guidance scale, and the sway sampling coefficient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio = model.generate(\n",
    "    prompts=\"A dog barking in a park\",\n",
    "    num_steps=25,              # number of denoising steps (default: 25)\n",
    "    guidance_scale=5.0,        # classifier-free guidance scale (default: 5.0)\n",
    "    sway_sampling_coef=-1.0,   # sway sampling coefficient (default: -1.0, 0 for linear)\n",
    ")\n",
    "torchaudio.save(\"output_tuned.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_tuned.wav\")"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}