<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio</title>
<meta name="description" content="Demo page for DialogueSidon: joint restoration and separation of degraded two-speaker dialogue audio via an SSL-VAE latent space and a diffusion-based latent predictor." />
<link rel="stylesheet" href="style.css" />
</head>
<body>
<header class="hero">
<div class="container">
<h1>DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio</h1>
<p class="authors">
Wataru Nakata<sup>1,2</sup>,
Yuki Saito<sup>1,2</sup>,
Kazuki Yamauchi<sup>1</sup>,
Emiru Tsunoo<sup>1</sup>,
Hiroshi Saruwatari<sup>1</sup>
</p>
<p class="affiliation">
<sup>1</sup>The University of Tokyo, Tokyo, Japan&nbsp;&nbsp;
<sup>2</sup>National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
</p>
<nav class="actions">
<a class="btn" href="https://arxiv.org/abs/2604.09344" target="_blank" rel="noopener">Paper</a>
<a class="btn" href="https://huggingface.co/spaces/sarulab-speech/DialogueSidon-demo" target="_blank" rel="noopener">Live Demo</a>
</nav>
</div>
</header>
<main class="container">
<section id="abstract">
<h2>Abstract</h2>
<p>
Full-duplex dialogue audio, in which each speaker is recorded on a separate track,
is an important resource for spoken dialogue research, but is difficult to collect at
scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural
mixtures, which are unsuitable for systems requiring clean speaker-wise signals.
We propose <em>DialogueSidon</em>, a model for joint restoration and separation of
degraded two-speaker dialogue audio. DialogueSidon combines an SSL-VAE&mdash;which
compresses self-supervised speech features into a compact latent space&mdash;with a
diffusion-based latent predictor that recovers speaker-wise latent representations
from the degraded mixture. Experiments on English, multilingual, and in-the-wild
dialogue datasets show that DialogueSidon substantially improves intelligibility and
separation quality over a baseline, while also achieving much faster inference.
</p>
</section>
<section id="samples">
<h2>Audio Samples</h2>
<p class="note">
Each row presents the same recording in three versions. The <strong>noisy</strong>
column is the raw monaural mixture given as input to both models. <strong>GENESES</strong> is the
baseline. <strong>DialogueSidon</strong> is ours (D = 32). Separated outputs are encoded
as stereo: speaker 1 on the left channel, speaker 2 on the right.
</p>
<h3>English &mdash; Switchboard</h3>
<div class="sample-table-wrapper">
<table class="sample-table">
<thead>
<tr>
<th>Example</th>
<th>Noisy mixture</th>
<th>GENESES</th>
<th>DialogueSidon (ours)</th>
</tr>
</thead>
<tbody>
<tr>
<td>sw02007</td>
<td><div class="waveform" data-src="wav/swb/noisy/sw02007.wav"></div></td>
<td><div class="waveform" data-src="wav/swb/geneses/sw02007.wav"></div></td>
<td><div class="waveform" data-src="wav/swb/dialoguesidon/sw02007.wav"></div></td>
</tr>
<tr>
<td>sw02093</td>
<td><div class="waveform" data-src="wav/swb/noisy/sw02093.wav"></div></td>
<td><div class="waveform" data-src="wav/swb/geneses/sw02093.wav"></div></td>
<td><div class="waveform" data-src="wav/swb/dialoguesidon/sw02093.wav"></div></td>
</tr>
<tr>
<td>sw02157</td>
<td><div class="waveform" data-src="wav/swb/noisy/sw02157.wav"></div></td>
<td><div class="waveform" data-src="wav/swb/geneses/sw02157.wav"></div></td>
<td><div class="waveform" data-src="wav/swb/dialoguesidon/sw02157.wav"></div></td>
</tr>
</tbody>
</table>
</div>
<h3>Multilingual &mdash; CallFriend</h3>
<div class="sample-table-wrapper">
<table class="sample-table">
<thead>
<tr>
<th>Language</th>
<th>Noisy mixture</th>
<th>GENESES</th>
<th>DialogueSidon (ours)</th>
</tr>
</thead>
<tbody>
<tr>
<td>German</td>
<td><div class="waveform" data-src="wav/cf/noisy/deu_1082.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/geneses/deu_1082.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/dialoguesidon/deu_1082.wav"></div></td>
</tr>
<tr>
<td>English</td>
<td><div class="waveform" data-src="wav/cf/noisy/eng-n_4708.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/geneses/eng-n_4708.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/dialoguesidon/eng-n_4708.wav"></div></td>
</tr>
<tr>
<td>French</td>
<td><div class="waveform" data-src="wav/cf/noisy/fra-q_5110.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/geneses/fra-q_5110.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/dialoguesidon/fra-q_5110.wav"></div></td>
</tr>
<tr>
<td>Japanese</td>
<td><div class="waveform" data-src="wav/cf/noisy/jpn_0921.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/geneses/jpn_0921.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/dialoguesidon/jpn_0921.wav"></div></td>
</tr>
<tr>
<td>Spanish</td>
<td><div class="waveform" data-src="wav/cf/noisy/spa_1469.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/geneses/spa_1469.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/dialoguesidon/spa_1469.wav"></div></td>
</tr>
<tr>
<td>Mandarin</td>
<td><div class="waveform" data-src="wav/cf/noisy/zho-m_0941.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/geneses/zho-m_0941.wav"></div></td>
<td><div class="waveform" data-src="wav/cf/dialoguesidon/zho-m_0941.wav"></div></td>
</tr>
</tbody>
</table>
</div>
<h3>In-the-Wild &mdash; OpenDialog</h3>
<p class="note">
Dialogue recordings collected from the internet, with real, unknown degradations.
No clean reference exists for these clips.
</p>
<div class="sample-table-wrapper">
<table class="sample-table">
<thead>
<tr>
<th>Example</th>
<th>Noisy mixture</th>
<th>GENESES</th>
<th>DialogueSidon (ours)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Example 1</td>
<td><div class="waveform" data-src="wav/od/noisy/example_1.wav"></div></td>
<td><div class="waveform" data-src="wav/od/geneses/example_1.wav"></div></td>
<td><div class="waveform" data-src="wav/od/dialoguesidon/example_1.wav"></div></td>
</tr>
<tr>
<td>Example 2</td>
<td><div class="waveform" data-src="wav/od/noisy/example_2.wav"></div></td>
<td><div class="waveform" data-src="wav/od/geneses/example_2.wav"></div></td>
<td><div class="waveform" data-src="wav/od/dialoguesidon/example_2.wav"></div></td>
</tr>
<tr>
<td>Example 3</td>
<td><div class="waveform" data-src="wav/od/noisy/example_3.wav"></div></td>
<td><div class="waveform" data-src="wav/od/geneses/example_3.wav"></div></td>
<td><div class="waveform" data-src="wav/od/dialoguesidon/example_3.wav"></div></td>
</tr>
</tbody>
</table>
</div>
</section>
<section id="bibtex">
<h2>Citation</h2>
<pre class="bibtex"><code>[BibTeX entry will be provided upon publication.]</code></pre>
</section>
</main>
<footer>
<div class="container">
<p>Demo page accompanying the DialogueSidon preprint.
Code will be released upon publication.</p>
</div>
</footer>
<script type="module" src="script.js"></script>
</body>
</html>