Joycent: Diffusion-based Accent TTS without Accented Phone Prediction
This repository hosts the ParallelWaveGAN vocoder used for Joycent Mandarin accent text-to-speech inference, as presented in the paper Joycent: Diffusion-based Accent TTS without Accented Phone Prediction.
The model generates 16 kHz audio from 80-bin mel spectrograms.
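As a quick sanity check before vocoding, the sketch below validates a mel spectrogram shape and estimates the length of the resulting audio. The 16 kHz rate and 80 mel bins come from this card; the (frames, mel_bins) layout, the helper names, and the hop size of 256 are assumptions for illustration only, so read the actual hop size from config.yml.

```python
SAMPLE_RATE = 16000  # stated in this model card
NUM_MELS = 80        # stated in this model card


def check_mel_shape(shape):
    """Return the frame count of a (frames, mel_bins) shape, or raise ValueError."""
    if len(shape) != 2 or shape[1] != NUM_MELS:
        raise ValueError(f"expected (frames, {NUM_MELS}), got {tuple(shape)}")
    return shape[0]


def expected_duration(num_frames, hop_size):
    """Seconds of audio for num_frames frames, with hop_size samples per frame."""
    return num_frames * hop_size / SAMPLE_RATE


# Hypothetical hop size; the real value is defined in config.yml.
frames = check_mel_shape((125, 80))
print(expected_duration(frames, hop_size=256))  # 2.0
```

A spectrogram stored as (mel_bins, frames) would fail this check and would need to be transposed first.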
Keep checkpoint-50000steps.pkl and config.yml in the same directory when loading the model with ParallelWaveGAN:
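import yaml
from parallel_wavegan.utils import load_model

# Read the configuration that ships next to the checkpoint.
with open("config.yml", encoding="utf-8") as file:
    config = yaml.load(file, Loader=yaml.Loader)

# Build the generator and restore its weights from the checkpoint.
vocoder = load_model("checkpoint-50000steps.pkl", config)
# Remove the weight-normalization reparameterization and switch to inference mode.
vocoder.remove_weight_norm()
vocoder.eval()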
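Once the vocoder has produced a waveform, it should be saved at the 16 kHz rate the model generates. The following is a minimal sketch using only the Python standard library; `write_wav_16k` is a hypothetical helper, not part of ParallelWaveGAN or Joycent, and it assumes the waveform has already been converted to a flat sequence of floats in [-1.0, 1.0] (for a PyTorch tensor, for example, via `.view(-1).cpu().numpy()`).

```python
import struct
import wave


def write_wav_16k(path, samples, sample_rate=16000):
    """Write floats in [-1.0, 1.0] as a mono 16-bit PCM WAV file.

    Returns the number of samples written.
    """
    pcm = bytearray()
    for sample in samples:
        # Clip out-of-range values so they cannot overflow 16-bit integers.
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return len(pcm) // 2
```

Writing with a different `sample_rate` than the one the vocoder was trained for would change the playback speed and pitch, which is why the default is fixed to 16000 here.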
The Joycent implementation and inference instructions are available in the official repository.
@misc{wang2026joycentdiffusionbasedaccenttts,
title={Joycent: Diffusion-based Accent TTS without Accented Phone Prediction},
author={Xintong Wang and Ye Wang},
year={2026},
eprint={2606.16417},
archivePrefix={arXiv},
primaryClass={cs.SD},
}