Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

Joycent is a diffusion-based Mandarin accent text-to-speech (TTS) framework that synthesizes accented speech directly from standard phone sequences and speech references without requiring accented phone prediction. It integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder.
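To make the CLN idea concrete, here is a minimal sketch in plain Python of conditional layer normalization in its commonly described form: the input is normalized, then scaled and shifted by values projected from a conditioning vector. It is not Joycent's implementation. The function names, the projection matrices `w_scale` and `w_bias`, and the use of a single conditioning vector `cond` (for instance, an accent embedding combined with a speaker embedding) are all assumptions made for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(w, c):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * cj for wij, cj in zip(row, c)) for row in w]


def conditional_layer_norm(x, cond, w_scale, w_bias):
    """Layer norm whose scale and bias are linear functions of `cond`.

    Hypothetical shapes: x has d entries, cond has k entries,
    w_scale and w_bias each have d rows of k entries.
    """
    normed = layer_norm(x)
    scale = matvec(w_scale, cond)  # per-feature scale from the condition
    bias = matvec(w_bias, cond)    # per-feature shift from the condition
    return [s * n + b for s, n, b in zip(scale, normed, bias)]
```

Because the scale and bias come from the conditioning vector rather than from fixed learned parameters, the same hidden state can be modulated differently for each accent and speaker.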

Usage

This repository stores the pretrained Joycent acoustic-model checkpoint (grad_210.pt). The model implementation and full inference instructions are available in the official GitHub repository.

You can download the checkpoint using the following snippet:

from huggingface_hub import hf_hub_download

# Download the acoustic-model checkpoint (or reuse the local cache)
# and return its path on disk.
checkpoint_path = hf_hub_download(
    repo_id="walston/joycent",
    filename="grad_210.pt",
)

Then pass the downloaded path to joycent/inference_joycent.py via the --acoustic-checkpoint argument. Full synthesis also requires the Joycent vocoder.
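As a rough sketch of that step, the helper below assembles the command line from a checkpoint path. Only the script path and the --acoustic-checkpoint flag come from the text above. The function names and the `extra_args` parameter are hypothetical, and any other arguments the script needs (input text, reference audio, vocoder checkpoint) are not documented here and should be taken from the official GitHub repository.

```python
import shlex
import sys

# Script path and flag name are taken from the instructions above.
INFERENCE_SCRIPT = "joycent/inference_joycent.py"


def build_inference_command(checkpoint_path, extra_args=()):
    """Return an argv list that passes the checkpoint to the inference script.

    `extra_args` is a hypothetical hook for whatever further options the
    script requires; none are assumed here.
    """
    return [
        sys.executable,
        INFERENCE_SCRIPT,
        "--acoustic-checkpoint",
        str(checkpoint_path),
        *extra_args,
    ]


def format_command(argv):
    """Render an argv list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

With `checkpoint_path` from the download snippet, `build_inference_command(checkpoint_path)` gives a list that can be handed to `subprocess.run` from the root of a clone of the GitHub repository, and `format_command` prints the equivalent shell command.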

Citation

If you find Joycent useful for your research, please cite:

@misc{wang2026joycentdiffusionbasedaccenttts,
      title={Joycent: Diffusion-based Accent TTS without Accented Phone Prediction},
      author={Xintong Wang and Ye Wang},
      year={2026},
      eprint={2606.16417},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
}