| | --- |
| | language: |
| | - en |
| | license: cc-by-4.0 |
| | pipeline_tag: text-to-speech |
| | tags: |
| | - voxtream |
| | - text-to-speech |
| | --- |
| | |
| | # Model Card for VoXtream |
| |
|
| | VoXtream is a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word. |
| |
|
| | ### Key features |
| |
|
| | - **Streaming**: Supports a full-stream scenario, in which the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks. |
| | - **Speed**: Runs **5x** faster than real time and achieves a first-packet latency of **102 ms** on GPU. |
| | - **Quality and efficiency**: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models and of models trained on large datasets. |
| |
|
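The figures above imply a rough per-chunk time budget. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted in this card (80 ms chunks, 5x real-time speed); it does not call the VoXtream package. The function names are illustrative, and reading "5x faster than real time" as an average real-time factor of 0.2 is an assumption.

```python
# Numbers quoted in this model card.
CHUNK_MS = 80.0   # audio duration covered by one output chunk
SPEEDUP = 5.0     # generation speed relative to real time (assumed to be an average)


def chunks_per_second(chunk_ms: float = CHUNK_MS) -> float:
    """Number of audio chunks needed to cover one second of speech."""
    return 1000.0 / chunk_ms


def compute_budget_ms(chunk_ms: float = CHUNK_MS, speedup: float = SPEEDUP) -> float:
    """Average compute time per chunk if generation runs `speedup` times faster than real time."""
    return chunk_ms / speedup


def generation_time_ms(audio_seconds: float, speedup: float = SPEEDUP) -> float:
    """Approximate time needed to generate `audio_seconds` of speech at the given speedup."""
    return audio_seconds * 1000.0 / speedup


if __name__ == "__main__":
    print(f"chunks per second of audio: {chunks_per_second()}")
    print(f"average compute per 80 ms chunk: {compute_budget_ms()} ms")
    print(f"time to generate 10 s of audio: {generation_time_ms(10)} ms")
```

Under these assumptions, each 80 ms chunk takes about 16 ms to produce on average, which leaves headroom for playback to stay ahead of generation once the first packet has arrived.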
| | ### Model Sources |
| |
|
| | - **Repository:** [repo](https://github.com/herimor/voxtream) |
| | - **Paper:** [paper](https://arxiv.org/pdf/2509.15969) |
| | - **Demo:** [demo](https://herimor.github.io/voxtream) |
| |
|
| | ## Get started |
| |
|
| | ### Installation |
| |
|
| | ```bash |
| | pip install voxtream |
| | ``` |
| |
|
| | ### Usage |
| |
|
| | #### Output streaming |
| | ```bash |
| | voxtream \ |
| | --prompt-audio assets/audio/male.wav \ |
| | --prompt-text "The liquor was first created as 'Brandy Milk', produced with milk, brandy and vanilla." \ |
| | --text "In general, however, some method is then needed to evaluate each approximation." \ |
| | --output "output_stream.wav" |
| | ``` |
| | * Note: The initial run may take some time because it downloads the model weights. |
| |
|
| | #### Full streaming |
| | ```bash |
| | voxtream \ |
| | --prompt-audio assets/audio/female.wav \ |
| | --prompt-text "Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her." \ |
| | --text "Staff do not always do enough to prevent violence." \ |
| | --output "full_stream.wav" \ |
| | --full-stream |
| | ``` |
| |
|
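If you want to launch the same commands from a Python script, one option is to assemble the argument list and hand it to `subprocess`. This is a minimal sketch, not part of the VoXtream package: the helper names `build_voxtream_command` and `run_voxtream` are hypothetical, and only the command-line flags shown in the examples above are used. Running it requires the `voxtream` command to be installed and the prompt audio file to exist.

```python
import shlex
import subprocess
from typing import List


def build_voxtream_command(
    prompt_audio: str,
    prompt_text: str,
    text: str,
    output: str,
    full_stream: bool = False,
) -> List[str]:
    """Assemble the CLI call using only the flags shown in this card (hypothetical helper)."""
    cmd = [
        "voxtream",
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "--text", text,
        "--output", output,
    ]
    if full_stream:
        # Same switch as in the "Full streaming" example.
        cmd.append("--full-stream")
    return cmd


def run_voxtream(cmd: List[str]) -> None:
    """Run the command and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_voxtream_command(
        prompt_audio="assets/audio/female.wav",
        prompt_text="Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her.",
        text="Staff do not always do enough to prevent violence.",
        output="full_stream.wav",
        full_stream=True,
    )
    # Print the shell-quoted command; call run_voxtream(cmd) to execute it.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with apostrophes and commas in the prompt text.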
| | ### Out-of-Scope Use |
| |
|
| | No organization or individual may use any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to the speech of government leaders, political figures, and celebrities. Failure to comply with this requirement could put you in violation of copyright laws. |
| |
|
| | ## Training Data |
| |
|
| | The model was trained on a 9k-hour subset of the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets. The subset can be downloaded [here](https://huggingface.co/datasets/herimor/voxtream-train-9k). For more details, please see our paper. |
| |
|
| | ## Citation |
| |
|
| | ```bibtex |
| | @article{torgashov2025voxtream, |
| | author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel}, |
| | title = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency}, |
| | journal = {arXiv:2509.15969}, |
| | year = {2025} |
| | } |
| | ``` |