Duplicated from herimor/voxtream


	---
	language:
	- en
	license: cc-by-4.0
	pipeline_tag: text-to-speech
	tags:
	- voxtream
	- text-to-speech
	---

	# Model Card for VoXtream

	VoXtream is a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.

	### Key features

	- Streaming: Supports a full-stream scenario in which the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
	- Speed: Runs 5 times faster than real time and achieves a first-packet latency of 102 ms on GPU.
	- Quality and efficiency: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models or models trained on large datasets.
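As a rough illustration of what these figures imply, the sketch below derives the average compute time per chunk, the real-time factor, and the chunk rate from the 80 ms chunk size and the stated speed-up of 5. The derived numbers are back-of-the-envelope inferences from the two quoted figures, not measurements, and the variable names are chosen here for illustration only.

```shell
#!/usr/bin/env bash
# Figures quoted in the feature list.
chunk_ms=80   # audio duration covered by one output chunk
speedup=5     # generation speed relative to real time

# Average compute time per chunk implied by the speed-up (an inference, not a measurement).
per_chunk_ms=$(awk -v c="$chunk_ms" -v s="$speedup" 'BEGIN { printf "%.1f", c / s }')

# Real-time factor: compute time divided by audio duration (lower is faster).
rtf=$(awk -v s="$speedup" 'BEGIN { printf "%.2f", 1 / s }')

# Number of chunks needed to cover one second of audio.
chunks_per_second=$(awk -v c="$chunk_ms" 'BEGIN { printf "%.1f", 1000 / c }')

echo "per-chunk compute: ${per_chunk_ms} ms, RTF: ${rtf}, chunks/s: ${chunks_per_second}"
```

Under these assumptions, each 80 ms chunk takes about 16 ms to generate on average, which leaves headroom for the playback buffer to stay ahead of the listener. The 102 ms first-packet latency is a separate quantity and is not derived here.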

	### Model Sources

	- Repository: [repo](https://github.com/herimor/voxtream)
	- Paper: [paper](https://arxiv.org/pdf/2509.15969)
	- Demo: [demo](https://herimor.github.io/voxtream)

	## Get started

	### Installation

	```bash
	pip install voxtream
	```

	### Usage

	#### Output streaming
	```bash
	voxtream \
	--prompt-audio assets/audio/male.wav \
	--prompt-text "The liquor was first created as 'Brandy Milk', produced with milk, brandy and vanilla." \
	--text "In general, however, some method is then needed to evaluate each approximation." \
	--output "output_stream.wav"
	```
	* Note: The initial run may take some time because the model weights have to be downloaded.
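To synthesize several sentences with the same voice prompt, the command above can be wrapped in a loop. The following is a minimal sketch that uses only the flags shown in this section. The function name `synthesize_batch`, the `VOXTREAM_BIN` override, and the `utt_NNN.wav` naming scheme are hypothetical helpers introduced here and are not part of the package.

```shell
#!/usr/bin/env bash
# VOXTREAM_BIN is a hypothetical override so the loop can be dry-run with a stub;
# by default it resolves to the `voxtream` command installed via pip.
VOXTREAM_BIN="${VOXTREAM_BIN:-voxtream}"

# synthesize_batch PROMPT_AUDIO PROMPT_TEXT SENTENCES_FILE OUT_DIR
# Reads one sentence per line and writes OUT_DIR/utt_001.wav, utt_002.wav, ...
# Prints the number of sentences processed.
synthesize_batch() {
  local prompt_audio=$1 prompt_text=$2 sentences_file=$3 out_dir=$4
  local n=0 line out
  mkdir -p "$out_dir"
  # The `|| [ -n "$line" ]` part keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue   # skip blank lines
    n=$((n + 1))
    out=$(printf '%s/utt_%03d.wav' "$out_dir" "$n")
    # stdin is redirected so the command cannot consume the sentence file.
    "$VOXTREAM_BIN" \
      --prompt-audio "$prompt_audio" \
      --prompt-text "$prompt_text" \
      --text "$line" \
      --output "$out" < /dev/null || return 1
  done < "$sentences_file"
  echo "$n"
}
```

A call such as `synthesize_batch assets/audio/male.wav "<transcript of the prompt audio>" sentences.txt out/` would then produce one WAV file per non-empty line. Each call starts a new process, so any per-run start-up cost is paid once per sentence; this sketch trades efficiency for simplicity.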

	#### Full streaming
	```bash
	voxtream \
	--prompt-audio assets/audio/female.wav \
	--prompt-text "Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her." \
	--text "Staff do not always do enough to prevent violence." \
	--output "full_stream.wav" \
	--full-stream
	```

	### Out-of-Scope Use

	Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.

	## Training Data

	The model was trained on a 9k-hour subset of the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets. The subset can be downloaded [here](https://huggingface.co/datasets/herimor/voxtream-train-9k). For more details, see the paper.

	## Citation

	```bibtex
	@article{torgashov2025voxtream,
	author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
	title = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
	journal = {arXiv:2509.15969},
	year = {2025}
	}
	```