How to use fluxions/vui with Vui:

```python
# !pip install git+https://github.com/fluxions-ai/vui
import torchaudio

from vui.inference import render
from vui.model import Vui

# Load the pretrained checkpoint onto the GPU
model = Vui.from_pretrained().cuda()

# Longer text tends to work better: the shorter the text, the less likely the model can cope
waveform = render(
    model,
    "Hey, here is some random stuff, usually something quite long as the shorter the text the less likely the model can cope!",
)
print(waveform.shape)
torchaudio.save("out.opus", waveform[0], 22050)
```
Update README for Vui Nano v1.0.0
README.md
CHANGED
|
@@ -1,42 +1,174 @@
|
|
| 1 |
---
|
| 2 |
-
license:
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
pipeline_tag: text-to-speech
|
| 6 |
library_name: vui
|
| 7 |
-
|
| 8 |
---
|
| 9 |
-
# vui
|
| 10 |
|
| 11 |
-
|
| 12 |
|
| 13 |
-
https://
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
-
#
|
| 18 |
|
| 19 |
```sh
|
| 20 |
-
|
| 21 |
```
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
```sh
|
| 26 |
-
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
Vui
|
| 32 |
-
Vui.ABRAHAM is a single speaker model that can reply with context awareness.
|
| 33 |
-
Vui.COHOST is checkpoint with two speakers that can talk to each other.
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
#
|
| 40 |
|
| 41 |
-
|
| 42 |
-
|
| 1 |
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
pipeline_tag: text-to-speech
|
| 6 |
library_name: vui
|
| 7 |
+
tags:
|
| 8 |
+
- text-to-speech
|
| 9 |
+
- tts
|
| 10 |
+
- speech-synthesis
|
| 11 |
+
- voice-cloning
|
| 12 |
+
- streaming
|
| 13 |
+
- conversational
|
| 14 |
+
- qwen3-tts
|
| 15 |
---
|
| 16 |
|
| 17 |
+
<p align="center">
|
| 18 |
+
<a href="https://fluxions.ai"><img src="https://raw.githubusercontent.com/fluxions-ai/vui/main/docs/fxlogo.png" alt="fluxions.ai" height="64"></a>
|
| 19 |
+
</p>
|
| 20 |
+
|
| 21 |
+
<h1 align="center">Vui: Streaming Conversational Voice Assistant</h1>
|
| 22 |
+
|
| 23 |
+
<p align="center"><em>Pronounced "vooey"</em> (rhymes with <em>Louie</em>) · by <a href="https://fluxions.ai">fluxions.ai</a></p>
|
| 24 |
+
|
| 25 |
+
<p align="center">
|
| 26 |
+
<a href="https://github.com/fluxions-ai/vui"><img src="https://img.shields.io/badge/GitHub-fluxions--ai%2Fvui-181717?logo=github&style=for-the-badge" alt="GitHub"></a>
|
| 27 |
+
<a href="https://discord.fluxions.ai"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&logoColor=white&style=for-the-badge" alt="Discord"></a>
|
| 28 |
+
</p>
|
| 29 |
+
|
| 30 |
+
**Full code, install, docs, and the streaming voice assistant: [github.com/fluxions-ai/vui](https://github.com/fluxions-ai/vui)**
|
| 31 |
|
| 32 |
+
**[Launch blog post](https://fluxions.ai/blog/vui-launch)**: design notes, demos, and what's next.
|
| 33 |
|
| 34 |
+
Vui is a real-time voice assistant: speak into your mic, and the server transcribes your speech, runs a local LLM, and streams a TTS reply back, all from a single Python server. It is built around **Vui Nano**, a 300M-parameter speech transformer built on the Qwen3-TTS codec and trained on conversational speech with breaths, laughter, hesitations, and multi-speaker dialogue.
|
| 35 |
|
| 36 |
+
## Features
|
| 37 |
+
|
| 38 |
+
- **Vui Nano (300M)**: Llama-style decoder + RQ-Transformer head over the Qwen3-TTS-12Hz codec
|
| 39 |
+
- **Real-time voice loop**: WebRTC + WebSocket pipeline (ASR → LLM → TTS) with a browser UI, VAD-driven turn taking, speculative LLM prefill while you're still speaking, and sentence-level TTS chunking with backpressure
|
| 40 |
+
- **Barge-in**: start talking mid-reply and the model cancels its reply and listens
|
| 41 |
+
- **Streaming TTS**: ~9× realtime on a 4090, bf16 inference, CUDA graphs
|
| 42 |
+
- **OpenAI Realtime API compatible**: drop-in `ws://…/v1/realtime` for clients written against OpenAI's spec ([`docs/realtime-api.md`](https://github.com/fluxions-ai/vui/blob/main/docs/realtime-api.md))
|
| 43 |
+
- **One-shot voice-note REST endpoint**: `POST /v1/voice-note` runs the whole ASR → LLM → TTS pipeline in a single HTTP call (audio in, JSON out)
|
| 44 |
+
- **Standalone TTS demo**: `demo.py` Gradio playground for the model on its own
|
| 45 |
+
- **Voice cloning**: upload an audio sample to clone any speaker; 4 fine-tuned presets shipped (`maeve`, `abraham`, `rhian`, `harry`)
|
| 46 |
+
- **SQ / WPS conditioning**: bias generation on six speech-quality channels and words-per-second
|
| 47 |
+
- **Hot-swap models**: pick the Ollama LLM and the ASR backend live from the UI
|
| 48 |
+
- **Pluggable ASR**: faster-whisper (GPU) or Moonshine (CPU streaming, ONNX)
|
| 49 |
+
- **Pluggable LLM backends**: Ollama, vLLM, any OpenAI-compatible endpoint
|
| 50 |
+
- **Memories**: the assistant remembers facts about you across sessions
|
| 51 |
+
- **Thoughts stream**: a parallel LLM routes voice intent to ~10 tools (memory ops, task control, delegation) without a wake-word grammar; pluggable for your own local tools
|
| 52 |
+
- **Optional Claude task server**: sidecar agent that handles slow/agentic work (Gmail, Calendar, Drive, Slack, web search) via your existing Claude Code MCPs
|
| 53 |
+
- **Apple Silicon support**: MLX backend (WIP)
|
| 54 |
+
- **Mobile-ready**: documented cloudflared and Tailscale paths for phone access with mic over HTTPS
|
| 55 |
+
- **Docker compose**: one file brings up the full stack
|
| 56 |
+
- **OpenClaw integration**: point OpenClaw's `openai` realtime provider at Vui for a fully-local voice front-end
|
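To make "sentence-level TTS chunking with backpressure" concrete, here is a minimal sketch of the general pattern using a bounded `asyncio.Queue`. It is not Vui's implementation: `split_sentences`, `produce`, `synthesize`, and `speak` are hypothetical names, the splitter is deliberately naive, and the `sleep` call stands in for real synthesis.

```python
import asyncio
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


async def produce(text: str, queue: asyncio.Queue) -> None:
    # put() suspends while the queue is full, so the producer can never
    # run more than `maxsize` sentences ahead of the synthesizer.
    for sentence in split_sentences(text):
        await queue.put(sentence)
    await queue.put(None)  # sentinel: no more sentences


async def synthesize(queue: asyncio.Queue, spoken: list[str]) -> None:
    while (sentence := await queue.get()) is not None:
        await asyncio.sleep(0.01)  # stand-in for real TTS work (assumption)
        spoken.append(sentence)


async def speak(text: str, max_pending: int = 2) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
    spoken: list[str] = []
    await asyncio.gather(produce(text, queue), synthesize(queue, spoken))
    return spoken


reply = "Sure thing. Let me check that for you! Anything else?"
chunks = asyncio.run(speak(reply))
print(chunks)
```

The bounded queue is what provides the backpressure: when synthesis falls behind, the upstream producer waits instead of buffering an unbounded amount of text.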
| 57 |
+
|
| 58 |
+
## Install (one-liner)
|
| 59 |
|
| 60 |
```sh
|
| 61 |
+
curl -fsSL https://install.fluxions.ai | bash
|
| 62 |
```
|
| 63 |
|
| 64 |
+
The script clones the repo into `~/vui`, auto-detects Docker vs. native, installs dependencies (uv, Ollama, ffmpeg, Claude Code CLI), pulls the model from this repo, and launches the stack on <http://localhost:8080>.
|
| 65 |
+
|
| 66 |
+
Full Docker compose / native install, mobile setup, configuration, ASR options, and the Claude task server are all covered in the **[GitHub README](https://github.com/fluxions-ai/vui)**.
|
| 67 |
+
|
| 68 |
+
## TTS demo on its own
|
| 69 |
|
| 70 |
```sh
|
| 71 |
+
git clone https://github.com/fluxions-ai/vui
|
| 72 |
+
cd vui
|
| 73 |
+
uv sync
|
| 74 |
+
python demo.py  # Gradio UI: upload your own voice prompt
|
| 75 |
+
python demo.py --render --prompt prompts/abraham.wav # CLI render with a preset voice
|
| 76 |
+
```
|
| 77 |
+
|
| 78 |
+
The Vui checkpoint and Qwen codec download automatically from this repo on first run.
|
| 79 |
+
|
| 80 |
+
### Preset voices
|
| 81 |
+
|
| 82 |
+
| Voice | Description |
|
| 83 |
+
|---|---|
|
| 84 |
+
| `maeve` | Recommended default. Female Irish accent, beautiful but may be hard for non-UK listeners |
|
| 85 |
+
| `abraham` | British, well-spoken, exciting energy and personality; conscientious, good at emotionally difficult subjects |
|
| 86 |
+
| `rhian` | More traditional British accent, slightly hesitant speaking style |
|
| 87 |
+
| `harry` | British male accent, mumbly |
|
| 88 |
+
|
| 89 |
+
More personalities coming soon! Got a voice or character you'd like to hear? Open an issue or let us know on [Discord](https://discord.fluxions.ai).
|
| 90 |
+
|
| 91 |
+
### Python API
|
| 92 |
+
|
| 93 |
+
```python
|
| 94 |
+
from vui.engine import Engine, GenConfig
|
| 95 |
+
|
| 96 |
+
engine = Engine.from_checkpoint("vui-nano.safetensors")
|
| 97 |
+
with engine.new_row() as row:
|
| 98 |
+
    audio = row.render(
|
| 99 |
+
        "So [breath] the thing about this is, it's not what you'd expect, right?",
|
| 100 |
+
        GenConfig(temperature=0.7),
|
| 101 |
+
    )
|
| 102 |
+
```
|
| 103 |
+
|
| 104 |
+
**Tip: try turning repetition penalty off.** `GenConfig` defaults to `rep_penalty=1.1` to break long silence/filler loops, but the penalty can flatten prosody and distort natural repetition. Setting it to `0` (anything `<= 1.0` disables the penalty path) often gives more natural-sounding output, so it is worth trying if generations sound stilted or over-corrected.
|
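For readers unfamiliar with the mechanism, the sketch below shows one common form of repetition penalty (the CTRL-style rule that rescales the logits of already-generated tokens). Whether Vui uses exactly this variant is an assumption; `apply_rep_penalty` is a hypothetical helper, and only the "`<= 1.0` disables the penalty" behaviour is taken from the tip above.

```python
def apply_rep_penalty(
    logits: list[float], history: list[int], penalty: float
) -> list[float]:
    """CTRL-style repetition penalty (illustrative, not Vui's actual code)."""
    if penalty <= 1.0:
        # Mirrors the behaviour described above: the penalty path is skipped.
        return list(logits)
    out = list(logits)
    for tok in set(history):
        # Make already-seen tokens less likely: shrink positive logits,
        # push negative logits further down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.0, -1.0, 0.5]
history = [0, 1, 0]  # tokens 0 and 1 were already generated

penalised = apply_rep_penalty(logits, history, penalty=2.0)
unchanged = apply_rep_penalty(logits, history, penalty=0.0)
print(penalised, unchanged)
```

With a penalty above 1.0, every token that has already appeared becomes less likely, which breaks loops but also discourages repetition that is natural in speech. That is the trade-off the tip describes.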
| 105 |
|
| 106 |
+
For long voice prompts (>15s) you need proper multi-segment chunking: `vui.prompt_utils.build_prompt_segments` does ASR + forced alignment + sentence-boundary splits at ~10s targets so the model keeps its speaker conditioning across the full reference. The full Python guide, covering chunked prompts, streaming, continuous batching, codes-only decode, and the MLX path, is in [`docs/python-api.md`](https://github.com/fluxions-ai/vui/blob/main/docs/python-api.md).
|
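As a rough illustration of the last step described above (splitting on sentence boundaries at roughly 10-second targets), here is a greedy grouping sketch. It is not `build_prompt_segments`: `group_sentences` is a hypothetical helper, the ASR and forced-alignment stages are skipped, and the sentence timestamps are made up.

```python
def group_sentences(
    spans: list[tuple[float, float]], target_s: float = 10.0
) -> list[tuple[float, float]]:
    """Greedily merge consecutive (start, end) sentence spans into segments.

    A segment is closed as soon as adding the next sentence would push it
    past target_s; a single over-long sentence stays as its own segment.
    """
    segments: list[tuple[float, float]] = []
    seg_start = seg_end = None
    for start, end in spans:
        if seg_start is None:
            seg_start, seg_end = start, end
        elif end - seg_start <= target_s:
            seg_end = end  # sentence still fits: extend the current segment
        else:
            segments.append((seg_start, seg_end))
            seg_start, seg_end = start, end
    if seg_start is not None:
        segments.append((seg_start, seg_end))
    return segments


# Made-up sentence timings (seconds) for a ~27 s reference clip.
sentence_spans = [(0.0, 4.0), (4.2, 9.5), (9.8, 13.0), (13.1, 21.0), (21.5, 27.0)]
segments = group_sentences(sentence_spans)
print(segments)
```

Cutting only at sentence boundaries keeps each segment's transcript aligned with its audio, which matters because the prompt transcript has to match the audio exactly.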
| 107 |
|
| 108 |
+
## Vui Nano
|
| 109 |
|
| 110 |
+
A 300M autoregressive LM over the Qwen3-TTS speech codec, and the first in the Vui model family. The codec and speaker encoder are reused from Alibaba's [`Qwen3-TTS-12Hz-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base).
|
| 111 |
|
| 112 |
+
- **300M parameters**, Llama-style decoder + RQ-Transformer head (768 dim, 22 layers, 8 heads)
|
| 113 |
+
- **Codec**: [Qwen3-TTS-Tokenizer-12Hz](https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz), with 16 codebooks of 2048 entries at 12.5 Hz and 24 kHz decoded audio; pure-PyTorch reimplementation in `src/vui/qwen_codec.py`
|
| 114 |
+
- **Speaker encoder**: ECAPA-TDNN from `Qwen3-TTS-12Hz-0.6B-Base` (8.9M params, 1024-dim), used at training time to embed reference speakers
|
| 115 |
+
- **Output**: 24 kHz audio, bf16 inference, ~9× realtime streaming on a 4090
|
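The codec figures in the list above imply a few derived numbers that are handy when reasoning about sequence lengths. The short calculation below uses only the stated values (12.5 Hz frames, 16 codebooks, 2048 entries, 24 kHz audio); the bitrate is the raw size of the code indices, assuming each index costs log2(2048) bits, and is not a measured figure.

```python
import math

FRAME_RATE_HZ = 12.5     # codec frames per second
N_CODEBOOKS = 16         # RVQ levels per frame
CODEBOOK_SIZE = 2048     # entries per codebook
SAMPLE_RATE_HZ = 24_000  # decoded audio sample rate

# Discrete codes the model has to produce per second of audio.
tokens_per_second = FRAME_RATE_HZ * N_CODEBOOKS

# Bits needed to index one codebook entry, and the resulting raw code bitrate.
bits_per_token = math.log2(CODEBOOK_SIZE)
bitrate_bps = tokens_per_second * bits_per_token

# Audio samples reconstructed from each codec frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

print(tokens_per_second, bits_per_token, bitrate_bps, samples_per_frame)
```

So each second of speech is 200 codes (about 2.2 kbit/s of raw indices), and every frame expands to 1920 audio samples at decode time.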
| 116 |
|
| 117 |
+
### Voices & voice cloning
|
| 118 |
|
| 119 |
+
**The model can clone arbitrary voices**: upload a sample in the demo UI (or drop a `.wav` into `prompts/`) and it will follow that speaker. **Cloned voices won't sound as good as the four fine-tuned voices** (`maeve`, `abraham`, `rhian`, `harry`) shipped in `prompts/`: the released checkpoint has been fine-tuned on those four, so they're the highest-quality output the model can produce. Arbitrary clones work, but expect lower naturalness, more drift, and some bias toward the fine-tuned speakers' prosody.
|
| 120 |
+
|
| 121 |
+
For best results, make sure the voice-prompt transcript matches the audio word-for-word, aim for **30 seconds or more** of clean source audio (the context window is 6 minutes), and remember: garbage in, garbage out. The full guide to voice prompts, supported tags ([breath], [laugh], [sigh] …), punctuation rules, and phonetic spelling for numbers/dates/units is in [`docs/prompting.md`](https://github.com/fluxions-ai/vui/blob/main/docs/prompting.md).
|
| 122 |
+
|
| 123 |
+
If you need a checkpoint tuned to a specific voice for a legitimate use case (audiobooks, accessibility, game characters, dubbing of consenting performers, internal tooling), **get in touch** via [fluxions.ai](https://fluxions.ai): we can train, license, or host one for you.
|
| 124 |
+
|
| 125 |
+
## Hardware
|
| 126 |
+
|
| 127 |
+
Streaming server and `demo.py` both run on either:
|
| 128 |
+
- **NVIDIA GPU + Linux**: ~**12 GB VRAM** for the full stack (TTS + ASR + Ollama LLM; tested on 4090 / H100), dropping to **~8 GB** if you switch to a `moonshine.*` (CPU) ASR backend. Requires CUDA 12.x with flash-attn installed.
|
| 129 |
+
- **Apple Silicon Mac**: M1/M2/M3/M4, MLX backend (auto-detected, no flash-attn required).
|
| 130 |
+
|
| 131 |
+
The full breakdown (measured per-component VRAM, ASR latency/VRAM per backend, KV-cache math, and tuning levers) is in [`docs/memory-budget.md`](https://github.com/fluxions-ai/vui/blob/main/docs/memory-budget.md).
|
| 132 |
+
|
| 133 |
+
**Tip: drop `n_codebooks` for faster TTS on smaller GPUs.** The RQ-Transformer head decodes 16 RVQ codebook levels per audio frame by default. Dropping the **Codebooks** slider in the UI (or `n_codebooks` in `DEFAULT_SETTINGS`) to **~10** gives noticeably faster decode and lower VRAM at the cost of some stability: occasional artefacts and more sensitivity to hard prompts. Below 8, quality drops sharply. `0` means "use all 16".
|
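To put a rough number on that trade-off, the sketch below counts how many codes the head has to emit per second of audio for a given setting, using the 12.5 Hz frame rate and 16-level default stated in this README. `effective_codebooks` and `codes_per_second` are hypothetical helpers; clamping values above 16 is an assumption, and the code count is only a proxy for decode cost, not a measured speedup.

```python
TOTAL_CODEBOOKS = 16   # RVQ levels decoded per frame by default
FRAME_RATE_HZ = 12.5   # codec frames per second


def effective_codebooks(n_codebooks: int) -> int:
    """Resolve the setting: 0 means 'use all 16' (hypothetical helper)."""
    if n_codebooks == 0:
        return TOTAL_CODEBOOKS
    # Assumption: values above the codec's 16 levels are clamped.
    return min(n_codebooks, TOTAL_CODEBOOKS)


def codes_per_second(n_codebooks: int) -> float:
    """Codes the head must emit per second of generated audio."""
    return effective_codebooks(n_codebooks) * FRAME_RATE_HZ


full = codes_per_second(0)      # all 16 levels
reduced = codes_per_second(10)  # the suggested ~10 setting
saving = 1 - reduced / full     # fraction of head decode steps skipped

print(full, reduced, saving)
```

Going from 16 to 10 levels skips 37.5% of the per-frame head steps, which is why the speedup is noticeable while the lost fine-detail levels show up as occasional artefacts.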
| 134 |
+
|
| 135 |
+
## Responsible use
|
| 136 |
+
|
| 137 |
+
Vui generates speech that can sound convincingly human. By using this model (directly, through the streaming server, or through the realtime API) you agree to the following:
|
| 138 |
+
|
| 139 |
+
We **explicitly prohibit**:
|
| 140 |
+
|
| 141 |
+
- **Fraud**: generating speech to deceive others for financial gain or to obtain something you would not otherwise be entitled to (scam calls, voice-auth bypass, etc.).
|
| 142 |
+
- **Misinformation or deception**: fake news, fraudulent calls, deepfakes intended to mislead, synthetic media presented as authentic recordings of real people.
|
| 143 |
+
- **Harassment, defamation, or abuse**: generating speech that targets, threatens, or harms others, including non-consensual sexual content.
|
| 144 |
+
- **Illegal activity**: anything unlawful in the jurisdiction where the model is run or its output is distributed.
|
| 145 |
+
|
| 146 |
+
You are responsible for what you generate. The released checkpoint is fine-tuned to a curated voice set in part to make these misuses harder, but it is not a substitute for your own judgment. If you build a product on top of Vui, build in consent flows, content provenance (e.g. [C2PA](https://c2pa.org/)), and abuse reporting.
|
| 147 |
+
|
| 148 |
+
We are **not responsible** for misuse, and we strongly condemn unethical applications of this technology.
|
| 149 |
+
|
| 150 |
+
## Attributions
|
| 151 |
+
|
| 152 |
+
- [Qwen3-TTS-Tokenizer](https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz) - Alibaba
|
| 153 |
+
- [Whisper](https://github.com/openai/whisper) - OpenAI
|
| 154 |
+
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
|
| 155 |
+
- [Moonshine](https://github.com/usefulsensors/moonshine) - Useful Sensors (CPU-streaming ASR option)
|
| 156 |
+
- [Silero VAD](https://github.com/snakers4/silero-vad)
|
| 157 |
+
- [aiortc](https://github.com/aiortc/aiortc)
|
| 158 |
+
- [Ollama](https://ollama.com)
|
| 159 |
+
|
| 160 |
+
## License
|
| 161 |
+
|
| 162 |
+
Apache 2.0, which applies to the code in the [GitHub repo](https://github.com/fluxions-ai/vui) and the released model weights. The Qwen3-TTS-Tokenizer-12Hz codec and `Qwen3-TTS-12Hz-0.6B-Base` speaker encoder are © Alibaba and licensed under the terms in their respective Hugging Face repos.
|
| 163 |
+
|
| 164 |
+
## Citation
|
| 165 |
+
|
| 166 |
+
```bibtex
|
| 167 |
+
@software{vui_2026,
|
| 168 |
+
author = {Coultas Blum, Harry},
|
| 169 |
+
title = {Vui: Streaming Conversational Text-to-Speech},
|
| 170 |
+
url = {https://github.com/fluxions-ai/vui},
|
| 171 |
+
version = {1.0.0},
|
| 172 |
+
year = {2026}
|
| 173 |
+
}
|
| 174 |
+
```
|