harrycb committed on
Commit 61781fa · verified · 1 Parent(s): 01a1f42

Update README for Vui Nano v1.0.0

Files changed (1)
  1. README.md +152 -20
README.md CHANGED
@@ -1,42 +1,174 @@
1
  ---
2
- license: mit
3
  language:
4
  - en
5
  pipeline_tag: text-to-speech
6
  library_name: vui
7
-
 
 
 
 
 
 
 
8
  ---
9
- # vui
10
 
11
- [DEMO](https://fluxions.ai)
12
 
13
- https://github.com/fluxions-ai/vui
14
 
15
- Small Conversational speech models that can run on device
16
 
17
- # Installation
18
 
19
  ```sh
20
- uv pip install -e .
21
  ```
22
 
23
- # Demo
 
 
 
 
24
 
25
  ```sh
26
- python demo.py
27
- ````
28
 
29
- # Models
30
 
31
- Vui.BASE is the base checkpoint, trained on 40k hours of audio conversations
32
- Vui.ABRAHAM is a single speaker model that can reply with context awareness.
33
- Vui.COHOST is a checkpoint with two speakers that can talk to each other.
34
 
35
- # Voice Cloning
36
 
37
- You can clone with the base model quite well, but it's not perfect, as it hasn't seen that much audio and wasn't trained for long
 
 
 
38
 
39
- # FAQ
40
 
41
- 1) Was developed on two 4090s https://x.com/harrycblum/status/1752698806184063153
42
- 2) Hallucinations: yes the model does hallucinate, but this is the best I could do with limited resources! :(
1
  ---
2
+ license: apache-2.0
3
  language:
4
  - en
5
  pipeline_tag: text-to-speech
6
  library_name: vui
7
+ tags:
8
+ - text-to-speech
9
+ - tts
10
+ - speech-synthesis
11
+ - voice-cloning
12
+ - streaming
13
+ - conversational
14
+ - qwen3-tts
15
  ---
 
16
 
17
+ <p align="center">
18
+ <a href="https://fluxions.ai"><img src="https://raw.githubusercontent.com/fluxions-ai/vui/main/docs/fxlogo.png" alt="fluxions.ai" height="64"></a>
19
+ </p>
20
+
21
+ <h1 align="center">Vui: Streaming Conversational Voice Assistant</h1>
22
+
23
+ <p align="center"><em>Pronounced "vooey"</em> (rhymes with <em>Louie</em>) · by <a href="https://fluxions.ai">fluxions.ai</a></p>
24
+
25
+ <p align="center">
26
+ <a href="https://github.com/fluxions-ai/vui"><img src="https://img.shields.io/badge/GitHub-fluxions--ai%2Fvui-181717?logo=github&style=for-the-badge" alt="GitHub"></a>
27
+ <a href="https://discord.fluxions.ai"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&logoColor=white&style=for-the-badge" alt="Discord"></a>
28
+ </p>
29
+
30
+ 👉 **Full code, install, docs, and the streaming voice assistant: [github.com/fluxions-ai/vui](https://github.com/fluxions-ai/vui)**
31
 
32
+ 📖 **[Launch blog post](https://fluxions.ai/blog/vui-launch)**: design notes, demos, and what's next.
33
 
34
+ Vui is a real-time voice assistant: speak into your mic, and the server transcribes your speech, runs a local LLM, and streams a TTS reply back, all from a single Python server. It is built around **Vui Nano**, a 300M-parameter speech transformer built on the Qwen3-TTS codec and trained on conversational speech with breaths, laughter, hesitations, and multi-speaker dialogue.
35
 
36
+ ## Features
37
+
38
+ - **Vui Nano (300M)**: Llama-style decoder + RQ-Transformer head over the Qwen3-TTS-12Hz codec
38
+ - **Real-time voice loop**: WebRTC + WebSocket pipeline (ASR → LLM → TTS) with a browser UI, VAD-driven turn taking, speculative LLM prefill while you're still speaking, and sentence-level TTS chunking with backpressure
39
+ - **Barge-in**: start talking mid-reply and the model cancels its reply and listens
40
+ - **Streaming TTS**: ~9× realtime on a 4090, bf16 inference, CUDA graphs
41
+ - **OpenAI Realtime API compatible**: drop-in `ws://…/v1/realtime` for clients written against OpenAI's spec ([`docs/realtime-api.md`](https://github.com/fluxions-ai/vui/blob/main/docs/realtime-api.md))
42
+ - **One-shot voice-note REST endpoint**: `POST /v1/voice-note` runs the whole ASR → LLM → TTS pipeline in a single HTTP call (audio in, JSON out)
43
+ - **Standalone TTS demo**: `demo.py` Gradio playground for the model on its own
44
+ - **Voice cloning**: upload an audio sample to clone any speaker; 4 fine-tuned presets shipped (`maeve`, `abraham`, `rhian`, `harry`)
45
+ - **SQ / WPS conditioning**: bias generation on six speech-quality channels and words-per-second
46
+ - **Hot-swap models**: pick the Ollama LLM and ASR backend live from the UI
47
+ - **Pluggable ASR**: faster-whisper (GPU) or Moonshine (CPU streaming, ONNX)
48
+ - **Pluggable LLM backends**: Ollama, vLLM, any OpenAI-compatible endpoint
49
+ - **Memories**: the assistant remembers facts about you across sessions
50
+ - **Thoughts stream**: a parallel LLM routes voice intent to ~10 tools (memory ops, task control, delegation) without a wake-word grammar; pluggable for your own local tools
51
+ - **Optional Claude task server**: sidecar agent that handles slow/agentic work (Gmail, Calendar, Drive, Slack, web search) via your existing Claude Code MCPs
52
+ - **Apple Silicon support**: MLX backend (WIP)
53
+ - **Mobile-ready**: documented cloudflared and Tailscale paths for phone access with mic over HTTPS
54
+ - **Docker compose**: one file brings up the full stack
55
+ - **OpenClaw integration**: point OpenClaw's `openai` realtime provider at Vui for a fully-local voice front-end
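
As a rough illustration of the "sentence-level TTS chunking with backpressure" bullet, the sketch below feeds sentences to a synthesizer through a bounded `asyncio.Queue`, so the text producer stalls whenever synthesis falls behind. This is not Vui's implementation: `split_sentences`, `stream_reply`, `max_pending`, and the `synthesize` callback are hypothetical names, and the regex splitter is a naive stand-in for real sentence segmentation.

```python
import asyncio
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


async def _produce(text, queue):
    for sentence in split_sentences(text):
        # put() waits while the queue is full: this is the backpressure.
        await queue.put(sentence)
    await queue.put(None)  # sentinel: no more sentences


async def _consume(queue, synthesize):
    chunks = []
    while (sentence := await queue.get()) is not None:
        chunks.append(await synthesize(sentence))
    return chunks


async def stream_reply(text, synthesize, max_pending=2):
    """Synthesize `text` sentence by sentence with at most `max_pending` queued."""
    queue = asyncio.Queue(maxsize=max_pending)
    _, chunks = await asyncio.gather(_produce(text, queue), _consume(queue, synthesize))
    return chunks
```

In a real pipeline the producer would be the LLM token stream and `synthesize` the TTS call; the bounded queue keeps the LLM from running far ahead of the audio.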
57
+
58
+ ## Install (one-liner)
59
 
60
  ```sh
61
+ curl -fsSL https://install.fluxions.ai | bash
62
  ```
63
 
64
+ Clones into `~/vui`, auto-detects Docker vs. native, installs deps (uv, Ollama, ffmpeg, Claude Code CLI), pulls the model from this repo, and launches the stack on <http://localhost:8080>.
65
+
66
+ Full Docker compose / native install, mobile setup, configuration, ASR options, and the Claude task server are all covered in the **[GitHub README](https://github.com/fluxions-ai/vui)**.
67
+
68
+ ## TTS demo on its own
69
 
70
  ```sh
71
+ git clone https://github.com/fluxions-ai/vui
72
+ cd vui
73
+ uv sync
74
+ python demo.py # Gradio UI: upload your own voice prompt
75
+ python demo.py --render --prompt prompts/abraham.wav # CLI render with a preset voice
76
+ ```
77
+
78
+ The Vui checkpoint and Qwen codec download automatically from this repo on first run.
79
+
80
+ ### Preset voices
81
+
82
+ | Voice | Description |
83
+ |---|---|
84
+ | `maeve` | Recommended default. Female Irish accent; beautiful, but may be hard for non-UK listeners to follow |
85
+ | `abraham` | British, well-spoken, exciting energy and personality; conscientious, good at emotionally difficult subjects |
86
+ | `rhian` | More traditional British accent, slightly hesitant speaking style |
87
+ | `harry` | British male accent, mumbly |
88
+
89
+ More personalities coming soon! Got a voice or character you'd like to hear? Open an issue or let us know on [Discord](https://discord.fluxions.ai).
90
+
91
+ ### Python API
92
+
93
+ ```python
94
+ from vui.engine import Engine, GenConfig
95
+
96
+ engine = Engine.from_checkpoint("vui-nano.safetensors")
97
+ with engine.new_row() as row:
98
+     audio = row.render(
99
+         "So [breath] the thing about this is, it's not what you'd expect, right?",
100
+         GenConfig(temperature=0.7),
101
+     )
102
+ ```
103
+
104
+ **Tip: try turning repetition penalty off.** `GenConfig` defaults to `rep_penalty=1.1` to break long silence/filler loops, but the penalty can flatten prosody and distort natural repetition. Setting it to `0` (any value `<= 1.0` disables the penalty path) often gives more natural-sounding output, so it is worth trying if generations sound stilted or over-corrected.
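
To make the effect of the penalty concrete, here is a minimal sketch of the common repetition-penalty rule (divide positive logits, multiply negative ones, for tokens already generated). Only the `rep_penalty` name, the `1.1` default, and the "`<= 1.0` disables it" behaviour come from this README; whether Vui applies exactly this rule is an assumption, and `apply_rep_penalty` is a hypothetical helper.

```python
def apply_rep_penalty(logits, history, rep_penalty=1.1):
    """Return a copy of `logits` with already-generated token ids penalised.

    Illustrative only: a generic repetition penalty, not Vui's code.
    """
    out = list(logits)
    if rep_penalty <= 1.0:
        # Disabled: 0, 1.0 and anything in between leave the logits untouched.
        return out
    for tok in set(history):
        if out[tok] > 0:
            out[tok] /= rep_penalty  # positive logits shrink toward zero
        else:
            out[tok] *= rep_penalty  # negative logits become more negative
    return out
```

With the penalty active, tokens that legitimately repeat (a held vowel, a natural "yeah, yeah") also lose probability mass, which is one way the flattening described above can arise.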
105
 
106
+ For long voice prompts (>15s) you need proper multi-segment chunking: `vui.prompt_utils.build_prompt_segments` does ASR + forced alignment + sentence-boundary splits at ~10s targets so the model keeps its speaker conditioning across the full reference. The full Python guide covers chunked prompts, streaming, continuous batching, codes-only decode, and the MLX path: [`docs/python-api.md`](https://github.com/fluxions-ai/vui/blob/main/docs/python-api.md).
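
The sentence-boundary splitting step can be pictured with a small sketch. It assumes ASR and forced alignment have already produced sentences with start/end times, and greedily packs them into segments of roughly 10 seconds. `Sentence` and `pack_segments` are hypothetical names for illustration; this is not the code behind `build_prompt_segments`.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    start: float  # seconds, from forced alignment (assumed input)
    end: float


def pack_segments(sentences, target_s=10.0):
    """Greedily group consecutive sentences so each segment spans <= target_s.

    A single sentence longer than target_s still becomes its own segment,
    because splits only ever happen on sentence boundaries.
    """
    segments, current = [], []
    for s in sentences:
        if current and s.end - current[0].start > target_s:
            segments.append(current)
            current = []
        current.append(s)
    if current:
        segments.append(current)
    return segments
```

Splitting only on sentence boundaries keeps each segment's transcript aligned with its audio, which matters because the prompt transcript must match the audio word-for-word.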
107
 
108
+ ## Vui Nano
 
 
109
 
110
+ A 300M autoregressive LM over the Qwen3-TTS speech codec, the first in the Vui model family. The codec and speaker encoder are reused from Alibaba's [`Qwen3-TTS-12Hz-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base).
111
 
112
+ - **300M parameters**, Llama-style decoder + RQ-Transformer head: 768 dim, 22 layers, 8 heads
113
+ - **Codec**: [Qwen3-TTS-Tokenizer-12Hz](https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz), 16 codebooks of 2048 entries at 12.5 Hz, 24 kHz audio (decoded), pure-PyTorch reimplementation in `src/vui/qwen_codec.py`
114
+ - **Speaker encoder**: ECAPA-TDNN from `Qwen3-TTS-12Hz-0.6B-Base` (8.9M params, 1024-dim), used at training time to embed reference speakers
115
+ - **Output**: 24 kHz audio, bf16 inference, ~9× realtime streaming on a 4090
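
The codec figures above imply a few derived numbers that are handy when reasoning about sequence lengths. The arithmetic below uses only the values stated in the list (16 codebooks, 2048 entries, 12.5 Hz, 24 kHz); the variable names are just for illustration.

```python
import math

FRAME_RATE_HZ = 12.5     # codec frames per second
N_CODEBOOKS = 16         # RVQ levels per frame
CODEBOOK_SIZE = 2048     # entries per codebook
SAMPLE_RATE_HZ = 24_000  # decoded audio sample rate

bits_per_code = int(math.log2(CODEBOOK_SIZE))        # 11 bits index a 2048-entry codebook
codes_per_second = FRAME_RATE_HZ * N_CODEBOOKS       # 200.0 codes per second of audio
bitrate_bps = codes_per_second * bits_per_code       # 2200.0 bits per second
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ   # 1920.0 audio samples per frame
```

So each second of speech is 12.5 frames of 16 codes each, and every frame decodes to 1920 audio samples (80 ms).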
116
 
117
+ ### Voices & voice cloning
118
 
119
+ **The model can clone arbitrary voices**: upload a sample in the demo UI (or drop a `.wav` into `prompts/`) and it will follow that speaker. **Cloned voices won't sound as good as the four fine-tuned voices** (`maeve`, `abraham`, `rhian`, `harry`) shipped in `prompts/`: the released checkpoint has been fine-tuned on those four, so they're the highest-quality output the model can produce. Arbitrary clones work, but expect lower naturalness, more drift, and some bias toward the fine-tuned speakers' prosody.
120
+
121
+ For best results, the voice-prompt transcript must match the audio word-for-word; aim for **30 seconds or more** of clean source audio (the context window is 6 minutes), and remember: garbage in, garbage out. The full guide covers voice prompts, supported tags ([breath], [laugh], [sigh] …), punctuation rules, and phonetic spelling for numbers/dates/units: [`docs/prompting.md`](https://github.com/fluxions-ai/vui/blob/main/docs/prompting.md).
122
+
123
+ If you need a checkpoint tuned to a specific voice for a legitimate use case (audiobooks, accessibility, game characters, dubbing of consenting performers, internal tooling), **get in touch** via [fluxions.ai](https://fluxions.ai); we can train, license, or host one for you.
124
+
125
+ ## Hardware
126
+
127
+ The streaming server and `demo.py` both run on either of:
127
+ - **NVIDIA GPU + Linux**: ~**12 GB VRAM** for the full stack (TTS + ASR + Ollama LLM; tested on 4090 / H100), dropping to **~8 GB** if you switch to a `moonshine.*` (CPU) ASR backend. Requires CUDA 12.x with flash-attn installed.
128
+ - **Apple Silicon Mac**: M1/M2/M3/M4, MLX backend (auto-detected, no flash-attn required).
130
+
131
+ The full breakdown (measured per-component VRAM, ASR latency/VRAM per backend, KV-cache math, and tuning levers) is in [`docs/memory-budget.md`](https://github.com/fluxions-ai/vui/blob/main/docs/memory-budget.md).
132
+
133
+ **Tip: drop `n_codebooks` for faster TTS on smaller GPUs.** The RQ-Transformer head decodes 16 RVQ codebook levels per audio frame by default. Dropping the **Codebooks** slider in the UI (or `n_codebooks` in `DEFAULT_SETTINGS`) to **~10** gives noticeably faster decode and lower VRAM at the cost of some stability: occasional artefacts and more sensitivity to hard prompts. Below 8, quality drops sharply. `0` means "use all 16".
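
A back-of-the-envelope view of why this helps: if the head takes one decode step per codebook level per frame (an assumption about the decode loop, not something measured here), the number of head steps per second of audio scales linearly with `n_codebooks`. The helpers below are hypothetical and only encode the "`0` means all 16" convention and the 12.5 Hz frame rate stated in this README.

```python
TOTAL_CODEBOOKS = 16
FRAME_RATE_HZ = 12.5


def effective_codebooks(n_codebooks):
    """Resolve the setting: 0 means 'use all 16' levels."""
    return TOTAL_CODEBOOKS if n_codebooks == 0 else n_codebooks


def head_steps_per_audio_second(n_codebooks):
    """Head decode steps needed per second of generated audio (assumed 1 step/level)."""
    return effective_codebooks(n_codebooks) * FRAME_RATE_HZ


full = head_steps_per_audio_second(0)      # all 16 levels
reduced = head_steps_per_audio_second(10)  # the suggested ~10 levels
relative_cost = reduced / full             # fraction of head work remaining
```

Under that assumption, 10 codebooks needs 62.5% of the head steps of the full 16. Real speedups will differ, since the backbone decoder and the codec decode still run once per frame.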
134
+
135
+ ## Responsible use
136
+
137
+ Vui generates speech that can sound convincingly human. By using this model (directly, through the streaming server, or through the realtime API), you agree to the following:
138
+
139
+ We **explicitly prohibit**:
140
+
141
+ - **Fraud**: generating speech to deceive others for financial gain or to obtain something you would not otherwise be entitled to (scam calls, voice-auth bypass, etc.).
142
+ - **Misinformation or deception**: fake news, fraudulent calls, deepfakes intended to mislead, synthetic media presented as authentic recordings of real people.
143
+ - **Harassment, defamation, or abuse**: generating speech that targets, threatens, or harms others, including non-consensual sexual content.
144
+ - **Illegal activity**: anything unlawful in the jurisdiction where the model is run or its output is distributed.
145
+
146
+ You are responsible for what you generate. The released checkpoint is fine-tuned to a curated voice set in part to make these misuses harder, but it is not a substitute for your own judgment. If you build a product on top of Vui, build in consent flows, content provenance (e.g. [C2PA](https://c2pa.org/)), and abuse reporting.
147
+
148
+ We are **not responsible** for misuse, and we strongly condemn unethical applications of this technology.
149
+
150
+ ## Attributions
151
+
152
+ - [Qwen3-TTS-Tokenizer](https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz), Alibaba
153
+ - [Whisper](https://github.com/openai/whisper), OpenAI
154
+ - [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
155
+ - [Moonshine](https://github.com/usefulsensors/moonshine), Useful Sensors (CPU-streaming ASR option)
156
+ - [Silero VAD](https://github.com/snakers4/silero-vad)
157
+ - [aiortc](https://github.com/aiortc/aiortc)
158
+ - [Ollama](https://ollama.com)
159
+
160
+ ## License
161
+
162
+ Apache 2.0. This applies to the code in the [GitHub repo](https://github.com/fluxions-ai/vui) and the released model weights. The Qwen3-TTS-Tokenizer-12Hz codec and `Qwen3-TTS-12Hz-0.6B-Base` speaker encoder are © Alibaba and licensed under the terms in their respective Hugging Face repos.
163
+
164
+ ## Citation
165
+
166
+ ```bibtex
167
+ @software{vui_2026,
168
+ author = {Coultas Blum, Harry},
169
+ title = {Vui: Streaming Conversational Text-to-Speech},
170
+ url = {https://github.com/fluxions-ai/vui},
171
+ version = {1.0.0},
172
+ year = {2026}
173
+ }
174
+ ```