@@ -1,42 +1,174 @@
 ---
-license: mit
 language:
 - en
 pipeline_tag: text-to-speech
 library_name: vui
 ---
-# vui
-[DEMO](https://fluxions.ai)
-https://github.com/fluxions-ai/vui
-Small Conversational speech models that can run on device
-# Installation
 ```sh
-uv pip install -e .
 ```
-# Demo
 ```sh
-python demo.py
-````
-# Models
-Vui.BASE is base checkpoint trained on 40k hours of audio conversations
-Vui.ABRAHAM is a single speaker model that can reply with context awareness.
-Vui.COHOST is checkpoint with two speakers that can talk to each other.
-# Voice Cloning
-You can clone with the base model quite well but it's not perfect as hasn't seen that much audio / wasn't trained for long
-# FAQ
-1) Was developed with on two 4090's https://x.com/harrycblum/status/1752698806184063153
-2) Hallucinations: yes the model does hallucinate, but this is the best I could do with limited resources! :(

 ---
+license: apache-2.0
 language:
 - en
 pipeline_tag: text-to-speech
 library_name: vui
+tags:
+- text-to-speech
+- tts
+- speech-synthesis
+- voice-cloning
+- streaming
+- conversational
+- qwen3-tts
 ---
+<p align="center">
+  <a href="https://fluxions.ai"><img src="https://raw.githubusercontent.com/fluxions-ai/vui/main/docs/fxlogo.png" alt="fluxions.ai" height="64"></a>
+</p>
+<h1 align="center">Vui — Streaming Conversational Voice Assistant</h1>
+<p align="center"><em>Pronounced "vooey"</em> (rhymes with <em>Louie</em>) · by <a href="https://fluxions.ai">fluxions.ai</a></p>
+<p align="center">
+  <a href="https://github.com/fluxions-ai/vui"><img src="https://img.shields.io/badge/GitHub-fluxions--ai%2Fvui-181717?logo=github&style=for-the-badge" alt="GitHub"></a>
+  <a href="https://discord.fluxions.ai"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&logoColor=white&style=for-the-badge" alt="Discord"></a>
+</p>
+👉 **Full code, install, docs, and the streaming voice assistant: [github.com/fluxions-ai/vui](https://github.com/fluxions-ai/vui)**
+📖 **[Launch blog post](https://fluxions.ai/blog/vui-launch)** — design notes, demos, and what's next.
+Vui is a real-time voice assistant: you speak into your mic, Vui transcribes your speech, runs a local LLM, and streams a TTS reply back, all from a single Python server. It is built around **Vui Nano**, a 300M-parameter speech transformer built on the Qwen3-TTS codec and trained on conversational speech with breaths, laughter, hesitations, and multi-speaker dialogue.
+## Features
+- **Vui Nano (300M)** — Llama-style decoder + RQ-Transformer head over the Qwen3-TTS-12Hz codec
+- **Real-time voice loop** — WebRTC + WebSocket pipeline (ASR → LLM → TTS) with a browser UI, VAD-driven turn taking, speculative LLM prefill while you're still speaking, sentence-level TTS chunking with backpressure
+- **Barge-in** — start talking mid-reply, the model cancels and listens
+- **Streaming TTS** — ~9× realtime on a 4090, bf16 inference, CUDA graphs
+- **OpenAI Realtime API compatible** — drop-in `ws://…/v1/realtime` for clients written against OpenAI's spec ([`docs/realtime-api.md`](https://github.com/fluxions-ai/vui/blob/main/docs/realtime-api.md))
+- **One-shot voice-note REST endpoint** — `POST /v1/voice-note` runs the whole ASR → LLM → TTS pipeline in a single HTTP call (audio in, JSON out)
+- **Standalone TTS demo** — `demo.py` Gradio playground for the model on its own
+- **Voice cloning** — upload an audio sample to clone any speaker; 4 fine-tuned presets shipped (`maeve`, `abraham`, `rhian`, `harry`)
+- **SQ / WPS conditioning** — bias generation on six speech-quality channels and words-per-second
+- **Hot-swap models** — pick Ollama LLM and ASR backend live from the UI
+- **Pluggable ASR** — faster-whisper (GPU) or Moonshine (CPU streaming, ONNX)
+- **Pluggable LLM backends** — Ollama, vLLM, any OpenAI-compatible endpoint
+- **Memories** — assistant remembers facts about you across sessions
+- **Thoughts stream** — parallel LLM routes voice intent to ~10 tools (memory ops, task control, delegation) without a wake-word grammar; pluggable for your own local tools
+- **Optional Claude task server** — sidecar agent that handles slow/agentic work (Gmail, Calendar, Drive, Slack, web search) via your existing Claude Code MCPs
+- **Apple Silicon support** — MLX backend (WIP)
+- **Mobile-ready** — documented cloudflared and Tailscale paths for phone access with mic over HTTPS
+- **Docker compose** — one file brings up the full stack
+- **OpenClaw integration** — point OpenClaw's `openai` realtime provider at Vui for a fully-local voice front-end
+## Install (one-liner)
 ```sh
+curl -fsSL https://install.fluxions.ai | bash
 ```
+The installer clones the repo into `~/vui`, auto-detects Docker vs. native, installs dependencies (uv, Ollama, ffmpeg, Claude Code CLI), pulls the model from this repo, and launches the stack on <http://localhost:8080>.
+Full Docker compose / native install, mobile setup, configuration, ASR options, and the Claude task server — all in the **[GitHub README](https://github.com/fluxions-ai/vui)**.
+## TTS demo on its own
 ```sh
+git clone https://github.com/fluxions-ai/vui
+cd vui
+uv sync
+python demo.py                                          # Gradio UI — upload your own voice prompt
+python demo.py --render --prompt prompts/abraham.wav    # CLI render with a preset voice
+```
+The Vui checkpoint and Qwen codec download automatically from this repo on first run.
+### Preset voices
+| Voice | Description |
+|---|---|
+| `maeve` | Recommended default: female Irish accent, beautiful but may be hard for non-UK listeners to understand |
+| `abraham` | British, well-spoken, exciting energy and personality — conscientious, good at emotionally difficult subjects |
+| `rhian` | More traditional British accent, slightly hesitant speaking style |
+| `harry` | British male accent, mumbly |
+More personalities coming soon! Got a voice or character you'd like to hear? Open an issue or let us know on [Discord](https://discord.fluxions.ai).
+### Python API
+```python
+from vui.engine import Engine, GenConfig
+engine = Engine.from_checkpoint("vui-nano.safetensors")
+with engine.new_row() as row:
+    audio = row.render(
+        "So [breath] the thing about this is, it's not what you'd expect, right?",
+        GenConfig(temperature=0.7),
+    )
+```
+**Tip: try turning repetition penalty off.** `GenConfig` defaults to `rep_penalty=1.1` to break long silence/filler loops, but the penalty can flatten prosody and distort natural repetition. Setting it to `0` (any value `<= 1.0` disables the penalty path) often gives more natural-sounding output, so it is worth trying if generations sound stilted or over-corrected.
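To make the tip concrete, here is a minimal sketch of how a repetition penalty of this kind commonly works. It uses the widely used divide-or-multiply formulation in plain Python and is not taken from Vui's source: the helper name `apply_rep_penalty`, its arguments, and the exact update rule are assumptions for illustration. Only the default of `1.1` and the "`<= 1.0` disables it" behaviour come from the text above.

```python
def apply_rep_penalty(logits, recent_tokens, rep_penalty=1.1):
    """Hypothetical helper: push down the scores of recently generated tokens.

    logits        -- list of floats, one score per vocabulary entry
    recent_tokens -- token ids generated so far (duplicates are fine)
    rep_penalty   -- values <= 1.0 disable the penalty entirely
    """
    if rep_penalty <= 1.0:
        # Penalty path disabled: return the scores untouched.
        return list(logits)

    out = list(logits)
    for tok in set(recent_tokens):
        # Positive scores shrink, negative scores become more negative,
        # so a repeated token always becomes less likely.
        if out[tok] > 0:
            out[tok] = out[tok] / rep_penalty
        else:
            out[tok] = out[tok] * rep_penalty
    return out


logits = [2.0, -1.0, 0.5, 3.0]
penalised = apply_rep_penalty(logits, recent_tokens=[0, 1, 1], rep_penalty=1.1)
disabled = apply_rep_penalty(logits, recent_tokens=[0, 1, 1], rep_penalty=0)
print(penalised)
print(disabled)
```

With a rule like this, tokens that legitimately repeat (fillers, laughter, repeated words) are penalised just like runaway loops, which is one plausible reason the output can sound over-corrected.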
+For long voice prompts (>15s) you need proper multi-segment chunking — `vui.prompt_utils.build_prompt_segments` does ASR + forced alignment + sentence-boundary splits at ~10s targets so the model keeps its speaker conditioning across the full reference. Full Python guide covering chunked prompts, streaming, continuous batching, codes-only decode, and the MLX path: [`docs/python-api.md`](https://github.com/fluxions-ai/vui/blob/main/docs/python-api.md).
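The splitting step described above can be sketched as a greedy grouping of aligned sentences. This is not the implementation of `vui.prompt_utils.build_prompt_segments`: the function `group_sentences`, its input format, and the greedy rule are assumptions, and the sentence timestamps are taken as given here rather than produced by ASR and forced alignment.

```python
def group_sentences(sentences, target_s=10.0):
    """Hypothetical sketch: pack aligned sentences into segments of roughly target_s.

    sentences -- ordered list of (text, start_s, end_s) tuples
    Returns a list of (joined_text, start_s, end_s) segments. Splits only
    happen at sentence boundaries, so a single long sentence stays whole.
    """
    segments = []
    current = []
    for text, start, end in sentences:
        # Close the running segment if adding this sentence would exceed the target.
        if current and end - current[0][1] > target_s:
            segments.append(current)
            current = []
        current.append((text, start, end))
    if current:
        segments.append(current)
    return [
        (" ".join(t for t, _, _ in seg), seg[0][1], seg[-1][2])
        for seg in segments
    ]


aligned = [
    ("Hi there.", 0.0, 3.0),
    ("How are you?", 3.0, 7.5),
    ("I was thinking about it.", 7.5, 12.0),
    ("Anyway.", 12.0, 14.0),
    ("Let's go.", 14.0, 18.5),
]
segments = group_sentences(aligned, target_s=10.0)
for text, start, end in segments:
    print(f"{start:5.1f}-{end:5.1f}s  {text}")
```

Each resulting segment pairs a transcript with its own audio span, which is the property the word-for-word transcript requirement depends on.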
+## Vui Nano
+Vui Nano is a 300M-parameter autoregressive LM over the Qwen3-TTS speech codec and the first model in the Vui family. The codec and speaker encoder are reused from Alibaba's [`Qwen3-TTS-12Hz-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base).
+- **300M parameters**, Llama-style decoder + RQ-Transformer head — 768 dim, 22 layers, 8 heads
+- **Codec**: [Qwen3-TTS-Tokenizer-12Hz](https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz) — 16 codebooks of 2048 entries at 12.5 Hz, 24 kHz audio (decoded), pure-PyTorch reimplementation in `src/vui/qwen_codec.py`
+- **Speaker encoder**: ECAPA-TDNN from `Qwen3-TTS-12Hz-0.6B-Base` (8.9M params, 1024-dim) — used at training time to embed reference speakers
+- **Output**: 24 kHz audio, bf16 inference, ~9× realtime streaming on a 4090
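The figures in the list above imply a few derived quantities that are handy when reasoning about latency and sequence length. The short calculation below uses only the numbers stated in the list; the variable names are illustrative, and the bitrate is the raw size of the code stream, not a measured value.

```python
import math

FRAME_RATE_HZ = 12.5      # codec frames per second
N_CODEBOOKS = 16          # codes emitted per frame
CODEBOOK_SIZE = 2048      # entries per codebook
SAMPLE_RATE_HZ = 24_000   # decoded audio sample rate
REALTIME_FACTOR = 9.0     # approximate streaming speed quoted for a 4090

bits_per_code = math.log2(CODEBOOK_SIZE)             # 11.0 bits to index 2048 entries
codes_per_second = FRAME_RATE_HZ * N_CODEBOOKS       # 200.0 codes per second of audio
raw_bitrate_bps = codes_per_second * bits_per_code   # 2200.0 bits per second
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ   # 1920.0 audio samples per frame
frame_ms = 1000 / FRAME_RATE_HZ                      # 80.0 ms of audio per frame
render_time_10s = 10 / REALTIME_FACTOR               # seconds to generate 10 s of audio

print(f"{codes_per_second:.0f} codes/s, {raw_bitrate_bps:.0f} bit/s")
print(f"{frame_ms:.0f} ms per frame, {samples_per_frame:.0f} samples per frame")
print(f"10 s of audio in about {render_time_10s:.2f} s at 9x realtime")
```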
+### Voices & voice cloning
+**The model can clone arbitrary voices** — upload a sample in the demo UI (or drop a `.wav` into `prompts/`) and it will follow that speaker. **Cloned voices won't sound as good as the four fine-tuned voices** (`maeve`, `abraham`, `rhian`, `harry`) shipped in `prompts/` — the released checkpoint has been fine-tuned on those four, so they're the highest-quality output the model can produce. Arbitrary clones work but expect lower naturalness, more drift, and some bias toward the fine-tuned speakers' prosody.
+For best results, the voice-prompt transcript must match the audio word for word, and you should aim for **30 seconds or more** of clean source audio (the context window is 6 minutes). Remember: garbage in, garbage out. Full guide on voice prompts, supported tags ([breath], [laugh], [sigh] …), punctuation rules, and phonetic spelling for numbers/dates/units: [`docs/prompting.md`](https://github.com/fluxions-ai/vui/blob/main/docs/prompting.md).
+If you need a checkpoint tuned to a specific voice for a legitimate use case (audiobooks, accessibility, game characters, dubbing of consenting performers, internal tooling), **get in touch** via [fluxions.ai](https://fluxions.ai) — we can train, license, or host one for you.
+## Hardware
+The streaming server and `demo.py` both run on either of the following:
+- **NVIDIA GPU + Linux** — ~**12 GB VRAM** for the full stack (TTS + ASR + Ollama LLM; tested on 4090 and H100), dropping to **~8 GB** if you switch to a `moonshine.*` (CPU) ASR backend. Requires CUDA 12.x with flash-attn installed.
+- **Apple Silicon Mac** — M1/M2/M3/M4, MLX backend (auto-detected, no flash-attn required).
+Full breakdown — measured per-component VRAM, ASR latency/VRAM per backend, KV-cache math, and tuning levers — is in [`docs/memory-budget.md`](https://github.com/fluxions-ai/vui/blob/main/docs/memory-budget.md).
+**Tip: drop `n_codebooks` for faster TTS on smaller GPUs.** The RQ-Transformer head decodes 16 RVQ codebook levels per audio frame by default. Lowering the **Codebooks** slider in the UI (or `n_codebooks` in `DEFAULT_SETTINGS`) to **~10** gives noticeably faster decode and lower VRAM at the cost of some stability: occasional artefacts and more sensitivity to hard prompts. Below 8, quality drops sharply. `0` means "use all 16".
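As a rough guide to what the setting changes, the sketch below counts how many codes the head has to produce per second of audio for a given `n_codebooks`, including the "`0` means all 16" convention. The helper names are hypothetical and the count is only a proxy: actual speed and VRAM savings depend on the rest of the pipeline and are not modelled here.

```python
FRAME_RATE_HZ = 12.5   # codec frames per second
MAX_CODEBOOKS = 16     # full RVQ depth


def effective_codebooks(n_codebooks):
    """Hypothetical helper: resolve the setting, treating 0 as 'use all 16'."""
    if n_codebooks == 0:
        return MAX_CODEBOOKS
    if not 1 <= n_codebooks <= MAX_CODEBOOKS:
        raise ValueError(f"n_codebooks must be 0 or 1..{MAX_CODEBOOKS}")
    return n_codebooks


def codes_per_second(n_codebooks):
    """Codes the head decodes for each second of generated audio."""
    return FRAME_RATE_HZ * effective_codebooks(n_codebooks)


full = codes_per_second(0)
for n in (0, 16, 10, 8):
    rate = codes_per_second(n)
    print(f"n_codebooks={n:2d}: {rate:5.1f} codes/s ({rate / full:.1%} of full depth)")
```

At the suggested setting of about 10, the head decodes 125 codes per second instead of 200, or 62.5% of the full-depth work per frame.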
+## Responsible use
+Vui generates speech that can sound convincingly human. By using this model — directly, through the streaming server, or through the realtime API — you agree to the following:
+We **explicitly prohibit**:
+- **Fraud** — generating speech to deceive others for financial gain or to obtain something you would not otherwise be entitled to (scam calls, voice-auth bypass, etc.).
+- **Misinformation or deception** — fake news, fraudulent calls, deepfakes intended to mislead, synthetic media presented as authentic recordings of real people.
+- **Harassment, defamation, or abuse** — generating speech that targets, threatens, or harms others, including non-consensual sexual content.
+- **Illegal activity** — anything unlawful in the jurisdiction where the model is run or its output is distributed.
+You are responsible for what you generate. The released checkpoint is fine-tuned to a curated voice set in part to make these misuses harder, but it is not a substitute for your own judgment. If you build a product on top of Vui, build in consent flows, content provenance (e.g. [C2PA](https://c2pa.org/)), and abuse reporting.
+We are **not responsible** for misuse, and we strongly condemn unethical applications of this technology.
+## Attributions
+- [Qwen3-TTS-Tokenizer](https://huggingface.co/Qwen/Qwen3-TTS-Tokenizer-12Hz) — Alibaba
+- [Whisper](https://github.com/openai/whisper) — OpenAI
+- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
+- [Moonshine](https://github.com/usefulsensors/moonshine) — Useful Sensors (CPU-streaming ASR option)
+- [Silero VAD](https://github.com/snakers4/silero-vad)
+- [aiortc](https://github.com/aiortc/aiortc)
+- [Ollama](https://ollama.com)
+## License
+Apache 2.0 — applies to the code in the [GitHub repo](https://github.com/fluxions-ai/vui) and the released model weights. The Qwen3-TTS-Tokenizer-12Hz codec and `Qwen3-TTS-12Hz-0.6B-Base` speaker encoder are © Alibaba and licensed under the terms in their respective Hugging Face repos.
+## Citation
+```bibtex
+@software{vui_2026,
+  author  = {Coultas Blum, Harry},
+  title   = {Vui: Streaming Conversational Text-to-Speech},
+  url     = {https://github.com/fluxions-ai/vui},
+  version = {1.0.0},
+  year    = {2026}
+}
+```