PrimeTTS — tiny bilingual zh‑TW + English TTS (24 kHz, CPU)
A 4.63M‑parameter Mandarin (Taiwan) + English text‑to‑speech model that runs entirely on CPU and emits 24 kHz audio — sized for on‑device (Jetson‑class) and contact‑centre / GPS / transit use. One model, one young‑female voice: Chinese, English, and code‑mix through a single frontend (no language routing). Built for entity correctness — phone numbers, emails, addresses, prices, dates, temperatures, percentages, serial numbers, and a broad bank of Taiwan/world named entities.
🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1 · 🧩 Base: owensong/Inflect-Nano-v1 (warm‑started fine‑tune, same frozen architecture)
| Property | Value |
|---|---|
| Parameters | 4.63M (3.47M acoustic + 1.17M vocoder) |
| Sample rate | 24 kHz |
| Runtime | onnxruntime, CPU‑only, torch‑free at inference |
| Languages | zh‑TW (Traditional) + English + code‑mix, single voice |
| Voice | young female, Taiwan‑Mandarin accent |
| Architecture | FastSpeech‑style (no attention) + Snake‑HiFiGAN — frozen, no NAS |
| License | Apache‑2.0 |
Held‑out quality (eval_big, 36 unseen phone‑attendant sentences)
| metric | this model (24 kHz, CC0) | prior 8 kHz release |
|---|---|---|
| zh‑CER (Breeze‑ASR‑25) | 0.106 | 0.090 |
| code‑mix CER | 0.096 | 0.178 |
| en‑WER (Whisper) | 0.083 | 0.083 |
| Taiwan‑accent gap¹ | +0.044 | +0.088 |
| SQUIM PESQ | 3.22 | 3.31 |
| SQUIM MOS | 4.40 | 4.24 |
This 24 kHz release delivers a large code‑mix gain (0.178 → 0.096), higher MOS (4.24 → 4.40), 24 kHz clarity, a CC0 / commercially‑clear reference voice, and a much larger entity‑coverage corpus (≈30k clips vs 6.6k). The shipped checkpoint is the 60k step — the 2D mel‑GAN both sharpens the acoustic mel (clarity) and converges to the best held‑out intelligibility by 60k (a mid‑training 40k candidate dipped to zh‑CER 0.134 before the GAN settled).
¹ CER(generic ASR) − CER(Taiwan‑tuned Breeze‑ASR‑25) per zh clip; >0 ⇒ a Taiwan‑tuned recognizer
understands it better ⇒ genuine Taiwan accent present.
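As a minimal sketch of the accent-gap footnote, the snippet below computes a character error rate from a plain Levenshtein distance and subtracts the Taiwan-tuned recognizer's CER from the generic recognizer's CER for one clip. The function names are illustrative and are not part of the repository; the actual evaluation script may normalize text differently before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: strings of characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def accent_gap(ref, hyp_generic, hyp_taiwan):
    """CER(generic ASR) minus CER(Taiwan-tuned ASR) for one zh clip."""
    return cer(ref, hyp_generic) - cer(ref, hyp_taiwan)
```

A positive value means the Taiwan-tuned recognizer made fewer character errors on that clip than the generic one.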
Quickstart (inference, CPU)
pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS
# from inside the PrimeTTS dir (uses the bundled frontend + scripts)
import sys; sys.path.insert(0, "scripts")
import json, numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate # numpy length‑regulator
meta = json.load(open("meta.json"))
enc = ort.InferenceSession("acoustic_encoder.onnx", providers=["CPUExecutionProvider"])
dec = ort.InferenceSession("acoustic_decoder.onnx", providers=["CPUExecutionProvider"])
voc = ort.InferenceSession("vocoder.onnx", providers=["CPUExecutionProvider"])
o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.") # text -> phone/tone/lang ids
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn,
"lang": lg, "speaker": np.zeros(1, np.int64)})
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
mel = dec.run(None, {k: reg[k] for k in
["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])
The whole pipeline (acoustic_encoder.onnx → numpy length‑regulator → acoustic_decoder.onnx → vocoder.onnx) is torch‑free and runs as‑is on a Jetson Nano CPU. See scripts/synth_from_text.py for the full runtime.
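To illustrate what a numpy length regulator does between the encoder and the decoder, here is a simplified sketch that repeats each phone's feature vector for its predicted number of frames and pads the result to a fixed length. It is not the bundled `host_regulate` (which also produces pitch, position, and context inputs); `length_regulate` and its argument names are assumptions made for this example.

```python
import numpy as np


def length_regulate(phone_feats, durations, max_frames):
    """Expand phone-level features to frame level.

    phone_feats: (n_phones, dim) array; durations: (n_phones,) frame counts.
    Returns (frames, mask): frames is (max_frames, dim), zero-padded or
    truncated; mask marks the frames that hold real content.
    """
    # Round to whole frames and clamp negatives to zero.
    durations = np.maximum(np.rint(durations).astype(np.int64), 0)
    # Repeat row i of phone_feats durations[i] times, then truncate.
    frames = np.repeat(phone_feats, durations, axis=0)[:max_frames]
    n = frames.shape[0]
    padded = np.zeros((max_frames, phone_feats.shape[1]), dtype=phone_feats.dtype)
    padded[:n] = frames
    mask = np.zeros(max_frames, dtype=bool)
    mask[:n] = True
    return padded, mask
```

Because the expansion is a plain `np.repeat`, no deep-learning framework is needed at inference time, which is consistent with the torch-free claim above.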
Training data
Everything is distilled from a single teacher voice so zh / en / code‑mix share one timbre and accent.
- Reference voice: a young Taiwan‑female speaker from Mozilla Common Voice zh‑TW, released CC0 / public domain (commercial‑use and voice‑cloning clear). The ~13 s reference is assembled from that one speaker's cleanest validated clips. This fixes the accent (Taiwan Mandarin comes from the reference, not from prompting) and keeps the model commercially shippable, with no proprietary or voice‑likeness encumbrance.
- Teacher: VoxCPM2 (openbmb/VoxCPM2) voice‑clones that one reference for every line, giving a consistent young‑female voice across zh, en, and code‑mix (48 kHz, resampled to 24 kHz for training).
- Text: Taiwan office / phone‑attendant / GPS / transit register: diverse Mandarin, general + domain English, and frame‑bank code‑mix with English in varied positions, plus a large named‑entity bank: Taiwan place & road names, transit stations, top Taiwan + world companies, famous people (TW + world), movies, electronics products, and time/date/metric expressions.
- Entity normalization (text_norm.py, applied identically to teacher text and at inference) gives consistent readings for phone numbers, extensions, email addresses, street addresses, prices, dates (zh + en), times, temperatures (°C), percentages, decimals, counts, and serial numbers. The reading (digit‑by‑digit, cardinal, or ordinal) is chosen by entity and language context.
- ASR quality gate: generic clips are transcribed and kept only if they match their text, using a Taiwan‑tuned recognizer so the gate never penalizes the accent we want. Proper‑noun‑heavy coverage clips are trusted unfiltered, since ASR mangles proper nouns.
  - zh & code‑mix → Breeze‑ASR‑25 Han‑level CER
  - English → Whisper‑medium WER
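The entity-normalization bullet above distinguishes digit-by-digit from cardinal readings. The sketch below shows that distinction for two cases only: phone or serial numbers read digit by digit, and small Mandarin counts read as cardinals. It is an illustration, not the repository's text_norm.py; the entity kinds, function names, and the 0 to 9999 range are assumptions.

```python
ZH_DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))
EN_DIGITS = dict(zip("0123456789", ["zero", "one", "two", "three", "four",
                                     "five", "six", "seven", "eight", "nine"]))


def read_digits(number, lang):
    """Read a digit string one digit at a time, skipping separators like '-'."""
    digits = [c for c in number if c.isdigit()]
    if lang == "zh":
        return "".join(ZH_DIGITS[d] for d in digits)
    return " ".join(EN_DIGITS[d] for d in digits)


def zh_cardinal(n):
    """Mandarin cardinal reading for 0..9999 (sketch; larger units omitted)."""
    if not 0 <= n <= 9999:
        raise ValueError("sketch only covers 0..9999")
    if n == 0:
        return "零"
    units = ["", "十", "百", "千"]
    digits = str(n)
    out, pending_zero = [], False
    for pos, ch in enumerate(digits):
        power = len(digits) - 1 - pos
        if ch == "0":
            pending_zero = True          # remember an internal gap
            continue
        if pending_zero and out:
            out.append("零")             # one 零 per run of internal zeros
        pending_zero = False
        out.append(ZH_DIGITS[ch] + units[power])
    text = "".join(out)
    # 10..19 are read 十..十九 rather than 一十..一十九.
    return text[1:] if text.startswith("一十") else text


def read_entity(kind, value, lang="zh"):
    """Pick the reading style from the (hypothetical) entity kind."""
    if kind in {"phone", "serial"}:
        return read_digits(value, lang)
    if kind == "count" and lang == "zh":
        return zh_cardinal(int(value))
    raise NotImplementedError(f"no rule for {kind!r} in {lang!r}")
```

Applying the same function to teacher text and to inference input is what keeps the spoken reading of an entity identical between training and deployment.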
| split | clips (post‑gate) |
|---|---|
| pure Chinese | 11,842 |
| code‑mix (zh+en) | 13,422 |
| pure English | 4,283 |
| total | 29,547 |
The corpus is assembled from 32,500 teacher clips; the generic subset passes the ASR gate (≈21% dropped), the named‑entity coverage subset is trusted unfiltered, and English rows are upsampled ×2 (~27% exposure) to protect English quality. English phones additionally carry v1's native pronunciation via the warm‑start.
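The gate logic described in this section can be sketched as follows, assuming each manifest row carries its text, a language tag, and a flag marking trusted named-entity clips. The row fields, the `keep_clip` name, the reading of "Han-level" as "compare Han characters only", and the tolerance parameter are all assumptions; the real asr_filter.py may differ.

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")       # CJK Unified Ideographs
WORD = re.compile(r"[a-z0-9']+")


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def error_rate(ref_text, hyp_text, lang):
    """WER over lowercase words for English, CER over Han characters otherwise."""
    if lang == "en":
        ref, hyp = WORD.findall(ref_text.lower()), WORD.findall(hyp_text.lower())
    else:
        ref, hyp = HAN.findall(ref_text), HAN.findall(hyp_text)
    if not ref:
        return 1.0                          # nothing to compare: treat as a miss
    return edit_distance(ref, hyp) / len(ref)


def keep_clip(row, hyp_text, max_error=0.0):
    """Trusted coverage clips always pass; generic clips must match their text."""
    if row.get("trusted", False):
        return True
    return error_rate(row["text"], hyp_text, row["lang"]) <= max_error
```

With `max_error=0.0` a generic clip is kept only on an exact match after punctuation is ignored; a small non-zero tolerance is a tuning choice that the README does not specify.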
How it was trained — the levers
Inflect‑Nano‑v1's 4.63M architecture is not capacity‑limited for this task. Quality came from four fixable things, all keeping the architecture frozen (no NAS, no param changes):
- Phone‑level alignment (align_durations_v4.py): true per‑phone durations (espeak phoneme‑CTC + torchaudio.forced_align) instead of crude char/letter CTC. Skipping this is what makes tiny TTS garble.
- Vocabulary coverage + diverse code‑mix: broad character coverage and a code‑mix frame bank (varied syntax, English in varied positions) so the model isn't overfit to a few templates.
- Teacher choice: the English a tiny model learns is only as native as the teacher's. A Taiwan‑biased teacher gave flat, accented English; VoxCPM2 gives clean, natural zh and en in one voice.
- Warm‑start from Inflect‑Nano‑v1: the acoustic model is initialized from the English‑native v1 checkpoint (199/199 tensors copied, 0 skipped, because the bilingual symbol table already matches), so v1's English transfers directly; the corpus then teaches Taiwan Mandarin on top.
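For the phone-level alignment lever, one detail worth sketching is how aligned phone boundaries become integer frame durations. The example below assumes the aligner output has already been reduced to one end time in seconds per phone, and uses the 24 kHz sample rate and hop of 256 listed for the vocoder. It is not align_durations_v4.py; `ends_to_durations` is a hypothetical helper.

```python
import numpy as np


def ends_to_durations(phone_ends_sec, n_mel_frames, sample_rate=24000, hop=256):
    """Convert per-phone end times (seconds) to integer mel-frame durations.

    Boundaries are rounded, not durations, so rounding error never
    accumulates and the durations always sum to n_mel_frames.
    Expects at least one phone.
    """
    ends = np.asarray(phone_ends_sec, dtype=np.float64)
    bounds = np.rint(ends * sample_rate / hop).astype(np.int64)
    bounds = np.clip(bounds, 0, n_mel_frames)
    bounds[-1] = n_mel_frames                 # last phone absorbs any remainder
    bounds = np.maximum.accumulate(bounds)    # keep boundaries non-decreasing
    return np.diff(bounds, prepend=0)
```

Rounding the boundaries keeps the total equal to the mel length, which a length regulator needs in order to reproduce the target spectrogram frame for frame.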
A 2D mel‑GAN discriminator (training‑only; ONNX is unchanged) sharpens the mel after a 25k‑step pure‑reconstruction warmup, lifting PESQ/MOS. The shipped checkpoint is the 60k step: by then the GAN has both sharpened the mel (clarity) and converged to the best held‑out intelligibility. In the held‑out checkpoint sweep, a mid‑training 40k point dipped before the GAN settled, and 60k then recovered to the best score on all axes.
Architecture
- Acoustic: MicroFastSpeech (~3.47M), with depthwise Conv‑FFN, no attention, external durations + length regulator, frame‑pitch, BiGRU, postnet.
- Vocoder: Snake‑HiFiGAN (~1.17M), 24 kHz variant snake_v2mid (sr 24000, n_fft 1024, hop 256, 80 mels, fmax 12000), retrained on the teacher corpus.
- Frontend: g2pw (Taiwan bopomofo + polyphone disambiguation) + g2p_en (arpabet), merged into one phone sequence with per‑phone language ids, so zh, en, and code‑mix are handled in a single pass.
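The single-pass frontend idea (one phone sequence plus a language id per phone) can be sketched without the real G2P models. In the example below the text is split into Han and Latin runs, each run is sent to a caller-supplied G2P function, and every resulting phone is tagged with a language id. The id values, the function names, and the regex-based splitting are assumptions; the bundled frontend_bopomofo module may work differently.

```python
import re

# A run of Han characters, or one Latin-script word.
SEGMENT = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z][A-Za-z']*")
LANG_TO_ID = {"zh": 0, "en": 1}   # hypothetical ids, for illustration only


def split_by_language(text):
    """Split code-mix text into (lang, chunk) runs; other characters are dropped."""
    runs = []
    for match in SEGMENT.finditer(text):
        chunk = match.group()
        lang = "zh" if "\u4e00" <= chunk[0] <= "\u9fff" else "en"
        runs.append((lang, chunk))
    return runs


def merge_with_lang_ids(runs, g2p):
    """Concatenate phones from every run and tag each phone with its language id.

    g2p maps a language tag to a callable returning a list of phones.
    """
    phones, lang_ids = [], []
    for lang, chunk in runs:
        run_phones = g2p[lang](chunk)
        phones.extend(run_phones)
        lang_ids.extend([LANG_TO_ID[lang]] * len(run_phones))
    return phones, lang_ids
```

Because the language tag travels with each phone, a single model can switch languages mid-sentence without a separate routing step.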
Reproduce / fine‑tune your own
Pipeline: teacher corpus → ASR gate → align → train vocoder → warm‑start + train acoustic → export. Repo layout:
acoustic_encoder.onnx acoustic_decoder.onnx vocoder.onnx meta.json symbol_table.json ← deployable weights (24 kHz)
checkpoints/inflect-micro-fastspeech-60000.pt checkpoints/hifigan-snake_v2mid-final.pt ← shipped checkpoints
scripts/ frontend, aligner, corpus‑gen, diverse‑text, train, export, eval
inflect_nano/ the trainer (acoustic.py + vocoder.py), forked from Inflect‑Nano‑v1 (LICENSE included)
Prerequisites: Python 3.12, a GPU for training; pip install torch torchaudio transformers onnxruntime soundfile librosa g2pw g2p_en cn2an opencc faster-whisper edge-tts.
1 · Teacher corpus (one cloned voice)
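# make a Taiwan‑female reference, then VoxCPM2‑clone every line in that voice
edge-tts --voice zh-TW-HsiaoChenNeural --text "<ref sentence>" --write-media ref.mp3
# convert to 24 kHz mono WAV and save the transcript (same conversion as in the one-command recipe below)
ffmpeg -y -i ref.mp3 -ar 24000 -ac 1 ref.wav
printf '%s' "<ref sentence>" > ref.txt
python gen_voxcpm_corpus.py --texts texts.jsonl --ref ref.wav --ref-text ref.txt \
--out-dir corpus --manifest manifest.jsonl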
2 · ASR quality gate (Taiwan‑tuned)
python asr_filter.py --manifest manifest.jsonl --out manifest \
--device cuda # Breeze‑ASR‑25 (zh/mix) + Whisper‑medium (en) → manifest.clean.jsonl
3 · Phone‑level alignment ⭐ the key step
python scripts/align_durations_v4.py --manifest manifest.clean.jsonl --out align.jsonl
4 · Train the 24 kHz vocoder
PYTHONPATH=. python -m inflect_nano.vocoder --train-jsonl voc_rows.jsonl \
--out-dir vocoder_24k --variant snake_v2mid --steps 40000 --segment-size 16384 --stft-weight 2.5
5 · Warm‑start + train the acoustic (GAN recipe)
PYTHONPATH=. python -m inflect_nano.acoustic --durations-jsonl align.jsonl \
--out-dir acoustic_24k --vocoder-variant snake_v2mid --sample-rate 24000 \
--vocoder-checkpoint vocoder_24k/hifigan-snake_v2mid-final.pt --vocoder-mel-weight 1.0 \
--init-checkpoint inflect_nano_v1_acoustic.pt \
--mel-gan-weight 0.1 --gan-2d --gan-fm-auto --gan-r1-gamma 1.0 --gan-crop 128 --gan-warmup-steps 25000 \
--steps 60000 --batch-size 8 --max-frames 1400 --en-upsample 2
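What a warm-start with "N tensors copied, M skipped" typically amounts to can be sketched with plain dictionaries of numpy arrays standing in for PyTorch state dicts. This is an assumption about the general technique, not the trainer's actual `--init-checkpoint` code; `warm_start` is a hypothetical name.

```python
import numpy as np


def warm_start(target, source):
    """Copy every source tensor whose name and shape match the target.

    Returns (copied, skipped) lists of tensor names so the caller can
    report how much of the checkpoint actually transferred.
    """
    copied, skipped = [], []
    for name, value in source.items():
        if name in target and target[name].shape == value.shape:
            target[name] = value.copy()
            copied.append(name)
        else:
            skipped.append(name)      # missing in target, or shape mismatch
    return copied, skipped
```

An embedding table is the usual place where shapes diverge, so a checkpoint with a matching symbol table transfers without any skipped tensors.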
6 · Export to ONNX + evaluate
python scripts/export_8k.py --acoustic-ckpt acoustic_24k/…pt --vocoder-ckpt vocoder_24k/…pt --out-dir onnx/
python scripts/synth_from_text.py --onnx-dir onnx --out-dir syn --texts eval.jsonl
python scripts/assess_big.py --synth-dir syn # offline CER/WER
Evaluate on ≥30 held‑out sentences — small eval sets are too noisy to trust. Sweep checkpoints and pick the held‑out sweet spot (the GAN keeps improving train‑set sharpness past the held‑out optimum).
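Picking the held-out sweet spot can be as simple as the sketch below, which selects the checkpoint step with the lowest held-out error. The two scores are the zh-CER figures quoted in this README for the 40k and 60k steps; `pick_checkpoint` is a hypothetical helper, and a real sweep would likely weigh several metrics, not one.

```python
def pick_checkpoint(heldout_scores):
    """Return the step with the lowest held-out error (ties go to the earliest step)."""
    return min(sorted(heldout_scores), key=lambda step: heldout_scores[step])


# Held-out zh-CER per training step, as quoted in this README.
scores = {40000: 0.134, 60000: 0.106}
best = pick_checkpoint(scores)
```

Selecting on held-out scores, not training loss, is what guards against a checkpoint that sounds sharper on the training set but reads unseen sentences worse.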
Train on your OWN voice — one command
Swap the reference voice; everything else (text pools, ASR gate, alignment, recipe) is fixed. Both
vocoder and acoustic are retrained (both are voice-specific). Text pools + eval sets are bundled in
data/ and at the repo root, so it reproduces exactly.
# 0. one venv with the deps (see prereqs in scripts/rebuild_voice.sh), PYTHONPATH=repo root,
# and inflect_nano_v1_acoustic.pt from owensong/Inflect-Nano-v1 for the warm-start.
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS && cd PrimeTTS
cp data/*.jsonl data/*.txt . # text pools at root
# 1. a ~10 s clip of your voice. For a commercial-clear reference, use a CC0 source such as
# Mozilla Common Voice zh-TW (the shipped model uses a young-female Common Voice speaker). Or synth one:
edge-tts --voice zh-TW-HsiaoYuNeural --text "您好,歡迎來電。Thank you for calling." --write-media ref.mp3
ffmpeg -y -i ref.mp3 -ar 24000 -ac 1 ref.wav ; printf '%s' "您好,歡迎來電。Thank you for calling." > ref.txt
# 2. ONE command -> corpus -> gate -> align -> vocoder -> acoustic -> export
PY=/path/to/venv/bin/python ./scripts/rebuild_voice.sh ref.wav ref.txt myvoice
# -> pick best corpus_myvoice/onnx_<K>/ (~35k is the usual held-out sweet spot)
Time on dual RTX 5090: ≈ 9 h end-to-end (6.5 h to a shippable 35k checkpoint) — synth ~2 h,
gate+align ~25 min, then vocoder (3 h) ∥ acoustic (~4–7 h) in parallel, export ~15 min.
Credits & licenses
- Base model / trainer: owensong/Inflect-Nano-v1 (Apache‑2.0; see inflect_nano/LICENSE.inflect-nano)
- Teacher TTS: openbmb/VoxCPM2 · Reference voice: Mozilla Common Voice zh‑TW (CC0 / public domain)
- Gate ASR: Breeze-ASR-25 (MediaTek Research, Taiwan Mandarin + code‑switch) · OpenAI Whisper‑medium
- Aligner: facebook/wav2vec2-lv-60-espeak-cv-ft + torchaudio.forced_align
- Frontend: g2pw (Taiwan readings) + g2p_en · Eval ASR: sherpa‑onnx X‑ASR (zh‑en Zipformer)
This repository: Apache‑2.0.