Supertonic-2 (MLX)
Supertonic-2-MLX is a pure-MLX port of Supertone/supertonic-2, a lightning-fast on-device TTS system. It runs natively on Apple Silicon through mlx-audio: no ONNX Runtime, no Python inference server, just mx.load + Metal.
- 66M params, 4 sub-models (duration predictor, text encoder, flow-matching vector estimator, Vocos-style vocoder).
- 5 languages: English, Korean, Spanish, Portuguese, French.
- 10 preset voices: M1–M5 (male), F1–F5 (female).
- 44.1 kHz output, ~0.03 RTF on M4 Pro with 5 Euler steps.
- float32 parity with the upstream ONNX Runtime pipeline.
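To make the "5 Euler steps" figure concrete, here is a generic sketch of fixed-step Euler integration, the kind of solver commonly used to sample from flow-matching models. It is not the Supertonic vector estimator: `velocity` is a hypothetical stand-in for the learned network, and the toy ODE is used only because its exact solution is known.

```python
import math

def euler_integrate(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)  # one call to the (hypothetical) network per step
    return x

# Toy velocity field (an assumption for illustration); exact solution is x(1) = x0 * exp(-1).
def toy_velocity(x, t):
    return -x

exact = math.exp(-1.0)
for n in (5, 20, 100):
    approx = euler_integrate(toy_velocity, 1.0, n)
    print(n, "steps -> error", abs(approx - exact))
```

Each step costs one evaluation of the velocity function, so the step count trades accuracy against compute roughly linearly.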
Install
Supertonic support hasn't been upstreamed to mlx-audio yet, so install the fork typomonster/mlx-audio:
pip install git+https://github.com/typomonster/mlx-audio.git
Quick start
from mlx_audio.tts import load
# Downloads this repo on first run and caches under ~/.cache/huggingface/.
model = load("typomonster/supertonic-2-mlx")
for r in model.generate("Hello world.", voice="M1", lang="en"):
    # r.audio is an mx.array at model.sample_rate (44100 Hz)
    print(r.samples, r.real_time_factor)
Save to WAV
import numpy as np, soundfile as sf
from mlx_audio.tts import load
model = load("typomonster/supertonic-2-mlx")
pieces = [np.asarray(r.audio) for r in
          model.generate("오늘 날씨가 정말 좋네요.", voice="F1", lang="ko")]
wav = np.concatenate(pieces) if len(pieces) > 1 else pieces[0]
sf.write("out.wav", wav, model.sample_rate)
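If soundfile is not available, the standard-library `wave` module can write 16-bit PCM instead. The sketch below assumes the samples are mono floats in [-1, 1], which is an assumption about `r.audio` rather than something this card states. `write_wav_int16` is a hypothetical helper, and a synthetic tone stands in for real model output.

```python
import math
import struct
import wave

def write_wav_int16(path, samples, sample_rate):
    """Write mono float samples (assumed to lie in [-1, 1]) as 16-bit PCM."""
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)  # guard against overshoot
    ints = [int(round(s * 32767)) for s in clipped]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(struct.pack("<%dh" % len(ints), *ints))

# Stand-in signal: 0.1 s of a 440 Hz tone at 44.1 kHz.
sr = 44100
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 10)]
write_wav_int16("tone.wav", tone, sr)
```

With real output, the concatenated `wav` array and `model.sample_rate` from the snippet above would take the place of `tone` and `sr`.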
Multi-language, multi-voice
from mlx_audio.tts import load
model = load("typomonster/supertonic-2-mlx")
cases = [
    ("en", "M1", "The quick brown fox jumps over the lazy dog."),
    ("ko", "F1", "๋ง๋ฃ์ฐ๋ 음성 비서입니다."),
    ("es", "F3", "Hola, ¿cómo estás hoy?"),
    ("pt", "M2", "Bom dia, tudo bem?"),
    ("fr", "F5", "Bonjour, comment ça va ?"),
]
for lang, voice, text in cases:
    for r in model.generate(text, voice=voice, lang=lang):
        print(lang, voice, r.samples, r.real_time_factor)
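When the language and voice come from user input, it can help to check them against the values this card lists before calling `generate`. The helper below is a hypothetical sketch, not part of mlx-audio; the allowed sets are copied from the feature list in this card.

```python
SUPPORTED_LANGS = {"en", "ko", "es", "pt", "fr"}
PRESET_VOICES = {f"{g}{i}" for g in "MF" for i in range(1, 6)}  # M1..M5, F1..F5

def check_request(lang, voice):
    """Raise ValueError for a language or voice this model card does not list."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported lang {lang!r}; expected one of {sorted(SUPPORTED_LANGS)}")
    if voice not in PRESET_VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {sorted(PRESET_VOICES)}")
    return lang, voice

print(check_request("ko", "F1"))
```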
Performance
Measured on Apple M1 Max with 5 Euler steps, post-warmup:
| Input | Audio | Wall | RTF |
|---|---|---|---|
| "Hello world." (en, M1) | 1.46 s | 42 ms | 0.029× |
| "오늘 아침 공원을 산책했어요." (ko, F1) | 2.63 s | 47 ms | 0.018× |
Lower RTF is better (<1× means faster than real-time).
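As a quick check on the table, the sketch below recomputes RTF under the usual definition, wall-clock synthesis time divided by audio duration. That definition is an assumption here, though it matches the reported values. The function name and row labels are illustrative.

```python
def real_time_factor(wall_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    return wall_seconds / audio_seconds

# (label, audio duration in s, wall time in s) taken from the table above.
rows = [("en, M1", 1.46, 0.042), ("ko, F1", 2.63, 0.047)]
for label, audio_s, wall_s in rows:
    rtf = real_time_factor(wall_s, audio_s)
    print(f"{label}: RTF {rtf:.3f}, about {1 / rtf:.0f}x faster than real time")
```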
More audio samples generated with MLX: https://github.com/typomonster/mlx-audio/tree/main/docs/supertonic
Generation options
model.generate(
    text,
    voice="M1",                  # one of M1–M5, F1–F5
    lang="en",                   # en | ko | es | pt | fr
    speed=1.05,                  # >1 speaks faster (scales predicted duration)
    steps=5,                     # Euler steps; more = higher quality, slower
    seed=0,                      # deterministic given the same seed + input
    chunk_max_len=None,          # override default (ko=120 chars, others=300)
    silence_between_chunks=0.3,  # seconds between chunks in long texts
)
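To illustrate what `chunk_max_len` and `silence_between_chunks` control, here is a minimal sketch of sentence-based chunking. It is not the chunker mlx-audio actually uses: the splitting rule, `split_into_chunks`, and `silence_samples` are all assumptions made for illustration. Only the default limits (120 characters for Korean, 300 otherwise) come from the option comments above.

```python
import re

DEFAULT_MAX_LEN = {"ko": 120}  # other languages fall back to 300

def split_into_chunks(text, lang="en", chunk_max_len=None):
    """Greedily pack whole sentences into chunks up to the length limit.

    A single sentence longer than the limit is kept whole in this sketch.
    """
    max_len = chunk_max_len or DEFAULT_MAX_LEN.get(lang, 300)
    sentences = re.findall(r"[^.!?]+[.!?]*", text)
    chunks, current = [], ""
    for s in (s.strip() for s in sentences):
        if not s:
            continue
        candidate = f"{current} {s}".strip()
        if current and len(candidate) > max_len:
            chunks.append(current)   # close the current chunk, start a new one
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def silence_samples(seconds, sample_rate=44100):
    """Number of zero samples needed for a pause of the given length."""
    return int(round(seconds * sample_rate))

print(split_into_chunks("One. Two. Three.", chunk_max_len=9))
print(silence_samples(0.3))
```

Under these assumptions, a text split into n chunks gains (n - 1) pauses of `silence_between_chunks` seconds each.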
Files
- config.json: mlx-audio model config
- {duration_predictor,text_encoder,vector_estimator,vocoder}.safetensors: MLX weights
- unicode_indexer.json, voice_styles/*.json: runtime assets
- tts.json: upstream pipeline config (preserved for reference)
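For a manually downloaded copy, a small check like the following can confirm that the files listed above are present. It assumes the brace pattern expands to four separate .safetensors files and that everything sits in one flat directory. Both are assumptions, and `missing_files` is a hypothetical helper.

```python
from pathlib import Path

EXPECTED_FILES = [
    "config.json",
    "duration_predictor.safetensors",
    "text_encoder.safetensors",
    "vector_estimator.safetensors",
    "vocoder.safetensors",
    "unicode_indexer.json",
    "tts.json",
]

def missing_files(model_dir):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    voice_dir = root / "voice_styles"
    # At least one voice style JSON is expected; the exact file names are not assumed.
    if not (voice_dir.is_dir() and any(voice_dir.glob("*.json"))):
        missing.append("voice_styles/*.json")
    return missing

print(missing_files("."))
```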
References
- Upstream model: Supertone/supertonic-2
- Upstream code: supertone-inc/supertonic · fork with MLX integration: typomonster/supertonic
- mlx-audio (with Supertonic support): typomonster/mlx-audio · upstream: Blaizzy/mlx-audio
License
OpenRAIL-M (inherited from the upstream model). See LICENSE for the full terms; redistribution must carry the use-based restrictions (Attachment A) forward.