Supertonic-2 (MLX)

Supertonic-2-MLX is a pure-MLX port of Supertone/supertonic-2, a lightning-fast on-device TTS system. It runs natively on Apple Silicon through mlx-audio: no ONNX Runtime, no Python inference server, just mx.load + Metal.

  • 66M params, 4 sub-models (duration predictor, text encoder, flow-matching vector estimator, Vocos-style vocoder).
  • 5 languages: English, Korean, Spanish, Portuguese, French.
  • 10 preset voices: M1–M5 (male), F1–F5 (female).
  • 44.1 kHz output, ~0.03 RTF on M4 Pro with 5 Euler steps.
  • float32 parity with the upstream ONNX Runtime pipeline.
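
The "Euler steps" mentioned above are fixed-size integration steps taken by the flow-matching sampler. As a rough illustration only, the sketch below applies the explicit Euler method to a toy velocity field. `toy_velocity` and `euler_integrate` are hypothetical names; the real vector estimator is a neural network, not this closed-form function.

```python
import math

def toy_velocity(x: float, t: float) -> float:
    # Hypothetical stand-in for the learned vector estimator: dx/dt = -x.
    return -x

def euler_integrate(x0: float, steps: int) -> float:
    # Explicit Euler from t=0 to t=1 using `steps` equal-sized steps.
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * toy_velocity(x, i * dt)
    return x

exact = math.exp(-1.0)  # closed-form value of x(1) for dx/dt = -x, x(0) = 1
for n in (5, 20, 80):
    approx = euler_integrate(1.0, n)
    print(n, round(approx, 4), "error:", round(abs(approx - exact), 4))
```

In this toy case the error shrinks as the step count grows, at the cost of one velocity evaluation per step. That is the same quality/speed trade-off a step count controls in a flow-matching sampler.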

Install

Supertonic support hasn't been upstreamed to mlx-audio yet, so install the fork typomonster/mlx-audio:

pip install git+https://github.com/typomonster/mlx-audio.git

Quick start

from mlx_audio.tts import load

# Downloads this repo on first run and caches under ~/.cache/huggingface/.
model = load("typomonster/supertonic-2-mlx")

for r in model.generate("Hello world.", voice="M1", lang="en"):
    # r.audio is an mx.array at model.sample_rate (44100 Hz)
    print(r.samples, r.real_time_factor)

Save to WAV

import numpy as np
import soundfile as sf
from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")
pieces = [np.asarray(r.audio) for r in
          model.generate("오늘 날씨가 정말 좋네요.", voice="F1", lang="ko")]
wav = np.concatenate(pieces) if len(pieces) > 1 else pieces[0]
sf.write("out.wav", wav, model.sample_rate)
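
If soundfile is not available, a 16-bit PCM file can also be written with the standard library. This is a minimal sketch, not part of mlx-audio: `write_pcm16_wav` is a hypothetical helper, and it assumes mono float samples nominally in [-1.0, 1.0].

```python
import struct
import wave

def write_pcm16_wav(path: str, samples, sample_rate: int) -> None:
    # Assumption: mono float samples in [-1.0, 1.0]; out-of-range values are clipped.
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
```

With the `wav` array from the example above, a call might look like `write_pcm16_wav("out.wav", wav.tolist(), model.sample_rate)`.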

Multi-language, multi-voice

from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")

cases = [
    ("en", "M1", "The quick brown fox jumps over the lazy dog."),
    ("ko", "F1", "말듣쓰는 음성 비서입니다."),
    ("es", "F3", "Hola, ¿cómo estás hoy?"),
    ("pt", "M2", "Bom dia, tudo bem?"),
    ("fr", "F5", "Bonjour, comment ça va ?"),
]
for lang, voice, text in cases:
    for r in model.generate(text, voice=voice, lang=lang):
        print(lang, voice, r.samples, r.real_time_factor)

Performance

Measured on Apple M1 Max with 5 Euler steps, post-warmup:

Input                                   | Audio  | Wall  | RTF
"Hello world." (en, M1)                 | 1.46 s | 42 ms | 0.029×
"오늘 아침 공원을 산책했어요." (ko, F1)  | 2.63 s | 47 ms | 0.018×

Lower RTF is better (<1× means faster than real-time).
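
For reference, the real-time factor is wall-clock generation time divided by the duration of the generated audio. The small sketch below recomputes the RTF column from the Audio and Wall figures above; `real_time_factor` is an illustrative helper, not an mlx-audio function.

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    # RTF = time spent generating / duration of the audio produced.
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds

# Figures taken from the Performance measurements above.
print(round(real_time_factor(0.042, 1.46), 3))  # 0.029
print(round(real_time_factor(0.047, 2.63), 3))  # 0.018
```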

More audio samples generated with MLX: https://github.com/typomonster/mlx-audio/tree/main/docs/supertonic

Generation options

model.generate(
    text,
    voice="M1",           # one of M1–M5, F1–F5
    lang="en",            # en | ko | es | pt | fr
    speed=1.05,           # >1 speaks faster (scales predicted duration)
    steps=5,              # Euler steps; more = higher quality, slower
    seed=0,               # deterministic given the same seed + input
    chunk_max_len=None,   # override default (ko=120 chars, others=300)
    silence_between_chunks=0.3,  # seconds between chunks in long texts
)
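
To make chunk_max_len and silence_between_chunks more concrete, here is one plausible way long text could be split and padded. This is an assumption-laden sketch, not the library's actual splitter: `split_text` and `silence_samples` are hypothetical helpers, and sentence-boundary splitting is a guess at the behavior.

```python
import re

DEFAULT_MAX_LEN = {"ko": 120}   # per the option comment above
FALLBACK_MAX_LEN = 300          # all other languages

def split_text(text: str, lang: str = "en", chunk_max_len=None) -> list:
    # Hypothetical splitter: greedily packs whole sentences into chunks of at
    # most `limit` characters. A single sentence longer than the limit is kept
    # whole rather than cut mid-sentence.
    limit = chunk_max_len or DEFAULT_MAX_LEN.get(lang, FALLBACK_MAX_LEN)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def silence_samples(seconds: float, sample_rate: int = 44100) -> int:
    # Number of zero samples to insert between two consecutive chunks.
    return int(round(seconds * sample_rate))

print(split_text("One. Two. Three.", chunk_max_len=10))  # ['One. Two.', 'Three.']
print(silence_samples(0.3))                              # 13230
```

Under these assumptions, a 0.3 s gap at 44.1 kHz corresponds to 13230 zero samples between chunks.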

Files

  • config.json: mlx-audio model config
  • {duration_predictor,text_encoder,vector_estimator,vocoder}.safetensors: MLX weights
  • unicode_indexer.json, voice_styles/*.json: runtime assets
  • tts.json: upstream pipeline config (preserved for reference)

License

OpenRAIL-M (inherited from the upstream model). See LICENSE for the full terms; redistribution must carry the use-based restrictions (Attachment A) forward.
