Kokoro-82M: Core AI

hexgrad/Kokoro-82M (Apache-2.0), a tiny high-quality StyleTTS2 + iSTFTNet text-to-speech model (82M params, 24 kHz), converted to Apple Core AI (.aimodel, iOS 27 / macOS 27). It is the CoreAI-Model-Zoo's first TTS model.

Non-autoregressive: phonemes + a voice/style vector → a waveform in one pass. Runs fully on-device, English-first, with grapheme→phoneme on the host.

Use it

โ–ถ๏ธ Run it (source) โ€” the Speak runner (GUI + CLI, one app for every text-to-speech model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/Speak/Speak.xcodeproj
# → Run, then pick "Kokoro 82M" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/Speak
swift run speak-cli --model kokoro-82m --text "Hello from Core AI." --output hello.wav

💻 Build with it. The snippet is complete; the glue is kit API, so it runs as a copy-paste:

import CoreAIKit

let text = "Hello from Core AI."  // any English text
let speaker = try await KitSpeaker(catalog: "kokoro-82m")
let audio = try await speaker.synthesize(text)
// audio.samples: 24 kHz mono PCM in [-1, 1]; play it or write a WAV

The take-home is Examples/Speak/Sources/QuickStart.swift: this exact code as one typed function, no UI. The CLI is an argument shell over it, and the GUI drives the same KitSpeaker(catalog:) and plays the samples. English-first: G2P is a dictionary lookup over the bundled misaki lexicons (~180k words); out-of-dictionary words are letter-spelled (no neural fallback). 28 voices ship with the download; af_heart is the default, and the underlying KokoroTTS takes a voice: label. Streaming? synthesizeStreaming(_:onChunk:) hands you a chunk per sentence.
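The "write a WAV" step can be illustrated off-device. The following is a minimal sketch in Python (standard library only), assuming the samples arrive as a plain list of floats in [-1, 1]; the function name write_wav and the test tone are illustrative and not part of CoreAIKit.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24_000  # Kokoro's output rate, as stated above


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM (hypothetical helper)."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp any stray overshoot
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Stand-in for audio.samples: one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
out_path = os.path.join(tempfile.gettempdir(), "hello.wav")
write_wav(out_path, tone)
```

Clamping before the integer conversion keeps an out-of-range sample from wrapping around in the 16-bit container.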

Integration checklist

  • SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
  • Info.plist: none needed
  • Entitlements: none needed
  • First run downloads the model (0.3 GB on Mac), then it loads from the local cache (Application Support; progress via the downloadProgress callback)
  • Measure in Release: Debug is ~3× slower on per-token host work

Bundles

The acoustic graph has one data-dependent length (the duration→alignment expansion), so it is cut into three voice-independent .aimodel bundles with two cheap host steps between them:

file                      in → out
kokoro_predictor.aimodel  input_ids[1,128] i32, ref_s[1,256], attn_mask[1,128] → duration, d, t_en
kokoro_prosody.aimodel    d, t_en, aln[1,128,512], ref_s, frame_mask[1,512] → asr, F0, N
kokoro_vocoder.aimodel    asr, F0, N, har, ref_s, frame_mask → audio[1, L·600]

voices/*.pt: the 28 English voice packs (Apache-2.0). The voice is the ref_s input: ref_s = pack[len(ids) - 1]. Quality leaders: af_heart, af_bella, af_nicole, bf_emma.
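The ref_s lookup above can be sketched in a few lines. This assumes a voice pack behaves like a sequence of 256-dimensional style rows indexed by token count (inferred from the ref_s[1,256] shape); the helper name select_style and the toy pack are illustrative only.

```python
def select_style(pack, ids):
    """Return ref_s = pack[len(ids) - 1] for this utterance (hypothetical helper)."""
    if not ids:
        raise ValueError("need at least one token id")
    return pack[len(ids) - 1]


# Toy stand-in for a loaded voice pack: row i is a 256-dim vector filled with i.
toy_pack = [[float(i)] * 256 for i in range(128)]

# Four token ids select row 3 of the pack.
ref_s = select_style(toy_pack, ids=[0, 50, 83, 0])
```

Because the row depends only on the token count, the same pack serves every utterance and the three model bundles stay voice-independent.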

Token length T and frame length L are fixed buckets (128 / 512); the host left-pads to the bucket and trims the output. Longer text is split into sentences host-side. Run on the Core AI CPU compute unit. ~0.75 s / utterance on M4 Max, ~335 MB total (fp32).
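The bucket sizes imply a hard ceiling per utterance, which is why longer text is split into sentences. A rough sketch of that arithmetic, assuming the predictor's duration output is a per-token frame count and that each frame maps to 600 samples (from audio[1, L·600]); the function names are illustrative, and which end of the buffer is trimmed is left out because it depends on the padding side.

```python
SAMPLE_RATE = 24_000
TOKEN_BUCKET = 128        # T
FRAME_BUCKET = 512        # L
SAMPLES_PER_FRAME = 600   # from audio[1, L·600]


def max_bucket_seconds():
    """Longest audio one bucket can hold: 512 * 600 / 24000 seconds."""
    return FRAME_BUCKET * SAMPLES_PER_FRAME / SAMPLE_RATE


def valid_samples(durations):
    """Number of output samples to keep after trimming the padded bucket."""
    if len(durations) > TOKEN_BUCKET:
        raise ValueError("too many tokens for the token bucket")
    used_frames = sum(durations)
    if used_frames > FRAME_BUCKET:
        raise ValueError("utterance exceeds the frame bucket; split the text")
    return used_frames * SAMPLES_PER_FRAME


ceiling = max_bucket_seconds()      # 12.8 seconds per bucket
keep = valid_samples([3, 2, 5])     # 10 frames -> 6000 samples (0.25 s)
```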

Host steps

text ──(misaki G2P)──▶ ids ──▶ predictor ──▶ [build alignment] ──▶ prosody
     ──▶ [har = STFT(SineGen(f0_upsamp(F0)))] ──▶ vocoder ──▶ [trim] ──▶ 24 kHz audio

G2P is misaki (misaki[en], no espeak for English); on-device MisakiSwift gives the same English phonemes. har (the hn-nsf source's STFT) is a windowed FFT computed on the host. It is the one piece that must stay off the engine, because its atan2 phase flips by 2π at the F0→0 pad boundary under fp32.
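The [build alignment] host step can be sketched as follows. This is an illustration rather than the kit's code: it assumes durations are integer frame counts per token, that aln is a one-hot [T, L] matrix, and that valid tokens and frames sit at the start of each bucket (the padding side is an assumption here).

```python
def build_alignment(durations, token_bucket=128, frame_bucket=512):
    """Expand per-token frame counts into a one-hot [T, L] alignment
    plus an [L] frame mask (hypothetical helper)."""
    if len(durations) > token_bucket:
        raise ValueError("too many tokens for the token bucket")
    if sum(durations) > frame_bucket:
        raise ValueError("utterance exceeds the frame bucket")
    aln = [[0.0] * frame_bucket for _ in range(token_bucket)]
    frame_mask = [0.0] * frame_bucket
    frame = 0
    for token, dur in enumerate(durations):
        for _ in range(dur):
            aln[token][frame] = 1.0   # this frame belongs to this token
            frame_mask[frame] = 1.0   # and it is a real (non-pad) frame
            frame += 1
    return aln, frame_mask


# Three tokens lasting 2, 1 and 3 frames fill frames 0..5.
aln, frame_mask = build_alignment([2, 1, 3])
```

Each valid frame column contains exactly one 1, so multiplying token-level features by aln repeats every token's features for as many frames as its duration.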

Quality

The hn-nsf source phase is arbitrary (stock Kokoro randomizes it), so the gate is spectral: magnitude-spectrogram correlation 0.999 vs the PyTorch reference (af_heart, multiple sentences). Raw waveform correlation is ~0.98, the bounded, inaudible effect of the bucket pad boundary.
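To see why a spectral gate is the right test when phase is arbitrary, here is a toy illustration. It is not the check in export_kokoro.py: it uses a naive DFT with an arbitrary 64-sample Hann frame and compares two tones that differ only in phase.

```python
import cmath
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def magnitude_spectrogram(signal, frame=64, hop=32):
    """Flattened |STFT| using a Hann window and a naive DFT per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    mags = []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame)]
        for k in range(frame // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame)
                      for n in range(frame))
            mags.append(abs(acc))
    return mags


# Same tone, phase shifted by 90 degrees.
N = 512
ref = [math.sin(2 * math.pi * 8 * n / 64) for n in range(N)]
shifted = [math.sin(2 * math.pi * 8 * n / 64 + math.pi / 2) for n in range(N)]

wave_corr = pearson(ref, shifted)
spec_corr = pearson(magnitude_spectrogram(ref), magnitude_spectrogram(shifted))
```

The waveform correlation collapses toward zero while the magnitude-spectrogram correlation stays at essentially 1, which is why a phase-randomized source has to be judged on magnitudes.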

Convert / re-bucket

conversion/export_kokoro.py (python export_kokoro.py --out-dir out; --verify runs the engine-vs-torch spectral gate; --token-bucket / --frame-bucket to re-size). Card + the full port write-up: zoo/kokoro-82m.md.

License

Apache-2.0 (model weights and the 28 English voices). The Core AI export code derives from Apple's BSD-3-Clause coreai_models.
