AIT Piper KK
Fast, offline Kazakh text-to-speech model for Piper. The model contains seven fixed voices and supports incremental audio generation for real-time applications.
Model Characteristics
| Property | Value |
|---|---|
| Language | Kazakh (kk) |
| Architecture | Piper/VITS |
| Runtime format | ONNX |
| Sample rate | 22,050 Hz |
| Audio format | Mono PCM |
| Speakers | 7 fixed voices |
| Voice cloning | Not supported |
| Streaming | Supported through the Piper Python API |
| Model size | Approximately 74 MB |
| Phonemizer | eSpeak NG, Kazakh voice |
Speaker IDs
| Speaker ID | Voice name |
|---|---|
| 0 | kk_F1 |
| 1 | kk_F2 |
| 2 | kk_F3 |
| 3 | kk_M2 |
| 4 | kk_emo_1263201035 |
| 5 | kk_emo_399172782 |
| 6 | kk_emo_805570882 |
The voice names are identifiers, not style controls. Select a voice by passing its numeric speaker ID at inference time.
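Because the voice names are only labels, an application may find it convenient to keep a small lookup from name to numeric ID. The sketch below hard-codes the table above; `SPEAKER_IDS` and `speaker_id_for` are illustrative names and are not part of Piper.

```python
# Illustrative lookup copied from the Speaker IDs table; not part of the Piper API.
SPEAKER_IDS = {
    "kk_F1": 0,
    "kk_F2": 1,
    "kk_F3": 2,
    "kk_M2": 3,
    "kk_emo_1263201035": 4,
    "kk_emo_399172782": 5,
    "kk_emo_805570882": 6,
}


def speaker_id_for(voice_name: str) -> int:
    """Return the numeric speaker ID for a voice name, or raise ValueError."""
    try:
        return SPEAKER_IDS[voice_name]
    except KeyError:
        known = ", ".join(sorted(SPEAKER_IDS))
        raise ValueError(
            f"Unknown voice {voice_name!r}; expected one of: {known}"
        ) from None
```

The result can then be passed on at inference time, for example `SynthesisConfig(speaker_id=speaker_id_for("kk_M2"))`.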
Files
Keep both files in the same directory. If you rename one, rename the other to match, because by default Piper looks for the configuration at the model path plus .json:
kk_KZ-ait-piper-kk-medium.onnx
kk_KZ-ait-piper-kk-medium.onnx.json
The JSON sidecar contains the phoneme map, speaker mapping, sample rate, and default synthesis settings required by Piper.
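One way to confirm the pair is intact is to read the sidecar before loading the model. The sketch below assumes the common Piper configuration layout, with `audio.sample_rate` and `speaker_id_map` keys; those key names are an assumption about this particular file and should be checked against the actual JSON. `describe_sidecar` is a hypothetical helper.

```python
import json
from pathlib import Path


def describe_sidecar(model_path: str) -> dict:
    """Read <model_path>.json and return its sample rate and speaker map.

    The key names follow the common Piper config layout (assumed, not
    verified against this specific model).
    """
    sidecar = Path(f"{model_path}.json")
    if not sidecar.is_file():
        raise FileNotFoundError(f"Missing JSON sidecar: {sidecar}")
    config = json.loads(sidecar.read_text(encoding="utf-8"))
    return {
        "sample_rate": config.get("audio", {}).get("sample_rate"),
        "speakers": config.get("speaker_id_map", {}),
    }
```

Calling `describe_sidecar("./ait-piper-kk/kk_KZ-ait-piper-kk-medium.onnx")` at startup fails early if the JSON file was moved or renamed.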
Installation
pip install "piper-tts>=1.4,<2" onnxruntime
Download the repository:
huggingface-cli download nur-dev/ait-piper-kk \
  --local-dir ./ait-piper-kk
Authentication is required because this repository is private. Run huggingface-cli login, or set the HF_TOKEN environment variable, before downloading.
Command-Line Inference
echo "Сәлеметсіз бе! Қазақ тіліндегі дыбыстау жүйесі жұмыс істеп тұр." |
  piper \
    --model ./ait-piper-kk/kk_KZ-ait-piper-kk-medium.onnx \
    --speaker 0 \
    --output_file output.wav
Change --speaker to a value from 0 through 6.
Python Inference
import wave

from piper import PiperVoice, SynthesisConfig

model_path = "./ait-piper-kk/kk_KZ-ait-piper-kk-medium.onnx"

# The .onnx.json sidecar must sit next to the model file.
voice = PiperVoice.load(model_path)

# Speaker 3 is kk_M2.
config = SynthesisConfig(speaker_id=3)

# Write the synthesized speech to a WAV file.
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize_wav(
        "Бүгін Алматыда күн ашық, ауа райы жылы болады.",
        wav_file,
        syn_config=config,
    )
Load PiperVoice once when the application starts and reuse it for subsequent
requests. Repeated model loading adds avoidable latency.
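A minimal sketch of the load-once pattern, using `functools.lru_cache` so that the first call loads the model and later calls return the same instance. `get_voice` and the injectable `loader` argument are illustrative names; the default loader simply wraps `PiperVoice.load`.

```python
from functools import lru_cache


def _load_piper_voice(model_path: str):
    # Imported lazily so this module can be imported without piper installed.
    from piper import PiperVoice

    return PiperVoice.load(model_path)


@lru_cache(maxsize=None)
def get_voice(model_path: str, loader=_load_piper_voice):
    """Load the model on first use and return the cached instance afterwards."""
    return loader(model_path)
```

Request handlers can then call `get_voice(model_path)` freely. If several threads make the very first call at the same time, the loader may run more than once, so a server may prefer to call `get_voice` once during startup.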
Streaming Inference
voice.synthesize() yields audio chunks incrementally. Send each chunk to the playback, telephony, or network output as soon as it arrives.
from piper import PiperVoice, SynthesisConfig

voice = PiperVoice.load(
    "./ait-piper-kk/kk_KZ-ait-piper-kk-medium.onnx"
)
config = SynthesisConfig(speaker_id=6)

# send_audio is a placeholder for the application's own output function.
for chunk in voice.synthesize(
    "Қазақ тіліндегі жылдам дыбыстау жүйесіне қош келдіңіз.",
    syn_config=config,
):
    # chunk.audio_int16_bytes contains mono signed 16-bit PCM.
    # chunk.sample_rate is 22050 for this model.
    send_audio(
        chunk.audio_int16_bytes,
        sample_rate=chunk.sample_rate,
        sample_width=chunk.sample_width,
        channels=chunk.sample_channels,
    )
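`send_audio` in the loop above is left to the application; Piper does not provide it. As one possible shape, the sketch below appends each chunk to a WAV stream using only the standard library. `WavSink` is a hypothetical class, and a real deployment would more likely write to a sound device or a network socket.

```python
import wave


class WavSink:
    """Append PCM chunks to a WAV stream, setting the header from the first chunk."""

    def __init__(self, target):
        # target: a file path or a writable, seekable binary file object.
        self._target = target
        self._wav = None
        self.frames_written = 0

    def send_audio(self, pcm_bytes, sample_rate, sample_width, channels):
        if self._wav is None:
            # Configure the WAV header lazily, once the format is known.
            self._wav = wave.open(self._target, "wb")
            self._wav.setnchannels(channels)
            self._wav.setsampwidth(sample_width)
            self._wav.setframerate(sample_rate)
        self._wav.writeframes(pcm_bytes)
        self.frames_written += len(pcm_bytes) // (sample_width * channels)

    def close(self):
        if self._wav is not None:
            self._wav.close()
            self._wav = None
```

With `sink = WavSink("stream.wav")`, the loop can call `sink.send_audio(...)` with the same arguments and finish with `sink.close()`.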
Synthesis Controls
config = SynthesisConfig(
    speaker_id=0,
    length_scale=1.0,
    noise_scale=0.667,
    noise_w_scale=0.8,
    normalize_audio=True,
)
- length_scale: values above 1.0 speak more slowly; values below 1.0 speak faster.
- noise_scale: controls acoustic variation. Large changes may reduce stability.
- noise_w_scale: controls duration variation.
- normalize_audio: keeps output level consistent for common playback use.
The defaults in the model configuration are recommended. Tune one setting at a time and validate every speaker used by the application.
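Tuning one setting at a time can be organized as a small sweep in which every trial differs from the baseline in exactly one value. The sketch below only builds the keyword arguments; `one_at_a_time` is a hypothetical helper, the baseline copies the values shown in the example above, and the candidate values in the usage note are arbitrary illustrations rather than recommendations.

```python
# Baseline values copied from the SynthesisConfig example in this section.
DEFAULTS = {
    "length_scale": 1.0,
    "noise_scale": 0.667,
    "noise_w_scale": 0.8,
}


def one_at_a_time(defaults, candidates):
    """Yield (name, value, settings) with exactly one setting changed per trial."""
    for name, values in candidates.items():
        for value in values:
            settings = dict(defaults)  # copy so the baseline is never mutated
            settings[name] = value
            yield name, value, settings
```

For example, iterating over `one_at_a_time(DEFAULTS, {"length_scale": [0.9, 1.1]})` for each speaker in use and building `SynthesisConfig(speaker_id=speaker, **settings)` gives one listening sample per trial.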
Best Practices
- Use correctly spelled Kazakh Cyrillic text, including ә, ғ, қ, ң, ө, ұ, ү, һ, and і.
- Write numbers, abbreviations, dates, currencies, and symbols as words when pronunciation must be predictable.
- Preserve punctuation. Commas and sentence boundaries improve phrasing.
- Split long paragraphs into complete sentences and stream them in order.
- Keep the model and JSON sidecar together.
- Reuse one loaded model instance instead of loading it per request.
- Choose the speaker explicitly; do not depend on an implicit default.
- For telephony or another sample rate, synthesize at 22,050 Hz first and use a proper audio resampler afterward.
- Avoid extreme synthesis-control values in production without listening tests.
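The advice to split long paragraphs into complete sentences can be sketched with a simple punctuation-based splitter. This is a rough illustration, not a Kazakh-aware tokenizer: `split_sentences` is a hypothetical helper, and it will also split after abbreviations that end in a period, which is one more reason to write abbreviations out as words.

```python
import re

# Split on whitespace that follows ., !, ? or … so the punctuation stays
# attached to its sentence and the phrasing cues are preserved.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> list[str]:
    """Split a paragraph into sentences, keeping their original order."""
    parts = _SENTENCE_END.split(text.strip())
    return [part.strip() for part in parts if part.strip()]
```

Each returned sentence can then be passed to `voice.synthesize` in order, so playback of the first sentence can begin while later ones are still being generated.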
Scope
This model is intended for Kazakh text. Russian, English, mixed-language text, voice cloning, and arbitrary speaker embeddings are outside its supported scope.
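An application may want to flag input that is likely out of scope before synthesis. The sketch below reports Latin-script letters, which catches English words and Latin look-alike characters typed in place of Cyrillic ones. It cannot tell Russian from Kazakh, since both use Cyrillic, and `latin_letters` is a hypothetical helper rather than anything shipped with the model.

```python
import unicodedata


def latin_letters(text: str) -> list[str]:
    """Return the distinct Latin-script letters in text, in order of appearance."""
    found = []
    for ch in text:
        is_latin = ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
        if is_latin and ch not in found:
            found.append(ch)
    return found
```

An empty result means no Latin letters were found; a non-empty one can be logged or used to reject the request.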