AIT Piper KK

Fast, offline Kazakh text-to-speech model for Piper. The model contains seven fixed voices and supports incremental audio generation for real-time applications.

Model Characteristics

| Property | Value |
|---|---|
| Language | Kazakh (kk) |
| Architecture | Piper/VITS |
| Runtime format | ONNX |
| Sample rate | 22,050 Hz |
| Audio format | Mono PCM |
| Speakers | 7 fixed voices |
| Voice cloning | Not supported |
| Streaming | Supported through the Piper Python API |
| Model size | Approximately 74 MB |
| Phonemizer | eSpeak NG, Kazakh voice |

Speaker IDs

| Speaker ID | Voice name |
|---|---|
| 0 | kk_F1 |
| 1 | kk_F2 |
| 2 | kk_F3 |
| 3 | kk_M2 |
| 4 | kk_emo_1263201035 |
| 5 | kk_emo_399172782 |
| 6 | kk_emo_805570882 |

The voice names are identifiers, not style controls. Select a voice by passing its numeric speaker ID at inference time.
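If application code refers to voices by name, a small lookup keeps the numeric IDs in one place. The sketch below copies the names and IDs from the list above; the `SPEAKER_IDS` constant and the `speaker_id_for` helper are hypothetical names introduced here for illustration and are not part of Piper.

```python
# Voice names and numeric IDs copied from the speaker list in this model card.
SPEAKER_IDS = {
    "kk_F1": 0,
    "kk_F2": 1,
    "kk_F3": 2,
    "kk_M2": 3,
    "kk_emo_1263201035": 4,
    "kk_emo_399172782": 5,
    "kk_emo_805570882": 6,
}


def speaker_id_for(voice_name: str) -> int:
    """Return the numeric speaker ID for a voice name, or raise a clear error."""
    try:
        return SPEAKER_IDS[voice_name]
    except KeyError:
        known = ", ".join(sorted(SPEAKER_IDS))
        raise ValueError(
            f"Unknown voice {voice_name!r}; expected one of: {known}"
        ) from None
```

The returned integer is the value to pass as the speaker ID at inference time.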

Files

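Keep both files in the same directory. If you rename one, rename the other to match, because Piper by default looks for the configuration at the model path with .json appended: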

kk_KZ-ait-piper-kk-medium.onnx
kk_KZ-ait-piper-kk-medium.onnx.json

The JSON sidecar contains the phoneme map, speaker mapping, sample rate, and default synthesis settings required by Piper.
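To check what a downloaded sidecar declares before wiring it into an application, it can be read as ordinary JSON. This is a minimal sketch: the key names (`audio.sample_rate`, `num_speakers`, `speaker_id_map`, `inference`) follow the layout commonly seen in Piper voice configurations and are an assumption about this particular file, and `describe_sidecar` is a hypothetical helper.

```python
import json
from pathlib import Path


def describe_sidecar(config_path):
    """Summarize the sidecar fields that matter at inference time.

    Key names are assumed from the usual Piper config layout; missing keys
    come back as None or an empty dict instead of raising.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {
        "sample_rate": config.get("audio", {}).get("sample_rate"),
        "num_speakers": config.get("num_speakers"),
        "speaker_id_map": config.get("speaker_id_map", {}),
        "inference_defaults": config.get("inference", {}),
    }


# Example call (path as used elsewhere in this card):
# describe_sidecar("./ait-piper-kk/kk_KZ-ait-piper-kk-medium.onnx.json")
```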

Installation

pip install "piper-tts>=1.4,<2" onnxruntime

Download the repository with huggingface-cli, which is provided by the huggingface_hub package:

huggingface-cli download nur-dev/ait-piper-kk \
  --local-dir ./ait-piper-kk

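Authentication is required because this repository is gated, not private: accept the access conditions on the model page, then run huggingface-cli login before downloading.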

Command-Line Inference

echo "Сәлеметсіз бе! Қазақ тіліндегі дыбыстау жүйесі жұмыс істеп тұр." |
  piper \
    --model ./ait-piper-kk/kk_KZ-ait-piper-kk-medium.onnx \
    --speaker 0 \
    --output_file output.wav

Change --speaker to a value from 0 through 6.

Python Inference

import wave

from piper import PiperVoice, SynthesisConfig

model_path = "./ait-piper-kk/kk_KZ-ait-piper-kk-medium.onnx"
voice = PiperVoice.load(model_path)

config = SynthesisConfig(speaker_id=3)

with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize_wav(
        "Бүгін Алматыда күн ашық, ауа райы жылы болады.",
        wav_file,
        syn_config=config,
    )

Load PiperVoice once when the application starts and reuse it for subsequent requests. Repeated model loading adds avoidable latency.
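One way to enforce a single load is to put the call behind a cached accessor. The sketch below uses `functools.lru_cache`; `get_voice` is a hypothetical helper name, and only `PiperVoice.load` comes from the examples in this card.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_voice(model_path: str):
    """Load the Piper voice for model_path once; later calls return the cached object."""
    # Imported lazily so that importing this module does not require Piper
    # until a voice is actually requested.
    from piper import PiperVoice

    return PiperVoice.load(model_path)
```

Calling `get_voice(model_path)` once during application startup pays the loading cost before the first request arrives, and every later call with the same path returns the same object.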

Streaming Inference

voice.synthesize yields audio incrementally, typically one chunk per sentence. Send each chunk to the playback, telephony, or network output as soon as it arrives.

from piper import PiperVoice, SynthesisConfig

voice = PiperVoice.load(
    "./ait-piper-kk/kk_KZ-ait-piper-kk-medium.onnx"
)
config = SynthesisConfig(speaker_id=6)

for chunk in voice.synthesize(
    "Қазақ тіліндегі жылдам дыбыстау жүйесіне қош келдіңіз.",
    syn_config=config,
):
    # send_audio is a placeholder for the application's own output function.
    # chunk.audio_int16_bytes contains mono signed 16-bit PCM.
    # chunk.sample_rate is 22050 for this model.
    send_audio(
        chunk.audio_int16_bytes,
        sample_rate=chunk.sample_rate,
        sample_width=chunk.sample_width,
        channels=chunk.sample_channels,
    )
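As one possible consumer for the streamed chunks, the sketch below appends each chunk to a WAV file as it arrives, using only the standard library. `write_chunks_to_wav` is a hypothetical helper; it relies only on the chunk attributes used in the example above (`audio_int16_bytes`, `sample_rate`, `sample_width`, `sample_channels`).

```python
import wave


def write_chunks_to_wav(chunks, path):
    """Write streamed audio chunks to a WAV file in arrival order.

    The WAV header parameters are taken from the first chunk.
    Returns the number of chunks written; no file is created if there are none.
    """
    count = 0
    wav_file = None
    try:
        for chunk in chunks:
            if wav_file is None:
                wav_file = wave.open(path, "wb")
                wav_file.setnchannels(chunk.sample_channels)
                wav_file.setsampwidth(chunk.sample_width)
                wav_file.setframerate(chunk.sample_rate)
            wav_file.writeframes(chunk.audio_int16_bytes)
            count += 1
    finally:
        if wav_file is not None:
            wav_file.close()
    return count


# Example call, assuming `voice` and `config` from the streaming example:
# write_chunks_to_wav(voice.synthesize(text, syn_config=config), "streamed.wav")
```

The same loop shape applies to a socket or an audio device: configure the output from the first chunk, then forward the bytes of every chunk without waiting for the rest of the text.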

Synthesis Controls

config = SynthesisConfig(
    speaker_id=0,
    length_scale=1.0,
    noise_scale=0.667,
    noise_w_scale=0.8,
    normalize_audio=True,
)
  • length_scale: values above 1.0 speak more slowly; values below 1.0 speak faster.
  • noise_scale: controls acoustic variation. Large changes may reduce stability.
  • noise_w_scale: controls duration variation.
  • normalize_audio: scales the output to a consistent peak level, which suits most playback paths.

The defaults in the model configuration are recommended. Tune one setting at a time and validate every speaker used by the application.
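Tuning one setting at a time can be made mechanical by generating trial settings that each differ from the baseline in exactly one value. In this sketch, `BASELINE` repeats the values shown in the example above, while `one_at_a_time` and the candidate values are hypothetical and purely illustrative, not recommendations.

```python
# Baseline values as shown in the SynthesisConfig example in this card.
BASELINE = {"length_scale": 1.0, "noise_scale": 0.667, "noise_w_scale": 0.8}


def one_at_a_time(baseline, candidates):
    """Yield (name, value, settings) where settings changes only `name`."""
    for name, values in candidates.items():
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline, nothing to compare
            yield name, value, {**baseline, name: value}


# Illustrative candidate values only; choose real ones by listening.
CANDIDATES = {"length_scale": [0.9, 1.1], "noise_scale": [0.5]}
trials = list(one_at_a_time(BASELINE, CANDIDATES))
```

Each `settings` dictionary can then be expanded into the configuration, for example `SynthesisConfig(speaker_id=0, **settings)`, and the loop repeated for every speaker ID the application uses.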

Best Practices

  • Use correctly spelled Kazakh Cyrillic text, including ә, ғ, қ, ң, ө, ұ, ү, һ, and і.
  • Write numbers, abbreviations, dates, currencies, and symbols as words when pronunciation must be predictable.
  • Preserve punctuation. Commas and sentence boundaries improve phrasing.
  • Split long paragraphs into complete sentences and stream them in order.
  • Keep the model and JSON sidecar together.
  • Reuse one loaded model instance instead of loading it per request.
  • Choose the speaker explicitly; do not depend on an implicit default.
  • For telephony or another sample rate, synthesize at 22,050 Hz first and use a proper audio resampler afterward.
  • Avoid extreme synthesis-control values in production without listening tests.
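For the resampling step, a polyphase resampler needs integer up and down factors derived from the two sample rates. The sketch below computes them with the standard library and then applies `scipy.signal.resample_poly`; it assumes NumPy and SciPy are installed, uses 8,000 Hz only as an example target rate, and introduces the hypothetical helpers `resample_ratio` and `resample_int16`.

```python
from fractions import Fraction


def resample_ratio(source_rate: int, target_rate: int):
    """Return the reduced integer (up, down) factors for polyphase resampling."""
    ratio = Fraction(target_rate, source_rate)
    return ratio.numerator, ratio.denominator


def resample_int16(pcm_bytes: bytes, source_rate: int, target_rate: int) -> bytes:
    """Resample mono signed 16-bit little-endian PCM with an anti-aliasing filter."""
    # Third-party imports are local so the ratio helper works without them.
    import numpy as np
    from scipy.signal import resample_poly

    up, down = resample_ratio(source_rate, target_rate)
    samples = np.frombuffer(pcm_bytes, dtype="<i2").astype(np.float32)
    resampled = resample_poly(samples, up, down)
    # Round and clip back into the signed 16-bit range before packing.
    return np.clip(np.rint(resampled), -32768, 32767).astype("<i2").tobytes()


# Example: 22,050 Hz model output to an assumed 8,000 Hz telephony rate.
# narrowband = resample_int16(pcm_bytes, 22050, 8000)
```

Resampling each streamed chunk independently can leave small artifacts at chunk boundaries, so it is safer to resample a whole utterance or to use a resampler that keeps filter state between calls.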

Scope

This model is intended for Kazakh text. Russian, English, mixed-language text, voice cloning, and arbitrary speaker embeddings are outside its supported scope.
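A cheap pre-flight check can catch some out-of-scope input before synthesis. The sketch below flags digits and non-Cyrillic letters, which usually indicate Latin-script text or numbers that have not been written out. It is an illustrative heuristic with a hypothetical helper name, and it cannot tell Russian from Kazakh because both use Cyrillic.

```python
import unicodedata


def find_out_of_scope_chars(text: str):
    """Return a sorted list of digits and non-Cyrillic letters found in text."""
    flagged = set()
    for ch in text:
        if ch.isdigit():
            flagged.add(ch)  # numbers should be written out as words
        elif ch.isalpha() and "CYRILLIC" not in unicodedata.name(ch, ""):
            flagged.add(ch)  # letter from another script, for example Latin
    return sorted(flagged)
```

An empty result means no digits or foreign-script letters were found; it does not guarantee that the text is correctly spelled Kazakh.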
