kk-tokenizer-bpe-32k

Byte-level BPE tokenizer for the Kazakh language (Cyrillic script) with a vocabulary of 32,000 tokens. It was trained from scratch with huggingface/tokenizers on Abzalbek89/corpus_clean, a deduplicated, language-ID-filtered Kazakh corpus of about 1.5M documents. The tokenizer operates on raw UTF-8 bytes and uses GPT-2-style ByteLevel pre-tokenization.
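
To make the byte-level view concrete, the sketch below rebuilds the well-known GPT-2 byte-to-character table and applies it to a Kazakh word. It is an illustrative reimplementation, not code from this repository; the helper names `bytes_to_unicode` and `byte_level_view` are chosen here for the example. It shows that each Cyrillic letter occupies two UTF-8 bytes, so the BPE merges start from two symbols per letter.

```python
# Illustrative only: a reimplementation of the GPT-2 byte-to-character table,
# not code taken from this repository.

def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Remaining bytes (control characters, space, and so on) are shifted
    # to code points from 256 upward.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


def byte_level_view(text: str) -> str:
    """Render the UTF-8 bytes of `text` as byte-level symbols."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


word = "қазақ"
print(len(word), len(word.encode("utf-8")))  # 5 characters, 10 bytes
print(byte_level_view(" " + word))           # leading space becomes "Ġ"
```

The two-bytes-per-letter property is also why the bytes-per-token figure reported for this tokenizer is close to twice the characters-per-token figure.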

Held-out fertility result

| Metric | Value |
| --- | --- |
| Rank | 1 / 13 (lower fertility is better) |
| Vocabulary size | 32,000 |
| Fertility (tokens / word) | 1.679 |
| Compression (chars / token) | 4.672 |
| Compression (bytes / token) | 8.460 |
| Total tokens / words | 4,981,192 / 2,966,207 |

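Evaluated on 2,966,207 whitespace-delimited words (~3M) from the validation split of Abzalbek89/corpus_clean (14,831+ documents). Fertility is the total token count divided by the total word count, so a lower value means fewer tokens per word. Full ranking and methodology: Abzalbek89/kk-tokenizer-fertility-baseline.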
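
The metrics above can be sketched as follows. This is a minimal illustration, not the evaluation script from the baseline repository: the function name `tokenizer_stats` is hypothetical, and counting characters and bytes over the full text (whitespace included) is an assumption. Any callable that maps a string to a list of tokens can be passed as `encode`.

```python
from typing import Callable, Iterable


def tokenizer_stats(texts: Iterable[str], encode: Callable[[str], list]) -> dict:
    """Aggregate fertility and compression over a collection of texts."""
    tokens = words = chars = n_bytes = 0
    for text in texts:
        tokens += len(encode(text))
        words += len(text.split())            # whitespace-delimited words
        chars += len(text)                    # assumption: whitespace is counted
        n_bytes += len(text.encode("utf-8"))  # assumption: raw UTF-8 length
    return {
        "fertility": tokens / words,
        "chars_per_token": chars / tokens,
        "bytes_per_token": n_bytes / tokens,
    }


# Toy stand-in tokenizer: one token per whitespace-delimited word.
stats = tokenizer_stats(["қазақ тілі"], str.split)
print(stats)

# The reported fertility follows from the totals in the table.
print(round(4_981_192 / 2_966_207, 3))  # 1.679
```

With the real tokenizer, `encode` could be something like `lambda t: tok.encode(t, add_special_tokens=False)`, where `tok` is the object loaded in the Usage section.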

Usage

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("Abzalbek89/kk-tokenizer-bpe-32k")

text = "Қазақстан Республикасының мемлекеттік тілі қазақ тілі."
# Encode without special tokens so the count reflects the text alone.
ids = tok.encode(text, add_special_tokens=False)
print(len(ids), tok.tokenize(text))

Companion artifacts

Citation

@misc{kk_tokenizer_2026,
  title  = {Tokenizer Optimization for Kazakh Small Language Models},
  author = {Abzalbek Ulasbek},
  year   = {2026},
  howpublished = {Hugging Face Hub: \url{https://huggingface.co/Abzalbek89/kk-tokenizer-bpe-32k}},
}

License

Apache 2.0
