kk-tokenizer-bpe-32k

Byte-Level BPE tokenizer for the Kazakh language (Cyrillic), vocabulary size 32,000. Trained with huggingface/tokenizers on the cleaned Kazakh corpus Abzalbek89/corpus_clean (~1.5M documents).

Byte-level BPE tokenizer trained from scratch on a deduplicated, language-id-filtered Kazakh corpus. Operates on raw UTF-8 bytes with GPT-2-style ByteLevel pre-tokenization.

Held-out fertility result

Metric	Value
Rank	1 / 13 (lower fertility is better)
Vocabulary size	32,000
Fertility (tokens / word)	1.679
Compression (chars / token)	4.672
Compression (bytes / token)	8.460
Total tokens / words	4,981,192 / 2,966,207

Evaluated on 2,966,207 whitespace-words from the validation split of Abzalbek89/corpus_clean (14,831+ documents, ~3M words). Full ranking and methodology: Abzalbek89/kk-tokenizer-fertility-baseline.

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Abzalbek89/kk-tokenizer-bpe-32k")
ids = tok.encode("Қазақстан Республикасының мемлекеттік тілі қазақ тілі.",
                 add_special_tokens=False)
print(len(ids), tok.tokenize("Қазақстан Республикасының мемлекеттік тілі қазақ тілі."))

Companion artifacts

Training corpus: Abzalbek89/corpus_clean
Fertility benchmark + reproducible scripts: Abzalbek89/kk-tokenizer-fertility-baseline
Sibling tokenizers (BPE / Unigram / SP / morph-aware): see the dataset README for the full table.

Citation

@misc{kk_tokenizer_2026,
  title  = {Tokenizer Optimization for Kazakh Small Language Models},
  author = {Abzalbek Ulasbek},
  year   = {2026},
  howpublished = {Hugging Face Hub: \url{https://huggingface.co/Abzalbek89/kk-tokenizer-bpe-32k}},
}

License

Apache 2.0

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support