kk-tokenizer-bpe-32k
Byte-level BPE tokenizer for the Kazakh language (Cyrillic script) with a vocabulary of 32,000 tokens.
Trained from scratch with huggingface/tokenizers on Abzalbek89/corpus_clean, a deduplicated, language-ID-filtered Kazakh corpus (~1.5M documents). The tokenizer operates on raw UTF-8 bytes with GPT-2-style ByteLevel pre-tokenization.
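To illustrate what operating on raw UTF-8 bytes means for Cyrillic text, here is a minimal standard-library sketch of the byte-to-symbol mapping popularized by GPT-2. It is an illustration, not this tokenizer's training code; the helper names `to_byte_symbols` and `from_byte_symbols` are hypothetical.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable character (GPT-2 scheme)."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (controls, space, etc.) are moved above U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_symbols(text: str) -> str:
    """Hypothetical helper: UTF-8 encode, then map every byte to its symbol."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_symbols(symbols: str) -> str:
    """Hypothetical helper: invert the mapping and decode back to text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


word = "қазақ"
symbols = to_byte_symbols(word)
print(len(word), len(symbols))             # 5 characters, 10 byte symbols
print(from_byte_symbols(symbols) == word)  # lossless round trip
```

Each Kazakh Cyrillic letter takes two bytes in UTF-8, so BPE merges first have to reassemble letters before they can build subwords. Because every byte has a symbol, no input can fall outside the base alphabet.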
Held-out fertility result
| Metric | Value |
|---|---|
| Rank | 1 / 13 (lower fertility is better) |
| Vocabulary size | 32,000 |
| Fertility (tokens / word) | 1.679 |
| Compression (chars / token) | 4.672 |
| Compression (bytes / token) | 8.460 |
| Total tokens / words | 4,981,192 / 2,966,207 |
Evaluated on 2,966,207 whitespace-delimited words from the validation split of Abzalbek89/corpus_clean (14,831+ documents). Full ranking and methodology: Abzalbek89/kk-tokenizer-fertility-baseline.
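The metrics in the table can be sketched as simple ratios over a corpus. The snippet below is a hedged illustration, not the benchmark's actual script: `fertility_stats` and `toy_tokenize` are hypothetical names, and counting whitespace in the character and byte totals is an assumption.

```python
from typing import Callable, Iterable


def fertility_stats(docs: Iterable[str], tokenize: Callable[[str], list]) -> dict:
    """Compute tokens/word, chars/token and bytes/token over a set of documents."""
    n_tokens = n_words = n_chars = n_bytes = 0
    for doc in docs:
        n_tokens += len(tokenize(doc))
        n_words += len(doc.split())          # whitespace-delimited words
        n_chars += len(doc)                  # assumption: whitespace is counted
        n_bytes += len(doc.encode("utf-8"))  # assumption: whitespace is counted
    return {
        "fertility": n_tokens / n_words,
        "chars_per_token": n_chars / n_tokens,
        "bytes_per_token": n_bytes / n_tokens,
    }


def toy_tokenize(text: str) -> list:
    """Stand-in tokenizer for the demo: split each word into 4-character chunks."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


stats = fertility_stats(["қазақ тілі"], toy_tokenize)
print(stats["fertility"])  # 3 tokens / 2 words = 1.5

# The reported fertility follows directly from the totals in the table.
print(round(4_981_192 / 2_966_207, 3))  # 1.679
```

To score a real tokenizer, pass its tokenize function in place of `toy_tokenize`.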
Usage
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tok = AutoTokenizer.from_pretrained("Abzalbek89/kk-tokenizer-bpe-32k")

text = "Қазақстан Республикасының мемлекеттік тілі қазақ тілі."
# Encode without special tokens so the count reflects the text alone
ids = tok.encode(text, add_special_tokens=False)
print(len(ids), tok.tokenize(text))
Companion artifacts
- Training corpus: Abzalbek89/corpus_clean
- Fertility benchmark + reproducible scripts: Abzalbek89/kk-tokenizer-fertility-baseline
- Sibling tokenizers (BPE / Unigram / SP / morph-aware): see the dataset README for the full table.
Citation
@misc{kk_tokenizer_2026,
title = {Tokenizer Optimization for Kazakh Small Language Models},
author = {Abzalbek Ulasbek},
year = {2026},
howpublished = {Hugging Face Hub: \url{https://huggingface.co/Abzalbek89/kk-tokenizer-bpe-32k}},
}
License
Apache 2.0