Hindi fine-tune of MiniCPM5-1B now available + GGUF quants

#6
by pankajpandey-dev - opened

Hi @openbmb team and community! 👋

Thanks for releasing MiniCPM5-1B. The tokenizer handles Devanagari beautifully (0.81 tokens/char on Hindi text), and the model is the perfect size for low-resource Indic adaptation.
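For anyone who wants to check a tokens/char figure like this on their own text, here is a minimal sketch. It assumes "char" means one Unicode code point (the post does not say how characters were counted), and `toy_encode` is a hypothetical stand-in so the snippet runs without downloading anything. In practice you would pass the real tokenizer's encode function instead.

```python
from typing import Callable, Iterable, List


def tokens_per_char(encode: Callable[[str], List[int]], texts: Iterable[str]) -> float:
    """Total token count divided by total character count over all texts."""
    total_tokens = 0
    total_chars = 0
    for text in texts:
        total_tokens += len(encode(text))
        # Assumption: one "char" is one Unicode code point, as len() counts it.
        total_chars += len(text)
    if total_chars == 0:
        raise ValueError("no characters to measure")
    return total_tokens / total_chars


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in encoder: one token per whitespace-separated word."""
    return list(range(len(text.split())))


sample = ["नमस्ते दुनिया", "यह एक परीक्षण है"]
ratio = tokens_per_char(toy_encode, sample)
print(f"{ratio:.2f} tokens/char (toy encoder, not the real tokenizer)")
```

Lower is better here: fewer tokens per character means more Hindi text fits in the same context window.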

I've released a Hindi instruction-tuned version trained on AI4Bharat's indic-instruct-data-v0.1 (anudesh + dolly Hindi splits, ~4k high-quality examples):

🔗 HF Model: https://huggingface.co/pankajpandey-dev/MiniCPM5-1B-Hindi-Instruct
🔗 GGUF Quants (Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0): https://huggingface.co/pankajpandey-dev/MiniCPM5-1B-Hindi-Instruct-v1-GGUF

Training stack: Unsloth + TRL + LoRA (r=32), 60 min on a single T4. Full details on the model card.
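As a rough illustration of why a rank-32 LoRA fits comfortably on a T4: LoRA adds two small matrices (d_in x r and r x d_out) next to each adapted weight matrix, so the extra trainable parameters per matrix come to r * (d_in + d_out). The 2048 x 2048 shape below is a hypothetical placeholder, not MiniCPM5-1B's actual layer size; only r=32 comes from the post.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one adapted d_in x d_out weight matrix."""
    # Matrix A is d_in x rank, matrix B is rank x d_out.
    return rank * (d_in + d_out)


# Hypothetical projection shape, for illustration only.
D_IN, D_OUT = 2048, 2048
RANK = 32  # the rank mentioned in the post

dense = D_IN * D_OUT
extra = lora_extra_params(D_IN, D_OUT, RANK)
print(f"dense weights: {dense}, LoRA weights: {extra}, ratio: {extra / dense:.4f}")
```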

One note for the llama.cpp folks: the BPE pre-tokenizer hash isn't in llama.cpp's registry yet. I registered 36f3066e97b7f3994b379aaacde306c1444c6ae84e81a5ae3cd2b7ed3b8c42d4 → qwen2 as the closest match, and conversion worked cleanly. Happy to submit a PR to llama.cpp upstream if this is the right pre-tokenizer family for MiniCPM5.
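For context on what that hash is, here is a small sketch of the idea as I understand it from llama.cpp's conversion script: encode a fixed probe text, take the SHA-256 of the string form of the token ids, and look the digest up in a table of known pre-tokenizer families. The probe text, `toy_encode`, and the function names below are hypothetical stand-ins; only the hash and the qwen2 mapping come from this post.

```python
import hashlib
from typing import Callable, Dict, List, Optional

# Hash quoted in the post, mapped to the family chosen as the closest match.
PRE_TOKENIZER_REGISTRY: Dict[str, str] = {
    "36f3066e97b7f3994b379aaacde306c1444c6ae84e81a5ae3cd2b7ed3b8c42d4": "qwen2",
}


def pre_tokenizer_hash(encode: Callable[[str], List[int]], probe_text: str) -> str:
    """SHA-256 hex digest of the string form of the token ids for a probe text."""
    token_ids = encode(probe_text)
    return hashlib.sha256(str(token_ids).encode()).hexdigest()


def lookup_pre_tokenizer(chkhsh: str, registry: Dict[str, str]) -> Optional[str]:
    """Return the registered family for a hash, or None if it is unknown."""
    return registry.get(chkhsh)


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in encoder: one id per code point."""
    return [ord(ch) for ch in text]


# Hypothetical probe text; the real script uses its own fixed string.
toy_hash = pre_tokenizer_hash(toy_encode, "Hello नमस्ते 123")
print(toy_hash, "->", lookup_pre_tokenizer(toy_hash, PRE_TOKENIZER_REGISTRY))
```

Two tokenizers that split the probe text differently produce different digests, which is why a new model usually needs a new registry entry even when an existing family's pre-tokenizer rules would work.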

Looking forward to more Indic fine-tunes of this base. Thanks again!

OpenBMB org

Hi Pankaj, thank you so much for the great work!

We're really excited to see MiniCPM5-1B adapted for Hindi instruction tuning, and the GGUF quants will be very helpful for the community.

Regarding the llama.cpp tokenizer / pre-tokenizer issue, we have already adapted a version for reference:

https://github.com/zhangtao2-1/llama.cpp/

Thanks again for the excellent contribution. Looking forward to more fine-tuned variants built on MiniCPM5! 🚀

Can you train it on the Russian language?

I haven't worked with Russian datasets personally yet, but it should definitely be possible to fine-tune MiniCPM5-1B for Russian as well.

The main challenge for me would be evaluation and alignment quality since I don't know Russian. If members of the community are interested in collaborating on datasets, evaluation, or benchmarking, I'd be very happy to help with the training side 🙂

I have no idea how to train it. All 1B models are so poorly optimized for Russian and other languages (1B models are optimized only for English) that small models are not worth using.
