How to use DunbaaBERT/DunbaaBERT_96k_base with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="DunbaaBERT/DunbaaBERT_96k_base")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("DunbaaBERT/DunbaaBERT_96k_base")
model = AutoModelForMaskedLM.from_pretrained("DunbaaBERT/DunbaaBERT_96k_base")
```
DunbaaBERT
DunbaaBERT is a family of Urdu RoBERTa-base encoder models trained from scratch on a deduplicated 17 GB Urdu corpus. The models use Byte-BPE vocabularies of 32k, 52k, and 96k tokens and are released under the MIT license.
Model Details
- Model type: RoBERTa-style masked language model
- Language: Urdu
- Architecture: Encoder-only Transformer
- Training objective: Masked Language Modeling with Whole Word Masking (WWM)
- Sequence length: 512 tokens
- Training corpus: 17 GB deduplicated Urdu text
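The Whole Word Masking objective listed above can be illustrated with a minimal sketch. It assumes GPT-2/RoBERTa-style byte-level BPE tokens, where a leading `Ġ` marks the first sub-token of a word. The function name `whole_word_mask`, the 15% masking rate, and the Latin-script example tokens are illustrative assumptions, not details taken from the DunbaaBERT training code, and the usual 80/10/10 replacement scheme is left out for brevity.

```python
import random


def whole_word_mask(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Mask whole words in a list of byte-level BPE tokens (illustrative only).

    Assumption: a token starting with "Ġ" opens a new word, and every
    following token without that prefix continues the same word.
    """
    rng = random.Random(seed)

    # Group token positions into words.
    words = []
    for i, tok in enumerate(tokens):
        if i == 0 or tok.startswith("Ġ"):
            words.append([i])
        else:
            words[-1].append(i)

    masked = list(tokens)
    labels = [None] * len(tokens)  # None marks positions that are not predicted
    for word in words:
        # One random draw per word: all of its sub-tokens are masked together.
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = tokens[i]
                masked[i] = mask_token
    return masked, labels


if __name__ == "__main__":
    example = ["Ġun", "believ", "able", "Ġmodel", "s"]
    print(whole_word_mask(example, mask_prob=0.5, seed=1))
```

The point of the per-word draw is that a model never sees part of a word while being asked to predict the rest, which is what separates this from masking sub-tokens independently.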
Model Variants
| Model | Vocabulary Size | Parameters |
|---|---|---|
| DunbaaBERT-32k | 32,009 | 110,625,024 |
| DunbaaBERT-52k | 52,009 | 125,985,024 |
| DunbaaBERT-96k | 96,009 | 159,777,024 |
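The parameter counts in the table appear to differ only by the size of the token-embedding matrix. The sketch below checks this arithmetic under two assumptions that the card does not state explicitly: a hidden size of 768 (the standard RoBERTa-base value) and a token-embedding matrix that is counted once.

```python
HIDDEN_SIZE = 768  # assumption: standard RoBERTa-base hidden size

# (vocabulary size, total parameters), copied from the table above.
variants = {
    "DunbaaBERT-32k": (32_009, 110_625_024),
    "DunbaaBERT-52k": (52_009, 125_985_024),
    "DunbaaBERT-96k": (96_009, 159_777_024),
}


def non_embedding_params(vocab_size, total_params, hidden_size=HIDDEN_SIZE):
    """Parameters left after subtracting a vocab_size x hidden_size matrix."""
    return total_params - vocab_size * hidden_size


remainders = {
    name: non_embedding_params(vocab, total)
    for name, (vocab, total) in variants.items()
}

if __name__ == "__main__":
    for name, rest in remainders.items():
        print(f"{name}: {rest:,} parameters outside the token embeddings")
```

Under these assumptions all three variants leave the same remainder, which is consistent with the models sharing one encoder configuration and differing only in vocabulary.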
Training Data
The final corpus was constructed from multiple Urdu resources and deduplicated at line level.
| Corpus | Size |
|---|---|
| mC4 | 17.0 GB |
| OSCAR-2019 | 869 MB |
| OSCAR-2109 | 604 MB |
| OSCAR-2201 | 344 MB |
| OSCAR-2301 | 982 MB |
| Urdu Wikipedia | 364 MB |
| Filtered NLLB Urdu | 2.1 GB |
| Total before deduplication | 22.3 GB |
| Final corpus | 17.0 GB |
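Line-level deduplication of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `dedup_lines`, the whitespace stripping before comparison, the dropping of empty lines, and the use of SHA-1 digests to bound memory are all assumptions.

```python
import hashlib


def dedup_lines(lines):
    """Yield each distinct line once, in first-seen order.

    Assumptions (not from the model card): lines are compared after
    stripping surrounding whitespace, and empty lines are dropped.
    """
    seen = set()
    for line in lines:
        key = line.strip()
        if not key:
            continue
        # Store a fixed-size digest instead of the full line to save memory.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line


if __name__ == "__main__":
    sample = ["یہ ایک جملہ ہے۔\n", "دوسرا جملہ۔\n", "یہ ایک جملہ ہے۔\n"]
    print(list(dedup_lines(sample)))
```

Because the function is a generator, it can be applied to a file handle and stream a large corpus without loading it into memory; only the set of digests grows with the number of distinct lines.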
Pre-training
- Vocabulary size: 96k
- Training steps: 100k
- Hardware: 2x H100 GPUs, with a batch size of 8k
Evaluation Results
Main Results
| Model | UrBLiMP | COUNT19 F1 | USADC F1 | PSL-Kabaddi F1 | IMDB Urdu F1 | Avg. Norm. Eff. |
|---|---|---|---|---|---|---|
| DunbaaBERT-32k | 95.1 | 94.44 | 94.08 | 70.08 | 90.13 | 0.859 |
| DunbaaBERT-52k | 97.0 | 94.91 | 91.75 | 67.60 | 90.14 | 0.795 |
| DunbaaBERT-96k | 94.6 | 95.22 | 89.97 | 70.53 | 90.65 | 0.813 |
| Urdu-RoBERTa-small | 90.5 | 92.08 | 85.36 | 67.06 | 84.72 | 0.781 |
| HPLT-BERT-ur | 97.3 | 95.71 | 93.51 | 71.11 | 89.69 | 0.597 |
| mBERT | 75.5 | 90.88 | 83.03 | 65.78 | 85.47 | 0.744 |
| mmBERT-small | 89.5 | 92.36 | 73.09 | 70.36 | 85.44 | 0.494 |
| mmBERT-base | 92.4 | 93.97 | 77.77 | 67.75 | 87.31 | 0.495 |
| XLM-R-base | 89.6 | 93.72 | 85.22 | 60.56 | 88.69 | 0.754 |
| XLM-R-large | 94.1 | 94.38 | 83.55 | 69.62 | 91.15 | 0.492 |
Efficiency
We report a normalized efficiency metric that combines Macro-F1 with inference throughput (the Avg. Norm. Eff. column in the main results table). All three DunbaaBERT models score higher on this metric (0.795 to 0.859) than every other model in the table, the best of which is Urdu-RoBERTa-small at 0.781.
DunbaaBERT-52k achieved the strongest UrBLiMP linguistic probing score within the DunbaaBERT family (97.0), second overall to HPLT-BERT-ur (97.3), while DunbaaBERT-32k had the best average normalized efficiency (0.859). DunbaaBERT-96k ranked second in average efficiency (0.813) despite having the largest vocabulary.
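The rankings discussed here can be reproduced directly from the main results table. The sketch below only sorts the reported numbers; it does not recompute the efficiency metric, whose exact formula is not given in this card.

```python
# (UrBLiMP score, Avg. Norm. Eff.), copied from the main results table.
results = {
    "DunbaaBERT-32k": (95.1, 0.859),
    "DunbaaBERT-52k": (97.0, 0.795),
    "DunbaaBERT-96k": (94.6, 0.813),
    "Urdu-RoBERTa-small": (90.5, 0.781),
    "HPLT-BERT-ur": (97.3, 0.597),
    "mBERT": (75.5, 0.744),
    "mmBERT-small": (89.5, 0.494),
    "mmBERT-base": (92.4, 0.495),
    "XLM-R-base": (89.6, 0.754),
    "XLM-R-large": (94.1, 0.492),
}

# Sort model names by each column, best first.
by_efficiency = sorted(results, key=lambda m: results[m][1], reverse=True)
by_urblimp = sorted(results, key=lambda m: results[m][0], reverse=True)

if __name__ == "__main__":
    print("Avg. Norm. Eff. ranking:", by_efficiency)
    print("UrBLiMP ranking:", by_urblimp)
```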
Fairseq Checkpoint
Get the fairseq checkpoint here.
Citation
@misc{maab2026dunbaabertsacrificesemantics,
title={DunbaaBERT: From Sacrifice to Semantics},
author={Iffat Maab and Waleed Jamil and Raphael Schmitt},
year={2026},
eprint={2605.26935},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.26935},
}
License
MIT License