DunbaaBERT

DunbaaBERT is a family of Urdu RoBERTa-base encoder models trained from scratch on a deduplicated 17 GB Urdu corpus. The models use Byte-BPE vocabularies of 32k, 52k, and 96k tokens and are released under the MIT license.
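A minimal sketch of querying a checkpoint for masked-token predictions. It assumes the weights load through the Hugging Face `transformers` fill-mask pipeline and uses the repository id shown on the model page. The helper name `top_predictions` and the `{mask}` template convention are illustrative, not part of the release.

```python
REPO_ID = "DunbaaBERT/DunbaaBERT_96k_base"  # repository id as shown on the model page


def top_predictions(template: str, k: int = 5, repo_id: str = REPO_ID):
    """Fill the single '{mask}' slot in `template`; return (token, score) pairs.

    Assumption: the checkpoint is compatible with the transformers
    fill-mask pipeline. This downloads the weights on first use.
    """
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=repo_id)
    # Use the tokenizer's own mask token rather than hard-coding one.
    text = template.format(mask=fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=k)]
```

Calling `top_predictions("پاکستان کا دارالحکومت {mask} ہے۔")` would return the five highest-scoring candidates for the masked position, provided the assumption above holds.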

Model Details

Model type: RoBERTa-style masked language model
Language: Urdu
Architecture: Encoder-only Transformer
Training objective: Masked Language Modeling with Whole Word Masking (WWM)
Sequence length: 512 tokens
Training corpus: 17 GB deduplicated Urdu text
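The Whole Word Masking objective listed above can be illustrated with a short sketch. It is not the authors' training code. It assumes GPT-2 style byte-level BPE tokens, where a leading `Ġ` marks the start of a new word, a masking budget of 15% of tokens (the common BERT default), and it always substitutes the mask token rather than applying the usual 80/10/10 replacement split. The token strings are Latin-script placeholders chosen for readability.

```python
import random


def group_into_words(tokens):
    """Group token indices into words; a token starting with 'Ġ' opens a new word."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("Ġ") or not words:
            words.append([i])
        else:
            words[-1].append(i)
    return words


def whole_word_mask(tokens, mask_prob=0.15, mask_token="<mask>", seed=None):
    """Mask whole words until roughly `mask_prob` of the tokens are covered.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    words = group_into_words(tokens)
    rng.shuffle(words)
    budget = max(1, round(len(tokens) * mask_prob))
    masked = list(tokens)
    labels = [None] * len(tokens)
    covered = 0
    for word in words:
        if covered >= budget:
            break
        # Every subword of the chosen word is masked together, so the
        # budget may be exceeded slightly by the last word selected.
        for i in word:
            labels[i] = tokens[i]
            masked[i] = mask_token
        covered += len(word)
    return masked, labels


tokens = ["Ġun", "believ", "able", "Ġresults", "Ġare", "Ġhere"]
masked, labels = whole_word_mask(tokens, seed=0)
print(masked)
```

The point of grouping first is that a word split into several subwords is either fully hidden or fully visible, so the model cannot recover a masked piece from its unmasked neighbours within the same word.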

Model Variants

Model	Vocabulary Size	Parameters
DunbaaBERT-32k	32,009	110,625,024
DunbaaBERT-52k	52,009	125,985,024
DunbaaBERT-96k	96,009	159,777,024
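The parameter counts differ only through the token embedding matrix, which can be checked with a short calculation. The sketch below assumes the standard RoBERTa-base configuration (hidden size 768, 12 layers, feed-forward size 3072, 514 position embeddings, one token type, pooler included, no language-modeling head); these values are not stated in this card. Under those assumptions the totals reproduce the table.

```python
HIDDEN = 768      # assumed RoBERTa-base hidden size
LAYERS = 12       # assumed number of encoder layers
FFN = 3072        # assumed feed-forward inner size
POSITIONS = 514   # assumed: 512 tokens plus RoBERTa's position offset of 2
TYPE_VOCAB = 1    # assumed: single token type


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def encoder_layer():
    attention = 4 * linear(HIDDEN, HIDDEN)  # query, key, value, output projections
    feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN)
    return attention + feed_forward + 2 * layer_norm(HIDDEN)


def roberta_base_params(vocab_size):
    embeddings = (vocab_size + POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
    pooler = linear(HIDDEN, HIDDEN)
    return embeddings + LAYERS * encoder_layer() + pooler


for name, vocab in [("DunbaaBERT-32k", 32_009),
                    ("DunbaaBERT-52k", 52_009),
                    ("DunbaaBERT-96k", 96_009)]:
    print(f"{name}: {roberta_base_params(vocab):,}")
```

Each additional vocabulary entry costs 768 parameters, so moving from 32,009 to 96,009 tokens adds 64,000 × 768 = 49,152,000 parameters while the encoder itself is unchanged.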

Training Data

The final corpus was constructed from multiple Urdu resources and deduplicated at line level.

Corpus	Size
mC4	17.0 GB
OSCAR-2019	869 MB
OSCAR-2109	604 MB
OSCAR-2201	344 MB
OSCAR-2301	982 MB
Urdu Wikipedia	364 MB
Filtered NLLB Urdu	2.1 GB
Total before deduplication	22.3 GB
Final corpus	17.0 GB
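Line-level deduplication as described above can be sketched as follows. This is an illustration, not the pipeline used to build the corpus: exact matching after whitespace stripping, hashing with SHA-1 to bound memory, and dropping blank lines are all assumptions, and `dedup_lines` is a hypothetical helper name.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct non-empty line once, in first-seen order."""
    seen = set()
    for line in lines:
        key = line.strip()
        if not key:
            continue  # assumption: blank lines carry no training signal
        # Store a fixed-size digest instead of the full line to save memory.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield key


sample = ["یہ پہلا جملہ ہے۔\n", "یہ دوسرا جملہ ہے۔\n", "یہ پہلا جملہ ہے۔\n"]
unique = list(dedup_lines(sample))
print(len(unique))
```

Because the function is a generator, it can be applied to a file handle and streamed to disk without loading the full corpus; only the set of digests is held in memory.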

Pre-training

Vocabulary size: 96k
Training steps: 100k
Batch size: 8k
Hardware: 2x H100 GPUs

Evaluation Results

Main Results

Model	UrBLiMP	COUNT19 F1	USADC F1	PSL-Kabaddi F1	IMDB Urdu F1	Avg. Norm. Eff.
DunbaaBERT-32k	95.1	94.44	94.08	70.08	90.13	0.859
DunbaaBERT-52k	97.0	94.91	91.75	67.60	90.14	0.795
DunbaaBERT-96k	94.6	95.22	89.97	70.53	90.65	0.813
Urdu-RoBERTa-small	90.5	92.08	85.36	67.06	84.72	0.781
HPLT-BERT-ur	97.3	95.71	93.51	71.11	89.69	0.597
mBERT	75.5	90.88	83.03	65.78	85.47	0.744
mmBERT-small	89.5	92.36	73.09	70.36	85.44	0.494
mmBERT-base	92.4	93.97	77.77	67.75	87.31	0.495
XLM-R-base	89.6	93.72	85.22	60.56	88.69	0.754
XLM-R-large	94.1	94.38	83.55	69.62	91.15	0.492

Efficiency

We report a normalized efficiency metric that combines Macro-F1 with inference throughput (the Avg. Norm. Eff. column in the table above). Averaged across benchmarks, all three DunbaaBERT variants (0.795 to 0.859) score higher on this metric than every baseline in the table, the best of which is Urdu-RoBERTa-small at 0.781.

DunbaaBERT-52k achieved the strongest linguistic probing performance on UrBLiMP among the DunbaaBERT variants (97.0), just behind HPLT-BERT-ur (97.3), while DunbaaBERT-32k had the highest average efficiency overall (0.859). DunbaaBERT-96k ranked second in average efficiency (0.813) despite having the largest vocabulary.

Fairseq Checkpoint

Get the fairseq checkpoint here.

Citation

@misc{maab2026dunbaabertsacrificesemantics,
      title={DunbaaBERT: From Sacrifice to Semantics}, 
      author={Iffat Maab and Waleed Jamil and Raphael Schmitt},
      year={2026},
      eprint={2605.26935},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.26935}, 
}

License

MIT License
