DunbaaBERT

DunbaaBERT is a family of Urdu RoBERTa-base encoder models trained from scratch on a deduplicated 17 GB Urdu corpus. The models use Byte-BPE vocabularies of 32k, 52k, and 96k tokens and are released under the MIT license.
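
For readers who want to try the model, the following is a minimal sketch of masked-token prediction with the Hugging Face `transformers` fill-mask pipeline. The repository id is the one shown on the model page; that the checkpoint loads through `transformers` is an assumption this card does not state, and `top_predictions` is a hypothetical helper name used only for illustration.

```python
# Repository id as shown on the model page.
MODEL_ID = "DunbaaBERT/DunbaaBERT_96k_base"


def top_predictions(text_with_mask, k=5):
    """Return the k most likely fillers for the single mask token in the input.

    Assumption: the checkpoint is loadable with the transformers library.
    """
    # Imported lazily so that defining this helper does not require the library.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    results = fill(text_with_mask, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]
```

Calling `top_predictions` with an Urdu sentence that contains the tokenizer's mask token (`<mask>` for standard RoBERTa tokenizers) downloads the model on first use and returns candidate tokens with their scores.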

Model Details

  • Model type: RoBERTa-style masked language model
  • Language: Urdu
  • Architecture: Encoder-only Transformer
  • Training objective: Masked Language Modeling with Whole Word Masking (WWM)
  • Sequence length: 512 tokens
  • Training corpus: 17 GB deduplicated Urdu text
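
Whole Word Masking means that when a word is selected for masking, all of its subword tokens are masked together. A minimal sketch of that idea follows. It assumes the `Ġ` word-start marker used by GPT-2/RoBERTa-style byte-level BPE tokenizers and a 15% masking rate, neither of which is stated in this card; it also omits the random and keep-original replacements of the standard MLM recipe. The example tokens are English-like placeholders, not real DunbaaBERT tokens.

```python
import random


def group_into_words(tokens, word_start_prefix="Ġ"):
    """Group token indices into words; a token with the prefix opens a new word."""
    words = []
    for i, tok in enumerate(tokens):
        if i == 0 or tok.startswith(word_start_prefix):
            words.append([i])
        else:
            words[-1].append(i)
    return words


def whole_word_mask(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Mask whole words: every subword of a selected word is replaced.

    mask_rate=0.15 is an assumed default, not a value taken from the card.
    """
    words = group_into_words(tokens)
    if not words:
        return [], words
    rng = random.Random(seed)
    n_to_mask = min(len(words), max(1, round(mask_rate * len(words))))
    masked = list(tokens)
    for word in rng.sample(words, n_to_mask):
        for i in word:
            masked[i] = mask_token
    return masked, words


tokens = ["Ġun", "believ", "able", "Ġresult", "s", "Ġhere"]
masked, words = whole_word_mask(tokens, mask_rate=0.34)
print(masked)
```

With token-level masking, `believ` could be masked while `Ġun` and `able` stay visible, which makes the prediction much easier; masking at the word level removes that shortcut.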

Model Variants

| Model | Vocabulary Size | Parameters |
|---|---|---|
| DunbaaBERT-32k | 32,009 | 110,625,024 |
| DunbaaBERT-52k | 52,009 | 125,985,024 |
| DunbaaBERT-96k | 96,009 | 159,777,024 |
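
The three variants differ in size only through the token-embedding matrix: the parameter counts above grow by exactly 768 per additional vocabulary entry. The sketch below makes that arithmetic explicit. It assumes standard RoBERTa-base dimensions (hidden size 768, 12 layers, feed-forward size 3072, 514 position embeddings, one token type) and an encoder counted with its pooler layer but without the LM head. None of these settings is stated in this card, but under them the function reproduces all three counts.

```python
def roberta_encoder_params(vocab_size, hidden=768, layers=12, ffn=3072,
                           max_positions=514, type_vocab=1, with_pooler=True):
    """Parameter count of a RoBERTa-style encoder (assumed configuration)."""
    layer_norm = 2 * hidden  # weight and bias
    # Token, position and token-type embeddings plus the embedding LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections plus the attention LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections plus the output LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler


for vocab in (32_009, 52_009, 96_009):
    print(vocab, f"{roberta_encoder_params(vocab):,}")
# 32009 110,625,024
# 52009 125,985,024
# 96009 159,777,024
```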

Training Data

The final corpus was constructed from multiple Urdu resources and deduplicated at line level.
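
As an illustration of line-level deduplication, here is a minimal sketch that keeps the first occurrence of each line. It is not the authors' pipeline: exact matching after stripping surrounding whitespace, dropping empty lines, and hashing with SHA-1 to bound memory are all assumptions made for this example.

```python
import hashlib


def dedup_lines(lines):
    """Yield each distinct non-empty line once, in first-seen order."""
    seen = set()
    for line in lines:
        key = line.strip()
        if not key:
            continue  # skip blank lines
        # Store a fixed-size digest instead of the full line to save memory.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield key


sample = ["یہ ایک جملہ ہے", "دوسرا جملہ", "یہ ایک جملہ ہے ", ""]
unique = list(dedup_lines(sample))
print(len(unique))
```

Because the function is a generator, it can stream over a corpus file line by line without loading the whole file into memory; only the set of digests grows.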

| Corpus | Size |
|---|---|
| mC4 | 17.0 GB |
| OSCAR-2019 | 869 MB |
| OSCAR-2109 | 604 MB |
| OSCAR-2201 | 344 MB |
| OSCAR-2301 | 982 MB |
| Urdu Wikipedia | 364 MB |
| Filtered NLLB Urdu | 2.1 GB |
| Total before deduplication | 22.3 GB |
| Final corpus | 17.0 GB |
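
The totals can be checked with a few lines of arithmetic. The sketch below sums the per-source sizes listed above and derives how much text deduplication removed; it assumes 1 GB = 1000 MB, which the card does not specify.

```python
# Per-source sizes in GB, copied from the table (MB values divided by 1000).
sizes_gb = {
    "mC4": 17.0,
    "OSCAR-2019": 0.869,
    "OSCAR-2109": 0.604,
    "OSCAR-2201": 0.344,
    "OSCAR-2301": 0.982,
    "Urdu Wikipedia": 0.364,
    "Filtered NLLB Urdu": 2.1,
}
final_gb = 17.0

total_gb = sum(sizes_gb.values())
removed_gb = total_gb - final_gb

print(f"total before deduplication: {total_gb:.1f} GB")
print(f"removed by deduplication: {removed_gb:.1f} GB ({removed_gb / total_gb:.1%})")
```

Under that unit assumption, the sources add up to the reported 22.3 GB, and deduplication removes a little under a quarter of the raw text.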

Pre-training

  • Vocabulary size: 96k
  • Training steps: 100k
  • Hardware: 2x H100 GPUs
  • Batch size: 8k

Evaluation Results

Main Results

| Model | UrBLiMP | COUNT19 F1 | USADC F1 | PSL-Kabaddi F1 | IMDB Urdu F1 | Avg. Norm. Eff. |
|---|---|---|---|---|---|---|
| DunbaaBERT-32k | 95.1 | 94.44 | 94.08 | 70.08 | 90.13 | 0.859 |
| DunbaaBERT-52k | 97.0 | 94.91 | 91.75 | 67.60 | 90.14 | 0.795 |
| DunbaaBERT-96k | 94.6 | 95.22 | 89.97 | 70.53 | 90.65 | 0.813 |
| Urdu-RoBERTa-small | 90.5 | 92.08 | 85.36 | 67.06 | 84.72 | 0.781 |
| HPLT-BERT-ur | 97.3 | 95.71 | 93.51 | 71.11 | 89.69 | 0.597 |
| mBERT | 75.5 | 90.88 | 83.03 | 65.78 | 85.47 | 0.744 |
| mmBERT-small | 89.5 | 92.36 | 73.09 | 70.36 | 85.44 | 0.494 |
| mmBERT-base | 92.4 | 93.97 | 77.77 | 67.75 | 87.31 | 0.495 |
| XLM-R-base | 89.6 | 93.72 | 85.22 | 60.56 | 88.69 | 0.754 |
| XLM-R-large | 94.1 | 94.38 | 83.55 | 69.62 | 91.15 | 0.492 |

Efficiency

We report a normalized efficiency metric combining Macro-F1 and inference throughput. Averaged across benchmarks, all three DunbaaBERT variants score higher on this metric (0.795 to 0.859) than every baseline in the Main Results table, the best of which is Urdu-RoBERTa-small at 0.781.

DunbaaBERT-52k achieved the strongest UrBLiMP linguistic probing score among the DunbaaBERT variants (97.0), just behind HPLT-BERT-ur (97.3), while DunbaaBERT-32k had the best average normalized efficiency (0.859). DunbaaBERT-96k ranked second in average efficiency (0.813) despite having the largest vocabulary.
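
These rankings can be read directly off the Main Results table. The short sketch below sorts the UrBLiMP and Avg. Norm. Eff. columns; the values are copied from the table, and the variable names are illustrative only.

```python
# (UrBLiMP, Avg. Norm. Eff.) per model, copied from the Main Results table.
results = {
    "DunbaaBERT-32k": (95.1, 0.859),
    "DunbaaBERT-52k": (97.0, 0.795),
    "DunbaaBERT-96k": (94.6, 0.813),
    "Urdu-RoBERTa-small": (90.5, 0.781),
    "HPLT-BERT-ur": (97.3, 0.597),
    "mBERT": (75.5, 0.744),
    "mmBERT-small": (89.5, 0.494),
    "mmBERT-base": (92.4, 0.495),
    "XLM-R-base": (89.6, 0.754),
    "XLM-R-large": (94.1, 0.492),
}

# Models ordered from highest to lowest on each column.
by_efficiency = sorted(results, key=lambda m: results[m][1], reverse=True)
by_urblimp = sorted(results, key=lambda m: results[m][0], reverse=True)

print("efficiency:", by_efficiency[:4])
print("UrBLiMP:", by_urblimp[:3])
```

The three DunbaaBERT variants occupy the top three efficiency positions, while on UrBLiMP the top two are HPLT-BERT-ur and DunbaaBERT-52k.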

Fairseq Checkpoint

Get the fairseq checkpoint here.

Citation

@misc{maab2026dunbaabertsacrificesemantics,
      title={DunbaaBERT: From Sacrifice to Semantics}, 
      author={Iffat Maab and Waleed Jamil and Raphael Schmitt},
      year={2026},
      eprint={2605.26935},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.26935}, 
}

License

MIT License
