---
license: cc-by-sa-4.0
language:
- ar
base_model:
- U4RASD/NeoAraBERT
tags:
- neoarabert
- neobert
- bert
- MSA
- msa
- modern-standard-arabic
- Classical Arabic
- CA
- Dialect
- dialect
- masked-language-model
- Masked Language Model
- Arabic BERT
- custom_code
pipeline_tag: feature-extraction
library_name: transformers
---
# NeoAraBERT
<table align="right" style="border: none; margin-left: 28px; margin-bottom: 16px; width: 290px;">
  <tr style="border: none;">
    <td align="center" style="border: none; padding: 0; padding-top: 4px;">
      <img src="https://cdn-uploads.huggingface.co/production/uploads/65338533a78e70d19c850120/8ijYrACVusalZ3CIU0rk_.png" width="193.333333" style="border: none; box-shadow: none;">
    </td>
  </tr>
  <tr style="border: none;">
    <td align="center" style="border: none; padding: 0;">
      <img src="https://cdn-uploads.huggingface.co/production/uploads/65338533a78e70d19c850120/jl_hN3qIJtAm-oqlH2BXW.png" width="112" style="border: none; box-shadow: none;">
    </td>
  </tr>
</table>
NeoAraBERT is a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. This project was a collaboration between the Arab Center for Research and Policy Studies’ (ACRPS) Unit for Research In Arabic Social and Digital Spaces (U4RASD) and the American University of Beirut (AUB).

We pretrained NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. Arabic-specific ablation studies on text normalization, light stemming, and diacritics-aware tokenization guided our design choices, alongside more general ablation studies on POS-aware token masking and learning-rate scheduling. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a synonym-based task, [Muradif](https://huggingface.co/datasets/U4RASD/Muradif), that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants rank first in 18 of the 23 tasks and show a +2.75% improvement on average across all tasks.

This is the **NeoAraBERT_Mix** checkpoint, our best-performing checkpoint overall. This model was introduced at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). For more information, visit our website: https://acr.ps/neoarabert.

The available NeoAraBERT checkpoints:

| Model | Description | Link |
|---|---|---|
| NeoAraBERT (NeoAraBERT_Mix) | Trained on both Modern Standard Arabic and Dialectal Arabic. | this repository ✅ |
| NeoAraBERT_MSA | Trained on Modern Standard Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_MSA) |
| NeoAraBERT_DA | Trained on Dialectal Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_DA) |
  
![mix](https://cdn-uploads.huggingface.co/production/uploads/65338533a78e70d19c850120/Xupu7ff-rv7bmu8NYT7Sp.png) 

For detailed benchmarking, see https://acr.ps/neoarabert.
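
When switching between the checkpoints listed in the table, a small lookup can keep the repository IDs in one place. The sketch below is illustrative only: the short keys (`mix`, `msa`, `da`) and the `resolve_checkpoint` helper are hypothetical names, not part of the NeoAraBERT code, and it assumes the MSA and DA variants load the same way as this checkpoint.

```python
# Repository IDs come from the checkpoint table; the short keys are illustrative.
CHECKPOINTS = {
    "mix": "U4RASD/NeoAraBERT",
    "msa": "U4RASD/NeoAraBERT_MSA",
    "da": "U4RASD/NeoAraBERT_DA",
}


def resolve_checkpoint(variant: str = "mix") -> str:
    """Return the Hugging Face repository ID for a NeoAraBERT variant."""
    key = variant.strip().lower()
    if key not in CHECKPOINTS:
        raise ValueError(
            f"Unknown variant {variant!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[key]
```

The returned string can be passed as `model_name` to `AutoTokenizer.from_pretrained` and `AutoModel.from_pretrained` in the How to Use section.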

### How to Use
Install these libraries:
```bash
pip install fast-disambig torch==2.5.1 transformers==4.49.0 xformers==0.0.28.post3 torchvision torchaudio
```
Load the model and use it to generate embeddings:
```python
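import torch
from transformers import AutoModel, AutoTokenizer

model_name = "U4RASD/NeoAraBERT"
# trust_remote_code=True is needed because the repository ships custom code
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()  # disable dropout for inference

# Tokenize input text
text = "المركز العربيّ للأبحاث ودراسة السياسات."
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Use the hidden state of the first token as the text embedding
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)  # (batch_size, hidden_size)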
```
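
Once embeddings are extracted, a common way to compare texts is cosine similarity. The following is a minimal, dependency-free sketch, not the evaluation protocol of the Muradif benchmark: `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, and the toy 3-dimensional vectors stand in for real embeddings.

```python
import math
from typing import List, Mapping, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Mapping[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Sort candidate labels from most to least similar to the query vector."""
    scored = [
        (label, cosine_similarity(query, vector))
        for label, vector in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for real embeddings (illustrative only).
query = [1.0, 0.0, 1.0]
candidates = {
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, -1.0],
    "close": [2.0, 0.0, 2.0],
}
ranking = rank_by_similarity(query, candidates)
print([label for label, _ in ranking])
```

With real model output, each vector can be obtained as a plain list via `embedding[0].detach().tolist()`. For batched tensors, `torch.nn.functional.cosine_similarity` computes the same quantity without leaving PyTorch.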

### Citation
If you use the code, the model, or the Muradif benchmark, please cite:
```bibtex
@inproceedings{abou-chakra-etal-2026-neoarabert,
  title = "{NeoAraBERT}: A Modern Foundation Model for Arabic Embeddings with Diacritics-Aware Tokenization and POS-Targeted Masking",
  author = "Abou Chakra, Chadi and
            Hamoud, Hadi and
            Rakan Al Mraikhat, Osama and
            Abu Obaida, Qusai and
            Ballout, Mohamad and
            Zaraket, Fadi A.",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2026",
  address = "San Diego, California, United States",
  year = "2026",
  note = "Accepted paper",
  url = "https://acr.ps/neoarabert",
  abstract = {We present NeoAraBERT, a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pre-train NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic tailored ablation studies including text normalization, light stemming, and diacritics-aware tokenization handling. We also performed more general POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a novel synonym-based task, ``Muradif'', that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants (MSA, dialectal, and mixed) rank first in 18 tasks, second in two, third in two, and fourth in one task. They show strong performance on classical and modern standard Arabic, substantial margins of improvement ($>$7\%) in two tasks, and a $+$2.75\% improvement on average across all tasks. Our code and links to checkpoints for our model variants are available on our website: \url{https://acr.ps/neoarabert}}
}
```

### Acknowledgements
We would like to acknowledge Ahmad Talal Salman from Assafir and Professor Amer Abdo Mouawad from the American University of Beirut for sharing Assafir data, which was instrumental to the work presented in this paper.

### License
This model is licensed under the CC BY-SA 4.0 license. The text of the license can be found [here](https://creativecommons.org/licenses/by-sa/4.0/).