---
language: ar
language_name: Arabic
language_family: arabic
tags:
  - wikilangs
  - nlp
  - tokenizer
  - embeddings
  - n-gram
  - markov
  - wikipedia
  - feature-extraction
  - sentence-similarity
  - tokenization
  - n-grams
  - markov-chain
  - text-mining
  - fasttext
  - babelvec
  - vocabulous
  - vocabulary
  - monolingual
  - family-arabic
license: mit
library_name: wikilangs
pipeline_tag: text-generation
datasets:
  - omarkamali/wikipedia-monthly
dataset_info:
  name: wikipedia-monthly
  description: Monthly snapshots of Wikipedia articles across 300+ languages
metrics:
  - name: best_compression_ratio
    type: compression
    value: 4.347
  - name: best_isotropy
    type: isotropy
    value: 0.8111
  - name: best_alignment_r10
    type: alignment
    value: 0.7660
  - name: vocabulary_size
    type: vocab
    value: 986324
generated: 2026-03-04
---

# Arabic — Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on **Arabic** Wikipedia by [Wikilangs](https://wikilangs.org).

🌐 [Language Page](https://wikilangs.org/languages/ar/) · 🎮 [Playground](https://wikilangs.org/playground/?lang=ar) · 📊 [Full Research Report](RESEARCH_REPORT.md)

## Language Samples

Example sentences drawn from the Arabic Wikipedia corpus:

> تصغير K \ كي \ هو الحرف الحادي العشر في الأبجدية The Oxford English Dictionary, 2nd ed., online ويمثل هذا الحرف الصوت الطبقي الوقفي المهموس في الكيمياء، يرمز K لعنصر البوتاسيوم مراجع لاتينية

> : إحدى ولايات الولايات المتحدة الأمريكية. مدينة نيويورك: أكبر مدن الولايات المتحدة الأمريكية وإحدى أكبرها في العالم. مقاطعة نيويورك: إحدى مقاطعات ولاية نيويورك. توضيح أسماء أماكن

> أبو إبراهيم الفارابي أديب نحوي لغوي أبو نصر محمد الفارابي فيلسوف مشائي مسلم وطبيب

> إسحاق نيوتن عالم إنجليزي نيوتن وحدة قياس القوة. ذكور إنجليزية توضيح أسماء أماكن

> بوتان (مملكة) بوتان مملكة في جبال الهمالايا بين الهند والصين. بوتان (كيمياء) أحد الألكانات، يتكون من أربع ذرات كربون.

## Quick Start

### Load the Tokenizer

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("ar_tokenizer_32k.model")

text = "استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند"
tokens = sp.EncodeAsPieces(text)
ids    = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))
```

<details>
<summary><b>Tokenization examples (click to expand)</b></summary>

**Sample 1:** `استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁است ودي وه ات ▁أفلام ▁والت ▁دي ز ني ▁أفلام … (+22 more)` | 32 |
| 16k | `▁است ودي وهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منت … (+10 more)` | 20 |
| 32k | `▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more)` | 17 |
| 64k | `▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more)` | 17 |

**Sample 2:** `باسكال قد تعني: الباسكال، وحدة قياس الضغط لغة باسكال، لغة برمجة الفيلسوف باسكال،…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁با سك ال ▁قد ▁تعني : ▁البا سك ال ، … (+29 more)` | 39 |
| 16k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+18 more)` | 28 |
| 32k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more)` | 25 |
| 64k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more)` | 25 |

**Sample 3:** `جمهورية الكونغو الديمقراطية، زائير سابقًا، عاصمتها كينشاسا. جمهورية الكونغو، عاص…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁جمهورية ▁الكون غو ▁الديمقراطية ، ▁ز ائ ير ▁سابق ًا … (+21 more)` | 31 |
| 16k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁ز ائ ير ▁سابقًا ، ▁عاصمتها … (+16 more)` | 26 |
| 32k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائ ير ▁سابقًا ، ▁عاصمتها ▁كينشاسا … (+12 more)` | 22 |
| 64k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائير ▁سابقًا ، ▁عاصمتها ▁كينشاسا . … (+10 more)` | 20 |

</details>
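
The token counts above are what drive the compression figures in the Metrics Summary: fewer tokens for the same text means a higher ratio. Here is a minimal sketch of that relationship, assuming compression is measured as characters per token (this card does not state the definition; the research report is the authority). The helper `compression_ratio` and the hand-written pieces are hypothetical, not real tokenizer output.

```python
def compression_ratio(text: str, pieces: list[str]) -> float:
    """Characters of input per token produced (higher means fewer tokens).

    Assumption: compression = len(text) / number of tokens.
    """
    if not pieces:
        raise ValueError("pieces must be non-empty")
    return len(text) / len(pieces)


# Toy example with hand-written pieces (illustrative, not real tokenizer output).
text = "والت ديزني"
coarse = ["▁والت", "▁ديزني"]          # larger vocabulary: whole words
fine = ["▁والت", "▁دي", "ز", "ني"]    # smaller vocabulary: more splits

print(f"coarse: {compression_ratio(text, coarse):.2f}x")
print(f"fine:   {compression_ratio(text, fine):.2f}x")
```

With the tokenizer loaded as shown earlier, `compression_ratio(text, sp.EncodeAsPieces(text))` would give the per-sentence figure for that vocabulary size.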

### Load Word Embeddings

```python
from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("ar_embeddings_128d_aligned.kv")

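# Example Arabic query ("book"); replace with any word in the vocabulary
query = "كتاب"
if query in wv.key_to_index:
    for word, score in wv.most_similar(query, topn=5):
        print(f"  {word}: {score:.3f}")
else:
    print(f"{query!r} is not in the vocabulary")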
```

### Load N-gram Model

```python
import pyarrow.parquet as pq

df = pq.read_table("ar_3gram_word.parquet").to_pandas()
print(df.head())
```
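
To make concrete what a word-level 3-gram table holds, here is a minimal sketch that counts n-grams over whitespace-separated tokens. It is an illustration only: the column names and tokenization used in `ar_3gram_word.parquet` are not documented in this card, so inspect `df.columns` before relying on any particular layout. The helper `count_ngrams` is hypothetical.

```python
from collections import Counter


def count_ngrams(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Whitespace splitting is a stand-in for the pipeline's real word tokenization.
tokens = "أفلام والت ديزني منتجع والت ديزني العالمي".split()
trigrams = count_ngrams(tokens, 3)

for gram, count in trigrams.most_common(3):
    print(" ".join(gram), count)
```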

## Models Overview

![Performance Dashboard](visualizations/performance_dashboard.png)

| Category | Assets |
|----------|--------|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |

## Metrics Summary

| Component | Model | Key Metric | Value |
|-----------|-------|------------|-------|
| Tokenizer | 8k BPE | Compression | 3.25x |
| Tokenizer | 16k BPE | Compression | 3.65x |
| Tokenizer | 32k BPE | Compression | 4.03x |
| Tokenizer | 64k BPE | Compression | 4.35x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 426 🏆 |
| N-gram | 2-gram (word) | Perplexity | 359,826 |
| N-gram | 3-gram (subword) | Perplexity | 4,163 |
| N-gram | 3-gram (word) | Perplexity | 775,988 |
| N-gram | 4-gram (subword) | Perplexity | 27,277 |
| N-gram | 4-gram (word) | Perplexity | 1,494,234 |
| N-gram | 5-gram (subword) | Perplexity | 133,736 |
| N-gram | 5-gram (word) | Perplexity | 1,059,510 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 0.0% |
| Markov | ctx-2 (subword) | Predictability | 17.3% |
| Markov | ctx-2 (word) | Predictability | 67.4% |
| Markov | ctx-3 (subword) | Predictability | 29.5% |
| Markov | ctx-3 (word) | Predictability | 89.5% |
| Markov | ctx-4 (subword) | Predictability | 35.2% |
| Markov | ctx-4 (word) | Predictability | 96.5% 🏆 |
| Vocabulary | full | Size | 986,324 |
| Vocabulary | full | Zipf R² | 0.9920 |
| Embeddings | mono_32d | Isotropy | 0.8111 |
| Embeddings | mono_64d | Isotropy | 0.7841 |
| Embeddings | mono_128d | Isotropy | 0.7556 |
| Embeddings | aligned_32d | Isotropy | 0.8111 🏆 |
| Embeddings | aligned_64d | Isotropy | 0.7841 |
| Embeddings | aligned_128d | Isotropy | 0.7556 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 13.4% / 35.0% / 48.6% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 28.6% / 54.0% / 65.6% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 37.2% / 65.0% / 76.6% 🏆 |

📊 **[Full ablation study, per-model breakdowns, and interpretation guide →](RESEARCH_REPORT.md)**
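
For readers unfamiliar with the alignment columns, here is a minimal sketch of how a recall-at-k score can be computed. It assumes the common retrieval definition: the fraction of source words whose reference translation appears among the k nearest target vectors by cosine similarity. The evaluation dictionary and exact protocol behind the numbers above are described in the research report, not here; the function names and toy vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(src_vecs, tgt_vecs, gold, k):
    """Fraction of source words whose gold target is among the k nearest targets."""
    hits = 0
    for src_word, target_word in gold.items():
        ranked = sorted(
            tgt_vecs,
            key=lambda t: cosine(src_vecs[src_word], tgt_vecs[t]),
            reverse=True,
        )
        hits += target_word in ranked[:k]
    return hits / len(gold)


# Toy 2-d vectors, invented for illustration only.
src = {"كتاب": [1.0, 0.1], "ماء": [0.2, 1.0]}
tgt = {"book": [0.9, 0.2], "water": [0.8, 0.3], "river": [0.1, 1.0]}
gold = {"كتاب": "book", "ماء": "water"}

for k in (1, 2):
    print(f"R@{k}: {recall_at_k(src, tgt, gold, k):.0%}")
```

Recall can only stay the same or rise as k grows, which matches the R@1 / R@5 / R@10 progression in the table.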

---

## About

Trained on [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly), a dataset of monthly Wikipedia snapshots covering 300+ languages.

A project by **[Wikilangs](https://wikilangs.org)** · Maintainer: [Omar Kamali](https://omarkamali.com) · [Omneity Labs](https://omneitylabs.com)

### Citation

```bibtex
@misc{wikilangs2025,
  author    = {Kamali, Omar},
  title     = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year      = {2025},
  doi       = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url       = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}
```

### Links

- 🌐 [wikilangs.org](https://wikilangs.org)
- 🌍 [Language page](https://wikilangs.org/languages/ar/)
- 🎮 [Playground](https://wikilangs.org/playground/?lang=ar)
- 🤗 [HuggingFace models](https://huggingface.co/wikilangs)
- 📊 [wikipedia-monthly dataset](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
- 👤 [Omar Kamali](https://huggingface.co/omarkamali)
- 🤝 Sponsor: [Featherless AI](https://featherless.ai)

**License:** MIT — free for academic and commercial use.

---
*Generated by Wikilangs Pipeline · 2026-03-04 13:56:39*