	---
	language: ar
	language_name: Arabic
	language_family: arabic
	tags:
	- wikilangs
	- nlp
	- tokenizer
	- embeddings
	- n-gram
	- markov
	- wikipedia
	- feature-extraction
	- sentence-similarity
	- tokenization
	- n-grams
	- markov-chain
	- text-mining
	- fasttext
	- babelvec
	- vocabulous
	- vocabulary
	- monolingual
	- family-arabic
	license: mit
	library_name: wikilangs
	pipeline_tag: text-generation
	datasets:
	- omarkamali/wikipedia-monthly
	dataset_info:
	  name: wikipedia-monthly
	  description: Monthly snapshots of Wikipedia articles across 300+ languages
	metrics:
	- name: best_compression_ratio
	  type: compression
	  value: 4.347
	- name: best_isotropy
	  type: isotropy
	  value: 0.8111
	- name: best_alignment_r10
	  type: alignment
	  value: 0.7660
	- name: vocabulary_size
	  type: vocab
	  value: 986324
	generated: 2026-03-04
	---

	# Arabic — Wikilangs Models

	Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on Arabic Wikipedia by [Wikilangs](https://wikilangs.org).

	🌐 [Language Page](https://wikilangs.org/languages/ar/) · 🎮 [Playground](https://wikilangs.org/playground/?lang=ar) · 📊 [Full Research Report](RESEARCH_REPORT.md)

	## Language Samples

	Example sentences drawn from the Arabic Wikipedia corpus:

	> تصغير K \ كي \ هو الحرف الحادي العشر في الأبجدية The Oxford English Dictionary, 2nd ed., online ويمثل هذا الحرف الصوت الطبقي الوقفي المهموس في الكيمياء، يرمز K لعنصر البوتاسيوم مراجع لاتينية

	> : إحدى ولايات الولايات المتحدة الأمريكية. مدينة نيويورك: أكبر مدن الولايات المتحدة الأمريكية وإحدى أكبرها في العالم. مقاطعة نيويورك: إحدى مقاطعات ولاية نيويورك. توضيح أسماء أماكن

	> أبو إبراهيم الفارابي أديب نحوي لغوي أبو نصر محمد الفارابي فيلسوف مشائي مسلم وطبيب

	> إسحاق نيوتن عالم إنجليزي نيوتن وحدة قياس القوة. ذكور إنجليزية توضيح أسماء أماكن

	> بوتان (مملكة) بوتان مملكة في جبال الهمالايا بين الهند والصين. بوتان (كيمياء) أحد الألكانات، يتكون من أربع ذرات كربون.

	## Quick Start

	### Load the Tokenizer

	```python
	import sentencepiece as spm

	sp = spm.SentencePieceProcessor()
	sp.Load("ar_tokenizer_32k.model")

	text = "استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند"
	tokens = sp.EncodeAsPieces(text)
	ids = sp.EncodeAsIds(text)

	print(tokens) # subword pieces
	print(ids) # integer ids

	# Decode back
	print(sp.DecodeIds(ids))
	```

	<details>
	<summary><b>Tokenization examples (click to expand)</b></summary>

	Sample 1: `استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند…`

	| Vocab | Tokens | Count |
	|-------|--------|-------|
	| 8k | `▁است ودي وه ات ▁أفلام ▁والت ▁دي ز ني ▁أفلام … (+22 more)` | 32 |
	| 16k | `▁است ودي وهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منت … (+10 more)` | 20 |
	| 32k | `▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more)` | 17 |
	| 64k | `▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more)` | 17 |

	Sample 2: `باسكال قد تعني: الباسكال، وحدة قياس الضغط لغة باسكال، لغة برمجة الفيلسوف باسكال،…`

	| Vocab | Tokens | Count |
	|-------|--------|-------|
	| 8k | `▁با سك ال ▁قد ▁تعني : ▁البا سك ال ، … (+29 more)` | 39 |
	| 16k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+18 more)` | 28 |
	| 32k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more)` | 25 |
	| 64k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more)` | 25 |

	Sample 3: `جمهورية الكونغو الديمقراطية، زائير سابقًا، عاصمتها كينشاسا. جمهورية الكونغو، عاص…`

	| Vocab | Tokens | Count |
	|-------|--------|-------|
	| 8k | `▁جمهورية ▁الكون غو ▁الديمقراطية ، ▁ز ائ ير ▁سابق ًا … (+21 more)` | 31 |
	| 16k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁ز ائ ير ▁سابقًا ، ▁عاصمتها … (+16 more)` | 26 |
	| 32k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائ ير ▁سابقًا ، ▁عاصمتها ▁كينشاسا … (+12 more)` | 22 |
	| 64k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائير ▁سابقًا ، ▁عاصمتها ▁كينشاسا . … (+10 more)` | 20 |

	</details>
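
The token counts in the examples above drop as the vocabulary grows. A minimal sketch for repeating that comparison on your own text follows. It assumes the other tokenizer files follow the same naming pattern as `ar_tokenizer_32k.model` (only the 32k file name appears in this README), and it measures compression as characters per token, which may not match the definition behind the reported compression metric. `compression_ratio` and `compare_vocab_sizes` are hypothetical helper names.

```python
def compression_ratio(text, pieces):
    """Characters of input per token (an assumed definition of compression)."""
    if not pieces:
        raise ValueError("tokenizer produced no pieces")
    return len(text) / len(pieces)


def compare_vocab_sizes(text, sizes=("8k", "16k", "32k", "64k")):
    """Tokenize `text` with each tokenizer and return {size: (count, ratio)}.

    Assumes model files named ar_tokenizer_<size>.model in the working directory.
    """
    # Imported lazily so compression_ratio stays dependency-free.
    import sentencepiece as spm

    results = {}
    for size in sizes:
        sp = spm.SentencePieceProcessor()
        sp.Load(f"ar_tokenizer_{size}.model")
        pieces = sp.EncodeAsPieces(text)
        results[size] = (len(pieces), compression_ratio(text, pieces))
    return results
```

With the model files present, `compare_vocab_sizes("جمهورية الكونغو الديمقراطية")` returns one `(token_count, chars_per_token)` pair per vocabulary size.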

	### Load Word Embeddings

	```python
	from gensim.models import KeyedVectors

	# Aligned embeddings (cross-lingual, mapped to English vector space)
	wv = KeyedVectors.load("ar_embeddings_128d_aligned.kv")

	# "word" is a placeholder: substitute an Arabic word from the vocabulary
	similar = wv.most_similar("word", topn=5)
	for word, score in similar:
	    print(f" {word}: {score:.3f}")
	```
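
gensim's `most_similar` raises `KeyError` when the query is not in the vocabulary, which is easy to hit with user-supplied text. The sketch below wraps the lookup in a membership check first. `similar_or_empty` is a hypothetical helper name, and returning an empty list for unknown words is one possible policy, not something this project prescribes.

```python
def similar_or_empty(wv, word, topn=5):
    """Return the nearest neighbours of `word`, or [] if it is out of vocabulary.

    `wv` is expected to behave like gensim's KeyedVectors, which supports
    `word in wv` membership tests and `most_similar(word, topn=...)`.
    """
    if word not in wv:
        return []
    return wv.most_similar(word, topn=topn)
```

Usage, with `wv` loaded as in the snippet above: `similar_or_empty(wv, "باسكال")` yields a list of `(word, score)` pairs, or `[]` if the word is not in the vocabulary.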

	### Load N-gram Model

	```python
	import pyarrow.parquet as pq

	df = pq.read_table("ar_3gram_word.parquet").to_pandas()
	print(df.head())
	```
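
The Metrics Summary reports a perplexity for each n-gram model, but this README does not say how it is computed or which columns the Parquet table contains. For reference, the sketch below implements the textbook definition: the exponential of the mean negative log-probability over a sequence. `perplexity` is a hypothetical helper, and the per-token probabilities would have to be derived from the table once its schema is known (`df.columns` shows it).

```python
import math


def perplexity(probabilities):
    """Perplexity of a sequence given the model probability of each token."""
    if not probabilities:
        raise ValueError("need at least one probability")
    if any(p <= 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in (0, 1]")
    # Mean negative log-likelihood, then exponentiate.
    mean_nll = -sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(mean_nll)
```

A model that assigns probability 0.25 to every token has a perplexity of 4: it is as uncertain as a uniform choice among four options.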

	## Models Overview

	![Performance Dashboard](visualizations/performance_dashboard.png)

	| Category | Assets |
	|----------|--------|
	| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
	| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
	| Markov chains | Context 1–5 (word & subword) |
	| Embeddings | 32d, 64d, 128d (mono & aligned) |
	| Vocabulary | Full frequency list + Zipf analysis |
	| Statistics | Corpus & model statistics JSON |
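
To make "context" in the Markov chain row concrete: a context-k chain predicts the next token from the previous k tokens. The sketch below builds such a table from a token list and looks up the most frequent continuation. It illustrates the general technique only; the storage format of the released Markov assets is not described in this README, and `build_markov` and `most_likely_next` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_markov(tokens, context=2):
    """Map each tuple of `context` consecutive tokens to a Counter of next tokens."""
    if context < 1:
        raise ValueError("context must be >= 1")
    table = defaultdict(Counter)
    for i in range(len(tokens) - context):
        state = tuple(tokens[i:i + context])
        table[state][tokens[i + context]] += 1
    return table


def most_likely_next(table, state):
    """Return the most frequent continuation of `state`, or None if unseen."""
    counts = table.get(tuple(state))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The tokens can be whitespace-split words or subword pieces from `sp.EncodeAsPieces`, mirroring the word and subword variants listed in the table.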

	## Metrics Summary

	| Component | Model | Key Metric | Value |
	|-----------|-------|------------|-------|
	| Tokenizer | 8k BPE | Compression | 3.25x |
	| Tokenizer | 16k BPE | Compression | 3.65x |
	| Tokenizer | 32k BPE | Compression | 4.03x |
	| Tokenizer | 64k BPE | Compression | 4.35x 🏆 |
	| N-gram | 2-gram (subword) | Perplexity | 426 🏆 |
	| N-gram | 2-gram (word) | Perplexity | 359,826 |
	| N-gram | 3-gram (subword) | Perplexity | 4,163 |
	| N-gram | 3-gram (word) | Perplexity | 775,988 |
	| N-gram | 4-gram (subword) | Perplexity | 27,277 |
	| N-gram | 4-gram (word) | Perplexity | 1,494,234 |
	| N-gram | 5-gram (subword) | Perplexity | 133,736 |
	| N-gram | 5-gram (word) | Perplexity | 1,059,510 |
	| Markov | ctx-1 (subword) | Predictability | 0.0% |
	| Markov | ctx-1 (word) | Predictability | 0.0% |
	| Markov | ctx-2 (subword) | Predictability | 17.3% |
	| Markov | ctx-2 (word) | Predictability | 67.4% |
	| Markov | ctx-3 (subword) | Predictability | 29.5% |
	| Markov | ctx-3 (word) | Predictability | 89.5% |
	| Markov | ctx-4 (subword) | Predictability | 35.2% |
	| Markov | ctx-4 (word) | Predictability | 96.5% 🏆 |
	| Vocabulary | full | Size | 986,324 |
	| Vocabulary | full | Zipf R² | 0.9920 |
	| Embeddings | mono_32d | Isotropy | 0.8111 |
	| Embeddings | mono_64d | Isotropy | 0.7841 |
	| Embeddings | mono_128d | Isotropy | 0.7556 |
	| Embeddings | aligned_32d | Isotropy | 0.8111 🏆 |
	| Embeddings | aligned_64d | Isotropy | 0.7841 |
	| Embeddings | aligned_128d | Isotropy | 0.7556 |
	| Alignment | aligned_32d | R@1 / R@5 / R@10 | 13.4% / 35.0% / 48.6% |
	| Alignment | aligned_64d | R@1 / R@5 / R@10 | 28.6% / 54.0% / 65.6% |
	| Alignment | aligned_128d | R@1 / R@5 / R@10 | 37.2% / 65.0% / 76.6% 🏆 |

	📊 [Full ablation study, per-model breakdowns, and interpretation guide →](RESEARCH_REPORT.md)
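
The alignment rows report R@1 / R@5 / R@10. This README does not define the metric, so the following is a sketch under the common reading: recall at k, the share of source words whose reference translation appears among the k nearest neighbours retrieved in the target space. `recall_at_k` and both argument shapes are assumptions made for illustration.

```python
def recall_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose gold target is among the top-k candidates.

    ranked_candidates: {query: [candidate, ...]}, best candidate first
    gold:              {query: expected_target}
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if not gold:
        raise ValueError("gold dictionary is empty")
    hits = sum(
        1
        for query, target in gold.items()
        if target in ranked_candidates.get(query, [])[:k]
    )
    return hits / len(gold)
```

Recall can only grow with k, which matches the pattern in the table: each model's R@10 is higher than its R@5, which is higher than its R@1.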

	---

	## About

	Trained on [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly), a dataset of monthly Wikipedia snapshots covering 300+ languages.

	A project by [Wikilangs](https://wikilangs.org) · Maintainer: [Omar Kamali](https://omarkamali.com) · [Omneity Labs](https://omneitylabs.com)

	### Citation

	```bibtex
	@misc{wikilangs2025,
	  author = {Kamali, Omar},
	  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
	  year = {2025},
	  doi = {10.5281/zenodo.18073153},
	  publisher = {Zenodo},
	  url = {https://huggingface.co/wikilangs},
	  institution = {Omneity Labs}
	}
	```

	### Links

	- 🌐 [wikilangs.org](https://wikilangs.org)
	- 🌍 [Language page](https://wikilangs.org/languages/ar/)
	- 🎮 [Playground](https://wikilangs.org/playground/?lang=ar)
	- 🤗 [HuggingFace models](https://huggingface.co/wikilangs)
	- 📊 [wikipedia-monthly dataset](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
	- 👤 [Omar Kamali](https://huggingface.co/omarkamali)
	- 🤝 Sponsor: [Featherless AI](https://featherless.ai)

	License: MIT — free for academic and commercial use.

	---
	Generated by Wikilangs Pipeline · 2026-03-04 13:56:39