Pashto KenLM 5-gram Language Model
A 5-gram language model for Pashto (پښتو) trained with Modified Kneser-Ney smoothing on 24 million unique sentences drawn from the PashtoCorp corpus (1.25 billion words). Three formats are provided: TRIE (recommended), PROBING (fastest), and ARPA (interoperability).
Performance
Evaluated on 50,000 held-out sentences (1,381,915 words) not seen during training.
| Metric | Value |
|---|---|
| Perplexity | 109.78 |
| OOV rate | 0.281% |
| Vocabulary size | 7,294,164 unigrams |
| Training sentences | 24,035,371 |
| Evaluation sentences | 50,000 |
A 0.281% OOV rate means that about 99.7% of the word tokens in the held-out set are in the model's vocabulary, which is consistent with training on a 1.25B-word corpus spanning news, books, web text, and radio transcripts.
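In absolute terms, 0.281% of 1,381,915 words is roughly 3,900 OOV tokens. The sketch below shows one way corpus-level perplexity and OOV rate could be computed from per-token scores. It is not confirmed to be the procedure behind the table: it assumes the KenLM convention of counting one `</s>` per sentence in the perplexity normalizer and words only in the OOV denominator. `corpus_metrics` is a hypothetical helper name; `model` is any object with a KenLM-style `full_scores(sentence)` method.

```python
def corpus_metrics(model, sentences):
    """Return (perplexity, oov_rate) over whitespace-tokenized sentences.

    Assumes model.full_scores(sentence) yields one
    (log10_prob, ngram_length, is_oov) tuple per word plus one for </s>.
    """
    total_log10 = 0.0
    scored_tokens = 0  # words plus one </s> per sentence
    words = 0          # words only, used as the OOV denominator
    oov = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words == 0:
            continue  # skip empty lines
        for log10_p, _ngram_len, is_oov in model.full_scores(sentence):
            total_log10 += log10_p
            scored_tokens += 1
            oov += bool(is_oov)
        words += n_words
    if scored_tokens == 0:
        raise ValueError("no non-empty sentences to score")
    perplexity = 10 ** (-total_log10 / scored_tokens)
    return perplexity, oov / words
```

With the real model this would be called as `corpus_metrics(kenlm.Model("pashto_5gram.trie"), open("heldout.txt", encoding="utf-8"))`, where `heldout.txt` is a placeholder file name.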
Model files
| File | Format | Size | Use case |
|---|---|---|---|
| pashto_5gram.trie | TRIE binary | 3.0 GB | Recommended: good balance of size and speed |
| pashto_5gram.binary | PROBING binary | 5.4 GB | Fastest query speed (hash-table lookup) |
| pashto_5gram.arpa | ARPA text | 10.4 GB | Interoperability with other LM toolkits |
Training details
| Setting | Value |
|---|---|
| Order | 5-gram |
| Smoothing | Modified Kneser-Ney |
| Pruning | Singleton pruning on 3/4/5-grams |
| Training corpus | PashtoCorp (24M unique sentences) |
| Tool | KenLM |
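The exact training command is not given in this card. As a rough sketch, the settings above map onto KenLM's `lmplz` and `build_binary` command-line tools as follows. Modified Kneser-Ney is what `lmplz` estimates by default, and `--prune 0 0 1 1 1` drops singleton 3-, 4-, and 5-grams. The helper name and all file paths are hypothetical.

```python
def kenlm_build_commands(corpus_txt, arpa_out, trie_out, order=5):
    """Build argv lists for lmplz and build_binary (paths are placeholders)."""
    if order < 3:
        raise ValueError("this sketch prunes 3-grams and above, so order must be >= 3")
    # One threshold per order: 0 keeps everything, 1 drops n-grams seen once.
    prune = ["0", "0"] + ["1"] * (order - 2)
    lmplz = ["lmplz", "-o", str(order), "--prune", *prune,
             "--text", corpus_txt, "--arpa", arpa_out]
    # "trie" selects the TRIE binary format; "probing" would select PROBING.
    build_binary = ["build_binary", "trie", arpa_out, trie_out]
    return lmplz, build_binary
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the KenLM binaries are on `PATH`.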
Installation
Install the KenLM Python bindings directly from the upstream source archive:
pip install https://github.com/kpu/kenlm/archive/master.zip
Download the model file:
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="ihanif/pashto-kenlm-5gram",
    filename="pashto_5gram.trie",
)
Usage examples
Basic sentence scoring
import kenlm
model = kenlm.Model("pashto_5gram.trie")
sentence = "د افغانستان خلک د سولې غوښتونکي دي"
log_prob = model.score(sentence, bos=True, eos=True)
print(f"log₁₀ probability: {log_prob:.4f}")
Perplexity of a sentence
import kenlm
model = kenlm.Model("pashto_5gram.trie")
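def perplexity(model, sentence):
    # score() with eos=True also scores the end-of-sentence token </s>,
    # so count it in the normalizer, as KenLM's own Model.perplexity() does.
    tokens = len(sentence.split()) + 1
    score = model.score(sentence, bos=True, eos=True)  # log10 probability
    return 10 ** (-score / tokens)
sentences = [
    "د افغانستان خلک د سولې غوښتونکي دي",
    "کابل د افغانستان پلازمینه ده",
]
for s in sentences:
    print(f"PPL={perplexity(model, s):.1f} {s}")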
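The exponent in `perplexity()` comes from the definition of perplexity as the inverse geometric mean of the per-token probabilities. Written with KenLM's base-10 log scores, where N is the token count used for normalization (KenLM's built-in `Model.perplexity` counts the closing `</s>` as one of those tokens) and contexts are shorter near the start of the sentence:

```latex
\mathrm{PPL} = 10^{-\frac{1}{N} \sum_{i=1}^{N} \log_{10} p(w_i \mid w_{i-4}, \dots, w_{i-1})}
```

As a quick check, if each of the N tokens has a log10 probability of -2, the sum is -2N and the perplexity is 10^2 = 100.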
Word-level scores (for debugging or ASR rescoring)
import kenlm
model = kenlm.Model("pashto_5gram.trie")
sentence = "د افغانستان خلک د سولې غوښتونکي دي"
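# full_scores() yields one tuple per word plus a final one for </s>.
words = sentence.split() + ["</s>"]
for word, (log_p, ngram_len, is_oov) in zip(words, model.full_scores(sentence)):
    status = "OOV" if is_oov else f"{ngram_len}-gram"
    print(f"{log_p:8.4f} [{status}] {word}")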
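For debugging, it can help to list only the tokens that look suspicious instead of printing every score. The following is a minimal sketch under stated assumptions: `flag_tokens` and the `min_log10=-6.0` cutoff are hypothetical and not part of KenLM, and `scores` is expected to be the output of `model.full_scores(sentence)` with its default `bos=True, eos=True`.

```python
def flag_tokens(sentence, scores, min_log10=-6.0):
    """Return (token, log10_prob, ngram_length, is_oov) for suspicious tokens.

    A token is flagged if the model marks it OOV or if its log10
    probability falls below min_log10 (an arbitrary illustrative cutoff).
    """
    # full_scores() emits one tuple per word, then one for </s>.
    tokens = sentence.split() + ["</s>"]
    flagged = []
    for token, (log10_p, ngram_len, is_oov) in zip(tokens, scores):
        if is_oov or log10_p < min_log10:
            flagged.append((token, log10_p, ngram_len, bool(is_oov)))
    return flagged
```

In an ASR pipeline, a cluster of flagged tokens in one hypothesis often points at a misrecognized span rather than at a single bad word.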
Text quality filtering
Use perplexity to filter low-quality or noisy text: high perplexity indicates that the text is unlike the Pashto the model was trained on.
import kenlm
model = kenlm.Model("pashto_5gram.trie")
def is_natural_pashto(model, text, max_ppl=500):
    words = len(text.split())
    if words < 3:
        # Too short for a stable perplexity estimate; treat as unusable.
        return False
    score = model.score(text, bos=True, eos=True)  # log10 probability
    # +1 counts the </s> token that eos=True adds to the score.
    ppl = 10 ** (-score / (words + 1))
    return ppl < max_ppl
texts = [
    "د افغانستان خلک د سولې غوښتونکي دي",  # natural → keep
    "xyz abc 123 foo bar",  # noise → drop
]
for t in texts:
    print(f"{'KEEP' if is_natural_pashto(model, t) else 'DROP'} {t}")
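The `max_ppl=500` default above is a judgment call. One way to choose a cutoff is to score a sample of text you trust and take a high percentile of its perplexities. The sketch below does this with the nearest-rank method. `ppl_threshold` and `keep_fraction` are hypothetical names, and the input perplexities are assumed to have been computed beforehand with the same normalization as the filter.

```python
import math


def ppl_threshold(clean_ppls, keep_fraction=0.95):
    """Return the perplexity at or below which keep_fraction of a trusted sample falls."""
    if not clean_ppls:
        raise ValueError("need at least one perplexity value")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(clean_ppls)
    # Nearest-rank percentile: the smallest value that has at least
    # keep_fraction of the sample at or below it.
    rank = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[rank - 1]
```

A cutoff derived this way is specific to the domain of the trusted sample, so it is worth recomputing when the target text changes (for example, from news to social media).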
ASR hypothesis rescoring
KenLM is widely used to rescore ASR n-best lists. Combine acoustic model scores with LM log-probabilities using a tunable weight λ:
import kenlm
lm = kenlm.Model("pashto_5gram.trie")
def rescore(hypotheses, lm, lm_weight=0.3):
    """
    hypotheses: list of (acoustic_score, text) tuples
    Returns the hypothesis with the highest combined score.
    """
    best_score, best_hyp = float("-inf"), ""
    for acoustic_score, text in hypotheses:
        words = len(text.split()) or 1  # avoid division by zero on empty text
        # Length-normalized LM score (log10 probability per word).
        lm_score = lm.score(text, bos=True, eos=True) / words
        combined = acoustic_score + lm_weight * lm_score
        if combined > best_score:
            best_score, best_hyp = combined, text
    return best_hyp
# Example n-best list (acoustic_score, hypothesis)
nbest = [
    (-12.3, "د افغانستان خلک"),
    (-11.9, "د افغانستان خالک"),
    (-13.1, "د افغانستان هلک"),
]
print(rescore(nbest, lm))
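The weight λ (`lm_weight` above) is usually tuned on a development set. The following is a minimal grid-search sketch, not a recommended recipe: it assumes LM scores have already been computed for every hypothesis, and it uses exact-match sentence accuracy as a stand-in for word error rate to keep the example short. `tune_lm_weight` and the data layout are hypothetical.

```python
def tune_lm_weight(dev_set, candidate_weights):
    """Pick the LM weight that gets the most dev utterances exactly right.

    dev_set: list of (nbest, reference) pairs, where nbest is a list of
             (acoustic_score, lm_score, text) tuples.
    Returns (best_weight, number_of_correct_utterances).
    """
    best_weight, best_correct = None, -1
    for weight in candidate_weights:
        correct = 0
        for nbest, reference in dev_set:
            # Same linear combination as rescore(), with a precomputed LM score.
            top = max(nbest, key=lambda h: h[0] + weight * h[1])
            correct += top[2] == reference
        # Strict comparison keeps the first (smallest listed) weight on ties.
        if correct > best_correct:
            best_weight, best_correct = weight, correct
    return best_weight, best_correct
```

Because ties keep the earliest candidate, listing weights in ascending order favors the smallest weight that reaches the best accuracy.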
Corpus deduplication / quality ranking
Score a corpus and keep only sentences below a perplexity threshold:
import kenlm
model = kenlm.Model("pashto_5gram.trie")
def score_file(model, input_path, output_path, max_ppl=300):
    """Copy sentences with perplexity below max_ppl; return how many were kept."""
    kept = 0
    with open(input_path, encoding="utf-8") as fin, \
         open(output_path, "w", encoding="utf-8") as fout:
        for line in fin:
            sentence = line.strip()
            if not sentence:
                continue
            words = len(sentence.split())
            if words < 3:
                continue
            # score() defaults to bos=True, eos=True; +1 counts the </s> token.
            ppl = 10 ** (-model.score(sentence) / (words + 1))
            if ppl < max_ppl:
                fout.write(line)
                kept += 1
    return kept
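The function above filters by a threshold but does not remove duplicates or produce a ranking. A minimal sketch of those two steps follows, assuming exact-match deduplication after whitespace normalization is enough (near-duplicate detection is out of scope here). `dedupe_and_rank` is a hypothetical helper, and `ppl_fn` is any callable mapping a sentence to a perplexity, for example KenLM's `model.perplexity`.

```python
def dedupe_and_rank(sentences, ppl_fn):
    """Drop exact duplicates and return (perplexity, sentence) pairs, best first."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.split())  # collapse runs of whitespace
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    # Lowest perplexity (most model-like) first; sorted() is stable on ties.
    return sorted(((ppl_fn(s), s) for s in unique), key=lambda pair: pair[0])
```

Taking the first k entries of the result gives a top-k selection instead of a fixed perplexity cutoff.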
Related resources
- Corpus: ihanif/pashto-corpus — 1.25B-word Pashto corpus
- Encoder: ihanif/xlmr-pashto — XLM-R continued pre-training on PashtoCorp
- Code: ihanif/corpus_builder
Citation
If you use this model in your research, please cite:
@misc{rahman2026pashtocorp,
title = {PashtoCorp: A 1.25B-Word Corpus, Evaluation Suite, and
Reproducible Pipeline for Low-Resource Language Development},
author = {Rahman, Hanif},
year = {2026},
howpublished = {\url{https://huggingface.co/ihanif/pashto-kenlm-5gram}},
}
Evaluation results
- Perplexity (5-gram MKN) on PashtoCorp held-out set (self-reported): 109.78