Pashto KenLM 5-gram Language Model

A 5-gram language model for Pashto (پښتو) trained with Modified Kneser-Ney smoothing on 24 million unique sentences drawn from the PashtoCorp corpus (1.25 billion words). Three formats are provided: TRIE (recommended), PROBING (fastest), and ARPA (interoperability).

Performance

Evaluated on 50,000 held-out sentences (1,381,915 words) not seen during training.

Metric	Value
Perplexity	109.78
OOV rate	0.281%
Vocabulary size	7,294,164 unigrams
Training sentences	24,035,371
Evaluation sentences	50,000

A 0.28% OOV rate means that about 99.7% of the word tokens in the held-out set are in-vocabulary. This coverage reflects training on a 1.25B-word corpus spanning news, books, web text, and radio transcripts; text from other domains or with non-standard spelling may show a higher OOV rate.
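
The card does not say exactly how these figures were computed. As a rough sketch of one common convention, the function below sums log10 probabilities over a held-out set, counts the end-of-sentence token in the perplexity denominator, and reports the share of word tokens flagged as OOV. It assumes `model` is a loaded `kenlm.Model` (or any object with a compatible `full_scores` method); `corpus_stats` is a hypothetical helper name.

```python
def corpus_stats(model, sentences):
    """Return (perplexity, oov_rate) over whitespace-tokenized sentences."""
    total_log10 = 0.0   # sum of log10 probabilities, including </s>
    scored_tokens = 0   # words plus one </s> per sentence
    words = 0           # word tokens only
    oov = 0             # word tokens flagged as out-of-vocabulary
    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words == 0:
            continue  # skip empty lines
        # full_scores yields (log10_prob, ngram_length, is_oov) per word,
        # followed by one entry for </s> when eos=True.
        scores = model.full_scores(sentence, bos=True, eos=True)
        for i, (log_p, _ngram_len, is_oov) in enumerate(scores):
            total_log10 += log_p
            scored_tokens += 1
            if i < n_words:  # the final entry is </s>, not a word
                words += 1
                oov += int(is_oov)
    if scored_tokens == 0:
        raise ValueError("no non-empty sentences")
    perplexity = 10 ** (-total_log10 / scored_tokens)
    return perplexity, oov / words
```

Whether the end-of-sentence token is counted changes the reported perplexity, so numbers are only comparable when the same convention is used.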

Model files

File	Format	Size	Use case
`pashto_5gram.trie`	TRIE binary	3.0 GB	Recommended — good balance of size and speed
`pashto_5gram.binary`	PROBING binary	5.4 GB	Fastest query speed (hash-table lookup)
`pashto_5gram.arpa`	ARPA text	10.4 GB	Interoperability with other LM toolkits

Training details

Setting	Value
Order	5-gram
Smoothing	Modified Kneser-Ney
Pruning	Singleton pruning on 3/4/5-grams
Training corpus	PashtoCorp (24M unique sentences)
Tool	KenLM
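
The exact training command is not given here. As a sketch under that caveat, the settings in the table map onto KenLM's `lmplz` and `build_binary` tools roughly as follows. The input file name `corpus.txt` and the helper `kenlm_commands` are placeholders, and `--prune 0 0 1 1 1` is the usual way to drop singleton 3-, 4-, and 5-grams while keeping all unigrams and bigrams.

```python
import shlex

def kenlm_commands(text_path, stem, order=5, prune=(0, 0, 1, 1, 1)):
    """Build (but do not run) command lines for estimating and binarizing a model."""
    arpa = f"{stem}.arpa"
    # lmplz estimates a Modified Kneser-Ney model and writes it in ARPA format.
    lmplz = ["lmplz", "-o", str(order),
             "--prune", *map(str, prune),
             "--text", text_path, "--arpa", arpa]
    # build_binary converts the ARPA file into the two binary formats.
    to_trie = ["build_binary", "trie", arpa, f"{stem}.trie"]
    to_probing = ["build_binary", "probing", arpa, f"{stem}.binary"]
    return [lmplz, to_trie, to_probing]

# Print the commands; each list could also be passed to subprocess.run(cmd, check=True).
for cmd in kenlm_commands("corpus.txt", "pashto_5gram"):
    print(shlex.join(cmd))
```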

Installation

Install the KenLM Python bindings directly from source (building them requires a C++ toolchain):

pip install https://github.com/kpu/kenlm/archive/master.zip

Download the model file:

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="ihanif/pashto-kenlm-5gram",
    filename="pashto_5gram.trie",
)

Usage examples

Basic sentence scoring

import kenlm

# Pass a local path, or the model_path returned by hf_hub_download above.
model = kenlm.Model("pashto_5gram.trie")

sentence = "د افغانستان خلک د سولې غوښتونکي دي"
log_prob = model.score(sentence, bos=True, eos=True)
print(f"log₁₀ probability: {log_prob:.4f}")

Perplexity of a sentence

import kenlm

model = kenlm.Model("pashto_5gram.trie")

def perplexity(model, sentence):
    # With eos=True the score includes </s>, so count it in the denominator.
    tokens = len(sentence.split()) + 1
    score = model.score(sentence, bos=True, eos=True)  # log10 probability
    return 10 ** (-score / tokens)

sentences = [
    "د افغانستان خلک د سولې غوښتونکي دي",
    "کابل د افغانستان پلازمینه ده",
]
for s in sentences:
    print(f"PPL={perplexity(model, s):.1f}  {s}")

Word-level scores (for debugging or ASR rescoring)

import kenlm

model = kenlm.Model("pashto_5gram.trie")

sentence = "د افغانستان خلک د سولې غوښتونکي دي"
for log_p, ngram_len, is_oov in model.full_scores(sentence):
    status = "OOV" if is_oov else f"{ngram_len}-gram"
    print(f"{log_p:8.4f}  [{status}]")
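
`full_scores` returns one tuple per word plus a final one for the end-of-sentence token, so the loop above prints scores without saying which token each belongs to. A small sketch that pairs every score with its token and picks out the least likely word, assuming whitespace tokenization; `label_scores` and `least_likely_word` are hypothetical helper names:

```python
def label_scores(model, sentence):
    """Pair each token (words, then </s>) with its log10 probability and OOV flag."""
    tokens = sentence.split() + ["</s>"]
    scores = model.full_scores(sentence, bos=True, eos=True)
    return [(tok, log_p, is_oov)
            for tok, (log_p, _ngram_len, is_oov) in zip(tokens, scores)]

def least_likely_word(model, sentence):
    """Return the (token, log10_prob, is_oov) triple with the lowest probability.

    Returns None for an empty sentence. The </s> entry is ignored.
    """
    labelled = label_scores(model, sentence)
    return min(labelled[:-1], key=lambda item: item[1], default=None)
```

In ASR rescoring or spell checking, the least likely word is often a reasonable first place to look for a recognition or spelling error.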

Text quality filtering

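Use perplexity to filter low-quality or noisy text. High perplexity indicates that the text is unlike the Pashto seen in training (for example other languages, boilerplate, or garbled characters). The threshold below is a starting point and should be tuned on a sample of your own data.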

import kenlm

model = kenlm.Model("pashto_5gram.trie")

def is_natural_pashto(model, text, max_ppl=500):
    words = len(text.split())
    if words < 3:
        return False  # too short for a reliable estimate
    score = model.score(text, bos=True, eos=True)  # log10 probability
    ppl = 10 ** (-score / (words + 1))  # +1 for the </s> token
    return ppl < max_ppl

texts = [
    "د افغانستان خلک د سولې غوښتونکي دي",   # natural → keep
    "xyz abc 123 foo bar",                      # noise → drop
]
for t in texts:
    print(f"{'KEEP' if is_natural_pashto(model, t) else 'DROP'}  {t}")

ASR hypothesis rescoring

KenLM is widely used to rescore ASR n-best lists. Combine acoustic model scores with LM log-probabilities using a tunable weight λ:

import kenlm

lm = kenlm.Model("pashto_5gram.trie")

def rescore(hypotheses, lm, lm_weight=0.3):
    """
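    hypotheses: list of (acoustic_score, text) tuples, where acoustic_score
    is a log-score (higher is better).
    The LM term is the per-word log10 probability of the text.
    Returns the hypothesis with the highest combined score.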
    """
    best_score, best_hyp = float("-inf"), ""
    for acoustic_score, text in hypotheses:
        words = len(text.split()) or 1
        lm_score = lm.score(text, bos=True, eos=True) / words
        combined = acoustic_score + lm_weight * lm_score
        if combined > best_score:
            best_score, best_hyp = combined, text
    return best_hyp

# Example n-best list (acoustic_score, hypothesis)
nbest = [
    (-12.3, "د افغانستان خلک"),
    (-11.9, "د افغانستان خالک"),
    (-13.1, "د افغانستان هلک"),
]
print(rescore(nbest, lm))
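
The function above adds a per-word log10 LM score to an unnormalized acoustic score. Many decoders instead combine total log-probabilities in a common base and add a word-count bonus to offset the LM's preference for short outputs. A hedged sketch of that variant follows. It assumes the acoustic scores are natural-log values; `rescore_total` is a hypothetical name, and `lm_weight` and `word_bonus` are illustrative defaults that would need tuning on development data.

```python
import math

LN10 = math.log(10.0)  # factor for converting log10 scores to natural log

def rescore_total(hypotheses, lm, lm_weight=0.5, word_bonus=0.0):
    """Rank hypotheses by acoustic + weighted LM score + word bonus.

    hypotheses: list of (acoustic_logprob, text) tuples, natural-log scores.
    Returns the hypothesis texts sorted from best to worst.
    """
    scored = []
    for acoustic, text in hypotheses:
        # KenLM returns log10 probabilities; convert to natural log.
        lm_nat = lm.score(text, bos=True, eos=True) * LN10
        n_words = len(text.split())
        combined = acoustic + lm_weight * lm_nat + word_bonus * n_words
        scored.append((combined, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _score, text in scored]
```

Returning the full ranking rather than only the top hypothesis makes it easier to inspect how the weights change the order.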

Corpus filtering / quality ranking

Score a corpus and keep only sentences below a perplexity threshold:

import kenlm

model = kenlm.Model("pashto_5gram.trie")

def score_file(model, input_path, output_path, max_ppl=300):
    kept = 0
    with open(input_path, encoding="utf-8") as fin, \
         open(output_path, "w", encoding="utf-8") as fout:
        for line in fin:
            sentence = line.strip()
            if not sentence:
                continue
            words = len(sentence.split())
            if words < 3:
                continue
            # score() includes </s>, so count it in the denominator.
            score = model.score(sentence, bos=True, eos=True)
            ppl = 10 ** (-score / (words + 1))
            if ppl < max_ppl:
                fout.write(line)
                kept += 1
    return kept
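
The function above applies a fixed threshold. For ranking, a sketch along the following lines sorts sentences by perplexity and keeps the most fluent fraction. `rank_by_perplexity`, `keep_fraction`, and `min_words` are hypothetical names, `model` is assumed to be a loaded `kenlm.Model`, and the perplexity counts the end-of-sentence token in its denominator.

```python
def rank_by_perplexity(model, sentences, keep_fraction=0.8, min_words=3):
    """Return the lowest-perplexity `keep_fraction` of sentences, best first."""
    scored = []
    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words < min_words:
            continue  # too short for a reliable estimate
        log10_prob = model.score(sentence, bos=True, eos=True)
        ppl = 10 ** (-log10_prob / (n_words + 1))  # +1 for </s>
        scored.append((ppl, sentence))
    scored.sort(key=lambda pair: pair[0])
    keep = int(len(scored) * keep_fraction)
    return [sentence for _ppl, sentence in scored[:keep]]
```

This holds all scores in memory, so for a corpus of tens of millions of sentences it would be more practical to write scores to disk and sort them externally.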

Related resources

Corpus: ihanif/pashto-corpus — 1.25B-word Pashto corpus
Encoder: ihanif/xlmr-pashto — XLM-R continued pre-training on PashtoCorp
Code: ihanif/corpus_builder

Citation

If you use this model in your research, please cite:

@misc{rahman2026pashtocorp,
  title        = {PashtoCorp: A 1.25B-Word Corpus, Evaluation Suite, and
                  Reproducible Pipeline for Low-Resource Language Development},
  author       = {Rahman, Hanif},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ihanif/pashto-kenlm-5gram}},
}

