# HAFITH — حافظ · Arabic Manuscript OCR
OCR pipeline for Arabic historical manuscripts. Given a manuscript image it:
- Detects text regions (main body vs. margin) — YOLO
- Segments individual lines — RTMDet instance segmentation
- Recognises text per line — SigLIP2 NaFlex + Qwen3-0.6B (Prefix-LM)
- Corrects OCR errors — Gemini LLM (optional, requires API key)
## Model Files

| File | Description | Size |
|---|---|---|
| `lines.pth` | RTMDet-m line segmentation weights | 242 MB |
| `regions.pt` | YOLO region detection weights | 117 MB |
| `ocr/model.pt` | SigLIP2 + Qwen3-0.6B OCR weights | 3.9 GB |
| `ocr/qwen_tokenizer/` | Qwen3 tokenizer files | — |
| `ocr/siglip_processor/` | SigLIP2 image processor config | — |
| `rtmdet_lines.py` | RTMDet model config | — |
## Architecture

```
Input image
│
├─► YOLO (regions.pt)
│     └─ Bounding boxes: main text body vs. margin
│
├─► RTMDet (lines.pth + rtmdet_lines.py)
│     └─ Instance segmentation masks → line polygons (reading order)
│
└─► Per-line crops
      └─► SigLIP2 NaFlex encoder → Linear(1152→1024) → Qwen3-0.6B decoder
            └─ Arabic text string per line
```
The OCR model is a custom Prefix-LM: visual patch embeddings from SigLIP2 are projected into Qwen3's input embedding space and prepended as a visual prefix, followed by a BOS anchor token. The decoder then autoregressively generates Arabic text tokens.
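The prefix construction can be sketched as follows. This is a minimal NumPy illustration with made-up patch counts and random weights, not the trained model: the only numbers taken from this repo are the 1152→1024 projection dimensions; everything else is hypothetical.

```python
import numpy as np

# Hypothetical sizes for illustration only.
NUM_PATCHES = 4    # visual patches from the encoder (real runs use up to 512)
VISION_DIM = 1152  # SigLIP2 hidden size
TEXT_DIM = 1024    # decoder embedding size

rng = np.random.default_rng(0)

# 1. The vision encoder yields one embedding per image patch.
patch_embeds = rng.standard_normal((NUM_PATCHES, VISION_DIM))

# 2. A linear projection maps vision features into the decoder's
#    input embedding space (the Linear(1152→1024) in the diagram).
W = rng.standard_normal((VISION_DIM, TEXT_DIM)) * 0.02
visual_prefix = patch_embeds @ W  # shape: (NUM_PATCHES, TEXT_DIM)

# 3. A BOS anchor embedding follows the prefix; autoregressive
#    generation of text tokens starts after it.
bos_embed = rng.standard_normal((1, TEXT_DIM))
decoder_input = np.concatenate([visual_prefix, bos_embed], axis=0)

print(decoder_input.shape)  # (5, 1024)
```

The decoder attends to the whole visual prefix bidirectionally-in-effect (it is given, not generated), while text tokens are produced left to right.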
## Requirements

```bash
pip install torch torchvision transformers ultralytics opencv-python-headless \
    Pillow numpy google-genai huggingface_hub

# mmcv must be built from source (no pre-built wheel for torch 2.9 + CUDA 12.8)
git clone --depth=1 --branch v2.1.0 https://github.com/open-mmlab/mmcv.git /opt/mmcv
cd /opt/mmcv && MMCV_WITH_OPS=1 pip install -e . --no-build-isolation
pip install mmdet mmengine
```
## Quick Start

```python
from huggingface_hub import snapshot_download

# Download all model files
model_dir = snapshot_download("mdnaseif/hafith-models")
```
Then run full pipeline inference — see `inference.py`.
## Full Pipeline Inference

```python
import sys
sys.path.insert(0, "path/to/hafith_mvp/app")  # add app/ to Python path

from pipeline import (
    load_lines_model, load_regions_model,
    load_ocr,
    segment, detect_regions, classify_lines_by_region,
    get_line_images, recognise_lines_batch,
)

MODELS_DIR = "path/to/models"  # local snapshot_download() output

# 1. Load models (one-time, ~30–90 s on first run)
lines_model = load_lines_model(
    config_path=f"{MODELS_DIR}/rtmdet_lines.py",
    checkpoint_path=f"{MODELS_DIR}/lines.pth",
    device="cuda",
)
regions_model = load_regions_model(f"{MODELS_DIR}/regions.pt")
ocr_model, processor, tokenizer = load_ocr(f"{MODELS_DIR}/ocr", device="cuda")

# 2. Segment lines
image_bgr, polygons = segment(lines_model, "manuscript.jpg")

# 3. Classify main text vs. margin
region_polys, _ = detect_regions(regions_model, "manuscript.jpg")
main_idx, margin_idx, _ = classify_lines_by_region(polygons, region_polys)

# 4. Crop line images
line_images = get_line_images(image_bgr, polygons)

# 5. OCR — process in reading order (main body first, then margin)
reading_order = list(main_idx) + list(margin_idx)
ordered_images = [line_images[i] for i in reading_order]
texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    ordered_images,
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=8,
)

# 6. Print results
for i, text in enumerate(texts):
    print(f"Line {i + 1}: {text}")

full_text = "\n".join(texts)
print("\n--- Full transcription ---")
print(full_text)
```
## OCR Model Only (no segmentation)
If you already have cropped line images:
```python
from PIL import Image

from pipeline.ocr import load_ocr, recognise_lines_batch

ocr_model, processor, tokenizer = load_ocr("path/to/models/ocr", device="cuda")

# Single line
line_img = Image.open("line.jpg")
texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    [line_img],
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=1,
)
print(texts[0])
```
## Optional: AI Post-Correction (Gemini)
```python
import os
os.environ["GEMINI_API_KEY"] = "your-key"

from pipeline.correction import init_local_llm, correct_full_text_local

corrector = init_local_llm("gemini-2.0-flash")
corrected = correct_full_text_local(corrector, texts)  # texts from the pipeline above
```
## Citation

```bibtex
@misc{hafith2025,
  title  = {HAFITH: Arabic Manuscript OCR Pipeline},
  author = {mdnaseif},
  year   = {2025},
  url    = {https://huggingface.co/mdnaseif/hafith-models}
}
```