HAFITH — حافظ · Arabic Manuscript OCR

OCR pipeline for Arabic historical manuscripts. Given a manuscript image, it:

  1. Detects text regions (main body vs. margin) — YOLO
  2. Segments individual lines — RTMDet instance segmentation
  3. Recognises text per line — SigLIP2 NaFlex + Qwen3-0.6B (Prefix-LM)
  4. Corrects OCR errors — Gemini LLM (optional, requires API key)

Model Files

File                   Description                          Size
lines.pth              RTMDet-m line-segmentation weights   242 MB
regions.pt             YOLO region-detection weights        117 MB
ocr/model.pt           SigLIP2 + Qwen3-0.6B OCR weights     3.9 GB
ocr/qwen_tokenizer/    Qwen3 tokenizer files
ocr/siglip_processor/  SigLIP2 image-processor config
rtmdet_lines.py        RTMDet model config

Architecture

Input image
    │
    ├─► YOLO (regions.pt)
    │       └─ Bounding boxes: main text body vs. margin
    │
    ├─► RTMDet (lines.pth + rtmdet_lines.py)
    │       └─ Instance segmentation masks → line polygons (reading order)
    │
    └─► Per-line crops
            └─► SigLIP2 NaFlex encoder → Linear(1152→1024) → Qwen3-0.6B decoder
                        └─ Arabic text string per line

The OCR model is a custom Prefix-LM: visual patch embeddings from SigLIP2 are projected into Qwen3's input embedding space and prepended as a visual prefix, followed by a BOS anchor token. The decoder then autoregressively generates Arabic text tokens.
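The input assembly described above can be sketched with plain NumPy. This is an illustrative mock, not the repo's code: the dimensions (1152 → 1024) come from the architecture diagram, while the patch count, token count, and the `build_prefix_lm_inputs` helper are assumptions for demonstration.

```python
import numpy as np

# Hidden sizes from the architecture above: SigLIP2 = 1152, Qwen3-0.6B = 1024.
VIS_DIM, TXT_DIM = 1152, 1024
rng = np.random.default_rng(0)

def build_prefix_lm_inputs(patch_embeds, bos_embed, text_embeds, proj_w, proj_b):
    """Assemble the decoder input: [visual prefix | BOS anchor | text tokens]."""
    visual_prefix = patch_embeds @ proj_w + proj_b          # (N, 1024)
    seq = np.concatenate([visual_prefix, bos_embed[None], text_embeds], axis=0)
    # Prefix and BOS only condition the decoder; loss falls on text tokens.
    loss_mask = np.concatenate([
        np.zeros(len(visual_prefix) + 1, dtype=bool),
        np.ones(len(text_embeds), dtype=bool),
    ])
    return seq, loss_mask

patch_embeds = rng.normal(size=(196, VIS_DIM))   # e.g. 196 NaFlex patches
proj_w = rng.normal(size=(VIS_DIM, TXT_DIM)) * 0.01
proj_b = np.zeros(TXT_DIM)
bos_embed = rng.normal(size=TXT_DIM)
text_embeds = rng.normal(size=(12, TXT_DIM))     # e.g. 12 target text tokens

seq, loss_mask = build_prefix_lm_inputs(patch_embeds, bos_embed,
                                        text_embeds, proj_w, proj_b)
print(seq.shape)        # (209, 1024): 196 patches + 1 BOS + 12 text tokens
```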


Requirements

pip install torch torchvision transformers ultralytics opencv-python-headless \
            Pillow numpy google-genai huggingface_hub

# mmcv must be built from source (no pre-built wheel for torch 2.9 + CUDA 12.8)
git clone --depth=1 --branch v2.1.0 https://github.com/open-mmlab/mmcv.git /opt/mmcv
cd /opt/mmcv && MMCV_WITH_OPS=1 pip install -e . --no-build-isolation
pip install mmdet mmengine
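Since the mmcv source build is easy to get wrong, a quick sanity check that every dependency is importable can save a failed pipeline run later. This is a generic helper, not part of the repo:

```python
import importlib.util

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names for the packages listed above (opencv installs as cv2,
# Pillow as PIL).
required = ["torch", "torchvision", "transformers", "ultralytics",
            "cv2", "PIL", "numpy", "mmcv", "mmdet", "mmengine"]
missing = missing_modules(required)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All pipeline dependencies importable.")
```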

Quick Start

from huggingface_hub import snapshot_download

# Download all model files
model_dir = snapshot_download("mdnaseif/hafith-models")

Then run full pipeline inference — see inference.py.
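Before running inference, it can help to confirm the snapshot downloaded completely. The file names below come from the Model Files table; `verify_snapshot` itself is a hypothetical helper, not part of the published repo.

```python
from pathlib import Path

# Top-level entries expected in the snapshot (from the Model Files table).
EXPECTED = ["lines.pth", "regions.pt", "rtmdet_lines.py", "ocr"]

def verify_snapshot(model_dir):
    """Return the expected entries missing from a downloaded snapshot."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).exists()]

# missing = verify_snapshot(model_dir)
# if missing:
#     raise FileNotFoundError(f"Incomplete snapshot, missing: {missing}")
```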


Full Pipeline Inference

import sys
sys.path.insert(0, "path/to/hafith_mvp/app")   # add app/ to Python path

from pipeline import (
    load_lines_model, load_regions_model,
    load_ocr,
    segment, detect_regions, classify_lines_by_region,
    get_line_images, recognise_lines_batch,
)

MODELS_DIR = "path/to/models"   # local snapshot_download() output

# 1. Load models (one-time, ~30–90s on first run)
lines_model = load_lines_model(
    config_path=f"{MODELS_DIR}/rtmdet_lines.py",
    checkpoint_path=f"{MODELS_DIR}/lines.pth",
    device="cuda",
)
regions_model = load_regions_model(f"{MODELS_DIR}/regions.pt")
ocr_model, processor, tokenizer = load_ocr(f"{MODELS_DIR}/ocr", device="cuda")

# 2. Segment lines
image_bgr, polygons = segment(lines_model, "manuscript.jpg")

# 3. Classify main text vs. margin
region_polys, _ = detect_regions(regions_model, "manuscript.jpg")
main_idx, margin_idx, _ = classify_lines_by_region(polygons, region_polys)

# 4. Crop line images
line_images = get_line_images(image_bgr, polygons)

# 5. OCR — process in reading order (main body first, then margin)
reading_order = list(main_idx) + list(margin_idx)
ordered_images = [line_images[i] for i in reading_order]

texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    ordered_images,
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=8,
)

# 6. Print results
for i, text in enumerate(texts, start=1):
    print(f"Line {i}: {text}")

full_text = "\n".join(texts)
print("\n--- Full transcription ---")
print(full_text)
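For downstream use it is often more useful to keep the main-body/margin distinction than a flat string. A possible export step, assuming the `texts`, `reading_order`, and `main_idx` values produced above (`build_page_record` is a hypothetical helper, not part of the repo):

```python
import json

def build_page_record(texts, reading_order, main_idx):
    """Group recognised lines into main-body vs. margin for export."""
    main_set = set(main_idx)
    record = {"main": [], "margin": []}
    for line_idx, text in zip(reading_order, texts):
        key = "main" if line_idx in main_set else "margin"
        record[key].append({"line": line_idx, "text": text})
    return record

# record = build_page_record(texts, reading_order, main_idx)
# Path("manuscript.json").write_text(
#     json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
```

`ensure_ascii=False` matters here: it keeps the Arabic text readable in the JSON file instead of escaping it to `\uXXXX` sequences.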

OCR Model Only (no segmentation)

If you already have cropped line images:

from PIL import Image
from pipeline.ocr import load_ocr, recognise_lines_batch

ocr_model, processor, tokenizer = load_ocr("path/to/models/ocr", device="cuda")

# Single line
line_img = Image.open("line.jpg")
texts = recognise_lines_batch(
    ocr_model, processor, tokenizer,
    [line_img],
    device="cuda",
    max_patches=512,
    max_len=64,
    batch_size=1,
)
print(texts[0])

Optional: AI Post-Correction (Gemini)

import os
os.environ["GEMINI_API_KEY"] = "your-key"

from pipeline.correction import init_local_llm, correct_full_text_local

corrector = init_local_llm("gemini-2.0-flash")
corrected = correct_full_text_local(corrector, texts)
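Since the correction stage is optional, a wrapper that silently falls back to the raw OCR output when no key is configured keeps the pipeline runnable everywhere. This is a suggested pattern, not repo code; it assumes the `pipeline.correction` API shown above:

```python
import os

def maybe_correct(texts):
    """Run Gemini post-correction only when an API key is configured;
    otherwise return the raw OCR output unchanged."""
    if not os.environ.get("GEMINI_API_KEY"):
        return texts  # correction is optional per the pipeline description
    from pipeline.correction import init_local_llm, correct_full_text_local
    corrector = init_local_llm("gemini-2.0-flash")
    return correct_full_text_local(corrector, texts)
```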

Citation

@misc{hafith2025,
  title  = {HAFITH: Arabic Manuscript OCR Pipeline},
  author = {mdnaseif},
  year   = {2025},
  url    = {https://huggingface.co/mdnaseif/hafith-models}
}