GLM-OCR → Core AI

On-device document OCR, running entirely on Apple's Core AI (Neural Engine / GPU). A port of zai-org/GLM-OCR (0.9B, MIT) — a small, SOTA-quality document recognizer (OmniDocBench v1.5 94.62, #1 with its layout pipeline). Prompt it with Text Recognition: / Table Recognition: / Formula Recognition: and get back plain text (reading order), HTML tables (<table>…), or LaTeX. zh / en / fr / es / ru / de / ja / ko.

GLM-OCR is a small OCR variant of GLM-4.V (Glm4v): a CogViT vision tower + a 16-layer GLM text decoder with sectioned 3D M-RoPE. This port reuses the shipped Qwen3-VL vision idiom and GLM text decode — no R-SWA, no MoE, no MLA.

Bundles

dir	what	precision	size
`vision/`	CogViT encoder → `image_embeds [N, 1536]`	fp16	829 MB
`decoder/`	GLM text decoder, S=1 pipelined, M-RoPE + image injection	int8hu (body int8 per-block-32 + untied head absmax)	764 MB
`tokenizer/`	`tokenizer.json` etc.	—	—

The decoder rides three static graph inputs — image_embeds [682,1536] f16, rope_shift_start [1], rope_shift_amount [1] — so the vision tower runs once, its output is injected at the image placeholder positions (V + slot, row-major over the merged grid), and the text decodes on top. N (visual-token count) is fixed at export by the chosen input resolution (here 682 = a 22×31 merged grid); resize the page to that grid host-side.

Verified (M4 Max, GPU, Core AI pipelined engine)

End-to-end real generation on the engine: 40/40 tokens identical to the fp32 HF reference — a synthetic document read verbatim ("Quarterly Report / On-device inference shipped across all product lines this quarter…"), ~375 tok/s decode.
Torch ladder vs HF: decoder logits cos 1.000020, vision image_embeds cos 1.000061, full-VLM argmax 694/694.
Engine gate: vision image_embeds cos 0.9998; decoder argmax exact over the sampled positions.
int8hu vs fp16: 7 / 694 argmax flips, all at visual-token positions (0 in the text region), the generation-driving position exact — i.e. the OCR text is preserved.

Run it

The decoder is a standard Core AI pipelined LLM bundle with three multimodal static inputs. Drive it with the pipelined engine (S=1, COREAI_CHUNK_THRESHOLD=1; feed the prompt with the image placeholders rewritten to V+slot, bind image_embeds from the vision tower, set rope_shift_start = img_start + N, rope_shift_amount = N − max(gh, gw)). The full conversion recipe and the host contract (with the exact static-input values) are in the Core AI model zoo — conversion/export_glm_ocr_pipelined.py, zoo/glm-ocr.md, knowledge/glm-ocr-port.md.

Scope / honesty

This is the recognition model: per-prompt text / table / formula. Whole-page auto-structuring (the 94.62 full-pipeline number) additionally needs a layout detector (PP-DocLayoutV3) that is not part of this port.
int4 is not shipped (weight-only int4 without QAT risks a quality cliff on a 0.9B model).

License

MIT (inherited from zai-org/GLM-OCR). Community port — not affiliated with Apple or Z.ai.

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for mlboydaisuke/GLM-OCR-CoreAI

Base model

zai-org/GLM-OCR

Finetuned

(27)

this model