GLM-OCR: Core AI
On-device document OCR, running entirely on Apple's Core AI (Neural Engine / GPU).
A port of zai-org/GLM-OCR (0.9B, MIT), a small,
SOTA-quality document recognizer (OmniDocBench v1.5 94.62, #1 with its layout pipeline).
Prompt it with Text Recognition: / Table Recognition: / Formula Recognition: and get back
plain text (reading order), HTML tables (<table>…), or LaTeX. Supported languages: zh / en / fr / es / ru / de / ja / ko.
GLM-OCR is a small OCR variant of GLM-4.V (Glm4v): a CogViT vision tower + a 16-layer GLM text
decoder with sectioned 3D M-RoPE. This port reuses the shipped Qwen3-VL vision idiom and GLM text
decode: no R-SWA, no MoE, no MLA.
Bundles
| dir | what | precision | size |
|---|---|---|---|
| vision/ | CogViT encoder → image_embeds [N, 1536] | fp16 | 829 MB |
| decoder/ | GLM text decoder, S=1 pipelined, M-RoPE + image injection | int8hu (body int8 per-block-32 + untied head absmax) | 764 MB |
| tokenizer/ | tokenizer.json etc. | - | - |
The decoder rides three static graph inputs (image_embeds [682,1536] f16, rope_shift_start [1],
rope_shift_amount [1]), so the vision tower runs once, its output is injected at the image
placeholder positions (V + slot, row-major over the merged grid), and the text decodes on top.
N (visual-token count) is fixed at export by the chosen input resolution (here 682 = a 22×31 merged
grid); resize the page to that grid host-side.
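The placeholder rewrite described above can be sketched in a few lines of Python. This is a minimal illustration, not the port's actual host code: `placeholder_id`, `vocab_size` (the V in "V + slot"), and the function names are hypothetical, and reading 22×31 as 22 rows by 31 columns is an assumption (N is the same either way).

```python
from typing import List

# Assumed orientation of the merged grid: 22 rows x 31 columns.
GRID_H, GRID_W = 22, 31
N = GRID_H * GRID_W  # visual-token count fixed at export


def slot_of(row: int, col: int, grid_w: int) -> int:
    """Row-major slot index of one merged-grid cell."""
    return row * grid_w + col


def rewrite_placeholders(token_ids: List[int], placeholder_id: int,
                         vocab_size: int, n_visual: int) -> List[int]:
    """Replace the k-th image placeholder token with vocab_size + k.

    placeholder_id and vocab_size are hypothetical parameters; take the
    real values from the tokenizer and the host contract.
    """
    out: List[int] = []
    slot = 0
    for tok in token_ids:
        if tok == placeholder_id:
            out.append(vocab_size + slot)  # "V + slot"
            slot += 1
        else:
            out.append(tok)
    if slot != n_visual:
        raise ValueError(f"expected {n_visual} placeholders, found {slot}")
    return out
```

Because the slots are assigned in prompt order, row `slot` of image_embeds lines up with the cell at `(slot // GRID_W, slot % GRID_W)` under the row-major assumption.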
Verified (M4 Max, GPU, Core AI pipelined engine)
- End-to-end real generation on the engine: 40/40 tokens identical to the fp32 HF reference on a synthetic document read verbatim ("Quarterly Report / On-device inference shipped across all product lines this quarter…"), ~375 tok/s decode.
- Torch ladder vs HF: decoder logits cos 1.000020, vision image_embeds cos 1.000061, full-VLM argmax 694/694.
- Engine gate: vision image_embeds cos 0.9998; decoder argmax exact over the sampled positions.
- int8hu vs fp16: 7 / 694 argmax flips, all at visual-token positions (0 in the text region), the generation-driving position exact, i.e. the OCR text is preserved.
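The two metrics quoted in this list, cosine similarity and per-position argmax agreement, can be computed along the following lines. This is a pure-Python sketch for illustration only; the function names are hypothetical and the port's own verification scripts may be organized differently.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened tensors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def argmax_flips(ref_logits: Sequence[Sequence[float]],
                 test_logits: Sequence[Sequence[float]]) -> List[int]:
    """Positions where the top-1 token differs between two logit tables."""
    flips = []
    for pos, (ref, test) in enumerate(zip(ref_logits, test_logits)):
        ref_top = max(range(len(ref)), key=ref.__getitem__)
        test_top = max(range(len(test)), key=test.__getitem__)
        if ref_top != test_top:
            flips.append(pos)
    return flips
```

A cosine slightly above 1, as in the Torch ladder figures, is a floating-point rounding effect rather than a real value above the mathematical bound. Comparing the returned flip positions against the image span is one way to check that none fall in the text region.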
Run it
The decoder is a standard Core AI pipelined LLM bundle with three multimodal static inputs. Drive it
with the pipelined engine (S=1, COREAI_CHUNK_THRESHOLD=1): feed the prompt with the image
placeholders rewritten to V+slot, bind image_embeds from the vision tower, and set
rope_shift_start = img_start + N and rope_shift_amount = N - max(gh, gw). The full conversion recipe
and the host contract (with the exact static-input values) are in the
Core AI model zoo:
conversion/export_glm_ocr_pipelined.py, zoo/glm-ocr.md, knowledge/glm-ocr-port.md.
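As a worked example of the two RoPE scalars, the sketch below computes them for the shipped 682-token grid. It assumes the operator in the rope_shift_amount formula is a subtraction, which matches the usual M-RoPE behavior where an image advances the position counter by max(gh, gw) rather than by N. The function name and the example `img_start` are hypothetical; the exact values belong to the host contract in the model zoo.

```python
from typing import Tuple


def rope_shift_inputs(img_start: int, grid_h: int, grid_w: int) -> Tuple[int, int]:
    """Return (rope_shift_start, rope_shift_amount) for one image.

    img_start is the index of the first image placeholder in the prompt.
    Assumes rope_shift_amount = N - max(gh, gw), with N = gh * gw.
    """
    n_visual = grid_h * grid_w
    rope_shift_start = img_start + n_visual               # first token after the image
    rope_shift_amount = n_visual - max(grid_h, grid_w)    # positions saved by M-RoPE
    return rope_shift_start, rope_shift_amount


# Hypothetical prompt with 4 tokens before the image and a 22x31 merged grid.
start, amount = rope_shift_inputs(img_start=4, grid_h=22, grid_w=31)
print(start, amount)  # 686 651
```

Both values are single-element inputs, so the host would wrap each one in a length-1 array before binding it to the graph.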
Scope / honesty
- This is the recognition model: per-prompt text / table / formula. Whole-page auto-structuring (the 94.62 full-pipeline number) additionally needs a layout detector (PP-DocLayoutV3) that is not part of this port.
- int4 is not shipped (weight-only int4 without QAT risks a quality cliff on a 0.9B model).
License
MIT (inherited from zai-org/GLM-OCR). Community port, not affiliated with Apple or Z.ai.