GLM-OCR → Core AI

On-device document OCR, running entirely on Apple's Core AI (Neural Engine / GPU). A port of zai-org/GLM-OCR (0.9B, MIT), a small, SOTA-quality document recognizer (OmniDocBench v1.5 94.62, #1 with its layout pipeline). Prompt it with Text Recognition: / Table Recognition: / Formula Recognition: and get back plain text (reading order), HTML tables (<table>…), or LaTeX. Supported languages: zh / en / fr / es / ru / de / ja / ko.

GLM-OCR is a small OCR variant of GLM-4.V (Glm4v): a CogViT vision tower + a 16-layer GLM text decoder with sectioned 3D M-RoPE. This port reuses the shipped Qwen3-VL vision idiom and GLM text decode: no R-SWA, no MoE, no MLA.

Bundles

| dir | what | precision | size |
| --- | --- | --- | --- |
| vision/ | CogViT encoder → image_embeds [N, 1536] | fp16 | 829 MB |
| decoder/ | GLM text decoder, S=1 pipelined, M-RoPE + image injection | int8hu (body int8 per-block-32 + untied head absmax) | 764 MB |
| tokenizer/ | tokenizer.json etc. | - | - |
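
For readers unfamiliar with the "int8 per-block-32" and "absmax" terms in the precision column, the following is a minimal sketch of generic symmetric absmax quantization over blocks of 32 weights. It is not the exporter's actual int8hu recipe, which is not reproduced here; the function names and the flat-list weight layout are assumptions made for illustration.

```python
def quantize_block_absmax(block):
    """Symmetric int8 quantization of one block using its absolute maximum."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        # All-zero block: any scale works, use 1.0 to avoid dividing by zero.
        return [0] * len(block), 1.0
    scale = absmax / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in block]
    return q, scale


def quantize_per_block(weights, block_size=32):
    """Split a flat weight list into blocks and quantize each independently."""
    q_all, scales = [], []
    for start in range(0, len(weights), block_size):
        q, scale = quantize_block_absmax(weights[start:start + block_size])
        q_all.extend(q)
        scales.append(scale)
    return q_all, scales


def dequantize_per_block(q_all, scales, block_size=32):
    """Reconstruct approximate float weights from int8 values and block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(q_all)]
```

Each block stores its own scale, so one outlier weight only coarsens the 32 values around it rather than the whole tensor.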

The decoder takes three static graph inputs (image_embeds [682,1536] f16, rope_shift_start [1], rope_shift_amount [1]), so the vision tower runs once, its output is injected at the image placeholder positions (V + slot, row-major over the merged grid), and the text decodes on top. N (the visual-token count) is fixed at export by the chosen input resolution (here 682 = a 22×31 merged grid); resize the page to that grid host-side.
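
The placeholder rewrite can be sketched in a few lines of host-side Python. This is an illustration under assumptions, not the port's host code: it assumes V is the text vocabulary size, that all image placeholders share a single token id, and that slots are numbered in the order the placeholders appear. The names `rewrite_image_placeholders`, `slot_to_grid`, `image_token_id` and `vocab_size` are hypothetical.

```python
def rewrite_image_placeholders(token_ids, image_token_id, vocab_size, num_visual_tokens):
    """Replace the k-th image placeholder with vocab_size + k (assumed V + slot scheme)."""
    out = []
    slot = 0
    for tok in token_ids:
        if tok == image_token_id:
            out.append(vocab_size + slot)
            slot += 1
        else:
            out.append(tok)
    if slot != num_visual_tokens:
        # The graph is exported for a fixed N, so the prompt must match it exactly.
        raise ValueError(f"expected {num_visual_tokens} placeholders, found {slot}")
    return out


def slot_to_grid(slot, grid_w):
    """Map a row-major slot index to (row, col) in the merged grid."""
    return divmod(slot, grid_w)
```

For the shipped bundle, `num_visual_tokens` would be 682 (22 × 31); the check fails early if the prompt was built for a different resolution.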

Verified (M4 Max, GPU, Core AI pipelined engine)

  • End-to-end real generation on the engine: 40/40 tokens identical to the fp32 HF reference. A synthetic document is read verbatim ("Quarterly Report / On-device inference shipped across all product lines this quarter…"), at ~375 tok/s decode.
  • Torch ladder vs HF: decoder logits cos 1.000020, vision image_embeds cos 1.000061, full-VLM argmax 694/694.
  • Engine gate: vision image_embeds cos 0.9998; decoder argmax exact over the sampled positions.
  • int8hu vs fp16: 7 / 694 argmax flips, all at visual-token positions (0 in the text region), and the generation-driving position is exact, i.e. the OCR text is preserved.
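
The two metrics used in the list above, cosine similarity and argmax agreement, can be computed as in this minimal sketch. It uses plain Python lists for clarity and is not the verification harness behind the reported numbers; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def argmax(row):
    """Index of the largest value in one logits row."""
    return max(range(len(row)), key=row.__getitem__)


def argmax_flips(ref_logits, test_logits):
    """Sequence positions where the two models pick a different top token."""
    return [
        pos
        for pos, (ref_row, test_row) in enumerate(zip(ref_logits, test_logits))
        if argmax(ref_row) != argmax(test_row)
    ]
```

Comparing the flip positions against the range of visual-token indices shows whether any flip falls in the text region, which is the distinction the last bullet draws.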

Run it

The decoder is a standard Core AI pipelined LLM bundle with three multimodal static inputs. Drive it with the pipelined engine (S=1, COREAI_CHUNK_THRESHOLD=1): feed the prompt with the image placeholders rewritten to V+slot, bind image_embeds from the vision tower, and set rope_shift_start = img_start + N and rope_shift_amount = N − max(gh, gw). The full conversion recipe and the host contract (with the exact static-input values) are in the Core AI model zoo: conversion/export_glm_ocr_pipelined.py, zoo/glm-ocr.md, knowledge/glm-ocr-port.md.
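
The two scalar inputs follow directly from the formulas above. The sketch below computes them and, as a labeled assumption, shows how such a shift would typically be applied in M-RoPE: tokens at or after rope_shift_start get their sequence index reduced by rope_shift_amount. It assumes img_start is the sequence index of the first image placeholder and gh, gw are the merged grid height and width; the host contract in the model zoo is the authority on the exact values.

```python
def rope_shift_inputs(img_start, grid_h, grid_w):
    """Compute (rope_shift_start, rope_shift_amount) from the formulas in the text."""
    n = grid_h * grid_w  # visual-token count N
    rope_shift_start = img_start + n
    rope_shift_amount = n - max(grid_h, grid_w)
    return rope_shift_start, rope_shift_amount


def assumed_text_position(seq_index, rope_shift_start, rope_shift_amount):
    """ASSUMPTION: positions at or after the image block are shifted back.

    The image occupies N sequence slots but only max(gh, gw) position steps,
    so later text resumes at img_start + max(gh, gw).
    """
    if seq_index >= rope_shift_start:
        return seq_index - rope_shift_amount
    return seq_index


# Example with the shipped 22 x 31 grid and a hypothetical img_start of 5.
start, amount = rope_shift_inputs(img_start=5, grid_h=22, grid_w=31)
```

With these example values, `start` is 687 and `amount` is 651, so the first text token after the image lands at position 36 (5 + 31) under the assumed interpretation.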

Scope / honesty

  • This is the recognition model: per-prompt text / table / formula. Whole-page auto-structuring (the 94.62 full-pipeline number) additionally needs a layout detector (PP-DocLayoutV3) that is not part of this port.
  • int4 is not shipped (weight-only int4 without QAT risks a quality cliff on a 0.9B model).

License

MIT (inherited from zai-org/GLM-OCR). Community port, not affiliated with Apple or Z.ai.
