ColModernVBERT - Core AI
The zoo's first visual document retriever and first late-interaction (ColBERT / MaxSim)
multi-vector model, running as static .aimodel graphs on Apple Silicon (Mac GPU / iPhone).
A Core AI port of ModernVBERT/colmodernvbert
(MIT), a compact 250M-parameter visual document retriever: a ModernBERT-150M bidirectional text
encoder + SigLIP2 vision encoder (pixel-shuffle ×4) with a custom_text_proj head that
emits a per-token L2-normalized 128-d multi-vector. Retrieval is late interaction: you
encode a text query and a page image into token-level vectors and score them with MaxSim
(score = Σ_q max_d ⟨E_q, E_d⟩). No OCR: the page is matched as a picture, so tables, charts
and complex layouts are first-class.
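The MaxSim formula above can be sketched in a few lines of plain Swift. This is a minimal illustration, not the kit's implementation: the function name `maxSim` and the `[[Float]]` layout (one row per token) are assumptions made for the example, and the toy vectors are 2-d instead of 128-d.

```swift
import Foundation

/// MaxSim late-interaction score: for every query token, take the best-matching
/// document token (largest dot product), then sum those maxima over the query.
/// Both inputs are [tokens][dim] arrays; the vectors are assumed L2-normalized.
func maxSim(query: [[Float]], document: [[Float]]) -> Float {
    guard !document.isEmpty else { return 0 }
    var total: Float = 0
    for q in query {
        var best = -Float.infinity
        for d in document {
            precondition(q.count == d.count, "embedding dimensions must match")
            var dot: Float = 0
            for i in 0..<q.count { dot += q[i] * d[i] }
            best = max(best, dot)
        }
        total += best
    }
    return total
}

// Toy 2-d example: each query token has an exact match among the page tokens.
let queryTokens: [[Float]] = [[1, 0], [0, 1]]
let pageTokens: [[Float]] = [[1, 0], [0, 1], [0.6, 0.8]]
let toyScore = maxSim(query: queryTokens, document: pageTokens)
print(toyScore) // 2.0
```

Because the vectors are unit-length, each per-token maximum is at most 1, so the score is bounded by the number of query tokens.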
This completes the on-device RAG trifecta alongside the text Qwen3-Embedding (text→text dense) and Qwen3-Reranker (cross-encoder): embed → rerank → visual-retrieval, all on device.
Use it
▶️ Run it (source): the DocSearch runner (visual page search over bundled sample pages; the iPhone GUI adds tiled where-it-matched highlights):
```shell
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/DocSearch/DocSearch.xcodeproj
# Run, then pick "ColModernVBERT" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/DocSearch
swift run docsearch-cli --model colmodernvbert --query "monthly revenue trend"
```
💻 Build with it: complete; the glue is kit API, so a copy-paste runs:
```swift
import CoreAIKitEmbeddings

// `pages` (page image URLs) and `query` (the search text) come from the caller.
let retriever = try await VisualDocumentRetriever(
    catalog: "colmodernvbert")
var corpus: [VisualDocumentRetriever.PageEmbedding] = []
for url in pages {
    corpus.append(try await retriever.encode(page: ImageFile.load(url).cgImage))
}
let hits = try await retriever.retrieve(query: query, over: corpus, topK: pages.count)
// hits: pages ranked by MaxSim, best match first; no OCR, pages are matched as pictures
```
The take-home is Examples/DocSearch/Sources/QuickStart.swift:
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same VisualDocumentRetriever(catalog:) with tiled per-page encoding.
Encode your corpus once and keep the PageEmbeddings; scoring a query is then host-side
MaxSim, with no model call per page. encodeTiled(page:) localizes where a query matched.
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit, product CoreAIKitEmbeddings
- Info.plist: NSPhotoLibraryUsageDescription, only if you use PhotosPicker to import pages
- Entitlements: none needed
- First run downloads the model (0.7 GB on Mac, 0.7 GB on iPhone), then it loads from the local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
Two encoders (two graphs)
| graph | input | output | fp16 size |
|---|---|---|---|
| query | input_ids [1,32] i32, attention_mask [1,32] i32 | query_embeddings [1,32,128] | 298 MB |
| doc | pixel_values [1,1,3,512,512], pixel_attention_mask [1,1,512,512] i32 | doc_embeddings [1,89,128] | 407 MB |
Both are single bidirectional forward passes: no KV cache, no generation. The per-token L2-norm and
the attention_mask masking are baked in-graph; MaxSim runs on the host (a tiny matmul +
max + sum). Each bundle directory holds one *.aimodel, plus a tokenizer/ folder on the query side.
- query: right-pad the tokenized query to the 32-token grid (queries are short, so ModernBERT's sliding window (128) sees the full sequence, which is equivalent to full attention). Slice to the real token count before MaxSim.
- doc: a single 512×512 tile ("global image") layout. The text template (CLS + image markers + 64 <image> placeholders + SEP) is baked as a graph constant, so the only runtime inputs are the pixels. Preprocess the page like Idefics3: resize so the longest edge ≤ 512, pad to 512×512, rescale ×1/255, normalize with mean/std = 0.5, and build the pixel_attention_mask (1 for real pixels, 0 for padding).
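The two input recipes above can be sketched on plain arrays. This is a hedged illustration, not the kit's code: `padQuery`, `fittedSize` and `makeTile` are hypothetical names, the pad token id of 0 is a placeholder, and the choices to only downscale, to anchor the page at the top-left, and to fill padding with 0 are assumptions that the source does not confirm. The image resize itself (e.g. via Core Graphics) is left out.

```swift
import Foundation

// Query side: right-pad token ids to the static 32-token grid.
// `padTokenID` is a placeholder assumption; take the real id from the tokenizer.
func padQuery(_ ids: [Int32], gridLength: Int = 32, padTokenID: Int32 = 0)
    -> (inputIDs: [Int32], attentionMask: [Int32], realCount: Int) {
    let real = Array(ids.prefix(gridLength))   // over-long queries are truncated (assumption)
    let padCount = gridLength - real.count
    let inputIDs = real + [Int32](repeating: padTokenID, count: padCount)
    let mask = [Int32](repeating: 1, count: real.count) + [Int32](repeating: 0, count: padCount)
    return (inputIDs: inputIDs, attentionMask: mask, realCount: real.count)
}

// Doc side, step 1: size after scaling so the longest edge is at most `tile`.
// This sketch only downscales; the source does not say whether small pages are upscaled.
func fittedSize(width: Int, height: Int, tile: Int = 512) -> (width: Int, height: Int) {
    let scale = min(1.0, Double(tile) / Double(max(width, height)))
    let w = max(1, Int((Double(width) * scale).rounded()))
    let h = max(1, Int((Double(height) * scale).rounded()))
    return (width: min(w, tile), height: min(h, tile))
}

// Doc side, step 2: already-resized interleaved RGB bytes -> planar [3, tile, tile]
// floats plus a [tile, tile] mask. Top-left placement and zero padding are assumptions.
func makeTile(rgb: [UInt8], width: Int, height: Int, tile: Int = 512)
    -> (pixelValues: [Float], pixelAttentionMask: [Int32]) {
    precondition(rgb.count == width * height * 3, "expected interleaved RGB")
    precondition(width <= tile && height <= tile, "resize the page first")
    var pixels = [Float](repeating: 0, count: 3 * tile * tile)
    var mask = [Int32](repeating: 0, count: tile * tile)
    for y in 0..<height {
        for x in 0..<width {
            mask[y * tile + x] = 1                                  // real pixel
            for c in 0..<3 {
                let v = Float(rgb[(y * width + x) * 3 + c]) / 255   // rescale by 1/255
                pixels[c * tile * tile + y * tile + x] = (v - 0.5) / 0.5   // mean/std = 0.5
            }
        }
    }
    return (pixelValues: pixels, pixelAttentionMask: mask)
}

let padded = padQuery([101, 7, 42])                     // made-up token ids
let pageSize = fittedSize(width: 1700, height: 2200)    // letter page at 200 dpi
let tiny = makeTile(rgb: [255, 255, 255, 0, 0, 0], width: 2, height: 1, tile: 4)
```

After the query forward pass, keeping only the first `padded.realCount` rows of the output (for example with `Array(queryEmbeddings.prefix(padded.realCount))`, where `queryEmbeddings` is a hypothetical `[[Float]]`) is the slice-before-MaxSim step.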
Single-tile v1. This release ships the single 512px global-image document path: lightweight, iPhone-friendly, and accurate on typical pages. The model's full high-resolution mode (split a page into multiple 512px tiles + the global image, 800+ doc tokens) is a planned follow-up for dense small-print documents.
Repo layout
```
query/       colmodernvbert-query_float16_s32_static.aimodel + tokenizer/   (298 MB, fp16, for iPhone)
doc/         colmodernvbert-doc_float16_s89_static.aimodel                  (407 MB, fp16, for iPhone)
fp32/query/  colmodernvbert-query_float32_s32_static.aimodel + tokenizer/   (595 MB, for Mac)
fp32/doc/    colmodernvbert-doc_float32_s89_static.aimodel                  (813 MB, for Mac)
README.md · reference_query.json · reference_doc.json · test_doc.png
```
Each query/ and doc/ directory is a complete bundle root (one .aimodel, plus tokenizer/
on the query side). fp16 ships for iPhone (~705 MB for both encoders); fp32 is for Mac / max
precision.
On-device (CoreAIKit)
```swift
import CoreAIKitEmbeddings

// Downloads query/ + doc/ (fp16) from this repo, or uses a sideloaded copy if present.
let retriever = try await VisualDocumentRetriever() // .colModernVBERTQuery / .colModernVBERTDoc

// Encode a page as tiles (reliable spatial grounding), rank queries, and locate the match.
// `cgImage` is the page image supplied by the caller.
let page = try await retriever.encodeTiled(page: cgImage, rows: 6, cols: 4)
let q = try await retriever.encode(query: "total revenue in the third quarter")
let score = retriever.score(query: q, tiledPage: page)   // MaxSim, page ranking
let rect = retriever.bestTile(query: q, tiledPage: page) // normalized region to highlight
```
See Examples/DocSearch
for a full iPhone demo (bundled + imported documents, query → ranked pages → highlighted region).
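To make the "normalized region to highlight" idea concrete, here is a minimal sketch of how a best-scoring tile in a rows × cols grid could map to page coordinates. It does not describe how the kit's bestTile works internally: `NormalizedRect` and `bestTileRect` are hypothetical names, and the row-major tile order and top-left origin are assumptions.

```swift
import Foundation

/// Normalized rectangle in page coordinates (origin top-left, values in 0...1).
struct NormalizedRect: Equatable {
    var x: Double, y: Double, width: Double, height: Double
}

/// Returns the normalized region of the highest-scoring tile in a rows x cols grid.
/// `tileScores` holds one MaxSim score per tile, in row-major order (an assumption).
func bestTileRect(tileScores: [Float], rows: Int, cols: Int) -> NormalizedRect? {
    guard rows > 0, cols > 0, tileScores.count == rows * cols,
          let best = tileScores.indices.max(by: { tileScores[$0] < tileScores[$1] })
    else { return nil }
    let row = best / cols, col = best % cols
    return NormalizedRect(x: Double(col) / Double(cols),
                          y: Double(row) / Double(rows),
                          width: 1 / Double(cols),
                          height: 1 / Double(rows))
}

// 6 x 4 grid; tile 9 (row 2, column 1) scores highest.
var scores = [Float](repeating: 0.1, count: 24)
scores[9] = 0.9
let region = bestTileRect(tileScores: scores, rows: 6, cols: 4)
```

Multiplying the rectangle by the displayed page size gives the highlight frame in view coordinates.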
Parity (Core AI engine vs. PyTorch reference, M4 Max GPU)
Per-token cosine of the 128-d multi-vectors against the colpali_engine PyTorch model:
| encoder | float32 | float16 |
|---|---|---|
| query | min/mean 1.000000 | min 0.999997 / mean 0.999999 |
| doc | min/mean 1.000000 | min 0.999994 / mean 0.999998 |
End-to-end retrieval: the host MaxSim reproduces processor.score exactly (max |Δ| = 0.0000),
the engine ranking matches the PyTorch ranking on every clear-margin query, and the single-tile
engine retrieves the intended page 3/3 on a rendered-text corpus.
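The per-token cosine metric in the parity table can be reproduced with a short helper. This is a sketch under assumptions: `cosine` and `parity` are hypothetical names, both inputs are taken to be `[tokens][dim]` arrays that have already been loaded, and no particular format is assumed for the reference JSON files.

```swift
import Foundation

/// Cosine similarity between two vectors of equal length.
/// Zero vectors would produce NaN; L2-normalized embeddings never hit that case.
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

/// Per-token cosine between engine output and a reference, summarized as min / mean.
func parity(engine: [[Float]], reference: [[Float]]) -> (min: Float, mean: Float) {
    precondition(engine.count == reference.count && !engine.isEmpty,
                 "token counts must match and be non-zero")
    let cosines = zip(engine, reference).map { cosine($0, $1) }
    return (min: cosines.min()!, mean: cosines.reduce(0, +) / Float(cosines.count))
}

// Toy check: identical first token, orthogonal second token.
let report = parity(engine: [[1, 0], [0, 1]], reference: [[1, 0], [1, 0]])
print(report.min, report.mean) // 0.0 0.5
```

Reporting the minimum alongside the mean matters because a single badly drifting token can hide behind a near-perfect average.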
License
MIT, inherited from ModernVBERT/colmodernvbert.
See the upstream model and paper ModernVBERT: Towards Smaller Visual Document Retrievers
(arXiv:2510.01149).