ColModernVBERT - Core AI

The zoo's first visual document retriever and first late-interaction (ColBERT / MaxSim) multi-vector model, running as static .aimodel graphs on Apple Silicon (Mac GPU / iPhone). A Core AI port of ModernVBERT/colmodernvbert (MIT), a compact 250M visual document retriever: a ModernBERT-150M bidirectional text encoder + SigLIP2 vision encoder (pixel-shuffle ×4) with a custom_text_proj head that emits a per-token L2-normalized 128-d multi-vector. Retrieval is late interaction: you encode a text query and a page image into token-level vectors and score them with MaxSim (score = Σ_q max_d ⟨E_q, E_d⟩). No OCR: the page is matched as a picture, so tables, charts and complex layouts are first-class.

This completes the on-device RAG trifecta alongside the text Qwen3-Embedding (text→text dense) and Qwen3-Reranker (cross-encoder): embed → rerank → visual-retrieval, all on device.

Use it

▶️ Run it (source) with the DocSearch runner (visual page search over bundled sample pages; the iPhone GUI adds tiled where-it-matched highlights):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/DocSearch/DocSearch.xcodeproj
# → Run, then pick "ColModernVBERT" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/DocSearch
swift run docsearch-cli --model colmodernvbert --query "monthly revenue trend"

💻 Build with it (complete; the glue is kit API, so copy-paste runs):

import CoreAIKitEmbeddings

let retriever = try await VisualDocumentRetriever(
    catalog: "colmodernvbert")
var corpus: [VisualDocumentRetriever.PageEmbedding] = []
for url in pages {
    corpus.append(try await retriever.encode(page: ImageFile.load(url).cgImage))
}
let hits = try await retriever.retrieve(query: query, over: corpus, topK: pages.count)
// hits: pages ranked by MaxSim, best match first; no OCR, pages are matched as pictures

The take-home is Examples/DocSearch/Sources/QuickStart.swift: this exact code as one typed function, no UI. The CLI is an argument shell over it, and the GUI drives the same VisualDocumentRetriever(catalog:) with tiled per-page encoding. Encode your corpus once and keep the PageEmbeddings; scoring a query is then host-side MaxSim, with no model call per page. encodeTiled(page:) localizes where a query matched.

Integration checklist

  • SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKitEmbeddings
  • Info.plist: NSPhotoLibraryUsageDescription, only if you use PhotosPicker to import pages
  • Entitlements: none needed
  • First run downloads the model (0.7 GB on Mac, 0.7 GB on iPhone), then it loads from the local cache (Application Support; progress via the downloadProgress callback)
  • Measure in Release: Debug is ~3× slower on per-token host work

Two encoders (two graphs)

graph | input | output | fp16 size
query | input_ids [1,32] i32, attention_mask [1,32] i32 | query_embeddings [1,32,128] | 298 MB
doc | pixel_values [1,1,3,512,512], pixel_attention_mask [1,1,512,512] i32 | doc_embeddings [1,89,128] | 407 MB

Both are single bidirectional forward passes: no KV cache, no generation. The per-token L2-norm and the attention_mask masking are baked in-graph; MaxSim runs on the host (a tiny matmul + max + sum). Each bundle directory holds one *.aimodel plus a tokenizer/ folder.
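The host-side MaxSim step can be sketched in a few lines. This is a minimal illustration under assumptions, not the kit's implementation: maxSim and sliceToRealTokens are hypothetical helper names, embeddings are plain [[Float]] arrays, and the toy vectors are 2-d rather than 128-d.

```swift
/// Late-interaction score: for each query token, take its best dot product
/// against any document token, then sum those maxima over the query tokens.
/// Both inputs are [tokens][dim] and assumed already L2-normalized.
func maxSim(query: [[Float]], doc: [[Float]]) -> Float {
    guard !doc.isEmpty else { return 0 }
    var total: Float = 0
    for q in query {
        var best = -Float.greatestFiniteMagnitude
        for d in doc {
            var dot: Float = 0
            for i in 0..<min(q.count, d.count) { dot += q[i] * d[i] }
            best = max(best, dot)
        }
        total += best
    }
    return total
}

/// Hypothetical helper: keep only rows whose attention-mask entry is 1,
/// which drops the right padding of the 32-token grid.
func sliceToRealTokens(_ embeddings: [[Float]], mask: [Int32]) -> [[Float]] {
    zip(embeddings, mask).filter { $0.1 == 1 }.map { $0.0 }
}

// Toy example: two real query tokens plus one padding row.
let queryRows: [[Float]] = [[1, 0], [0, 1], [0, 0]]
let queryMask: [Int32] = [1, 1, 0]
let docRows: [[Float]] = [[1, 0], [0.6, 0.8]]
let realQuery = sliceToRealTokens(queryRows, mask: queryMask)
let score = maxSim(query: realQuery, doc: docRows)
```

Here the first query token matches the first document token exactly (1.0) and the second matches the second document token best (0.8), so the score is their sum.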

  • query: right-pad the tokenized query to the 32-token grid (queries are short, so ModernBERT's 128-token sliding window covers the whole sequence and attention is effectively full). Slice to the real token count before MaxSim.
  • doc: a single 512×512 tile ("global image") layout. The text template (CLS + image markers + 64 <image> placeholders + SEP) is baked in as a graph constant, so the only runtime inputs are the pixels. Preprocess the page like Idefics3: resize so the longest edge ≤ 512, pad to 512×512, rescale ×1/255, normalize with mean/std = 0.5, and build the pixel_attention_mask (1 for real pixels, 0 for padding).
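The arithmetic behind the doc preprocessing can be sketched as follows. This is an illustration under assumptions, not the kit's code: fittedSize, normalize and attentionMask are hypothetical names, the sketch only downscales (whether smaller pages are upscaled is not stated here), and it assumes the image sits in the top-left corner with padding on the right and bottom.

```swift
/// Size after scaling so the longest edge is at most `tile` pixels,
/// keeping the aspect ratio. Assumption: smaller images are left as-is.
func fittedSize(width: Int, height: Int, tile: Int = 512) -> (width: Int, height: Int) {
    let longest = max(width, height)
    guard longest > tile else { return (width: width, height: height) }
    let scale = Double(tile) / Double(longest)
    let w = max(1, Int((Double(width) * scale).rounded()))
    let h = max(1, Int((Double(height) * scale).rounded()))
    return (width: w, height: h)
}

/// Rescale an 8-bit channel value by 1/255, then normalize with
/// mean = std = 0.5, which maps 0...255 onto -1...1.
func normalize(_ value: UInt8) -> Float {
    (Float(value) / 255 - 0.5) / 0.5
}

/// pixel_attention_mask for one tile: 1 over the real image, 0 over padding.
func attentionMask(width: Int, height: Int, tile: Int = 512) -> [[Int32]] {
    (0..<tile).map { y -> [Int32] in
        (0..<tile).map { x -> Int32 in (x < width && y < height) ? 1 : 0 }
    }
}

// Example: a 1700x2200 page scan fitted into the 512x512 tile.
let size = fittedSize(width: 1700, height: 2200)
let pixelMask = attentionMask(width: size.width, height: size.height)
```

For a portrait page the height fills the tile and the mask marks the unused columns on the right as padding.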

Single-tile v1. This release ships the single 512px global-image document path: lightweight, iPhone-friendly, and accurate on typical pages. The model's full high-resolution mode (split a page into multiple 512px tiles + the global image, 800+ doc tokens) is a planned follow-up for dense small-print documents.

Repo layout

query/   colmodernvbert-query_float16_s32_static.aimodel + tokenizer/   (298 MB, fp16, iPhone)
doc/     colmodernvbert-doc_float16_s89_static.aimodel                  (407 MB, fp16, iPhone)
fp32/query/  colmodernvbert-query_float32_s32_static.aimodel + tokenizer/  (595 MB, Mac)
fp32/doc/    colmodernvbert-doc_float32_s89_static.aimodel                 (813 MB, Mac)
README.md · reference_query.json · reference_doc.json · test_doc.png

Each query/ and doc/ directory is a complete bundle root (one .aimodel, plus tokenizer/ on the query side). fp16 ships for iPhone (~705 MB for both encoders); fp32 is for Mac / max precision.

On-device (CoreAIKit)

import CoreAIKitEmbeddings

// Downloads query/ + doc/ (fp16) from this repo, or uses a sideloaded copy if present.
let retriever = try await VisualDocumentRetriever()   // .colModernVBERTQuery / .colModernVBERTDoc

// Encode a page as tiles (reliable spatial grounding), rank queries, and locate the match.
let page = try await retriever.encodeTiled(page: cgImage, rows: 6, cols: 4)
let q = try await retriever.encode(query: "total revenue in the third quarter")
let score = retriever.score(query: q, tiledPage: page)     // MaxSim, page ranking
let rect  = retriever.bestTile(query: q, tiledPage: page)  // normalized region to highlight

See Examples/DocSearch for a full iPhone demo (bundled + imported documents, query → ranked pages → highlighted region).
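To make the idea of a "normalized region to highlight" concrete, here is one way a best-scoring tile could be mapped back to page coordinates. This is a sketch under assumptions, not how bestTile is implemented: NormalizedRect, tileRect and bestTileIndex are hypothetical names, and tiles are assumed to be stored row-major with the origin at the top-left.

```swift
/// Rectangle in 0...1 page coordinates, origin at the top-left.
struct NormalizedRect: Equatable {
    var x: Double
    var y: Double
    var width: Double
    var height: Double
}

/// Region covered by the tile at `index` in a row-major rows x cols grid.
func tileRect(index: Int, rows: Int, cols: Int) -> NormalizedRect {
    let row = index / cols
    let col = index % cols
    let w = 1.0 / Double(cols)
    let h = 1.0 / Double(rows)
    return NormalizedRect(x: Double(col) * w, y: Double(row) * h, width: w, height: h)
}

/// Index of the highest per-tile score, or nil for an empty grid.
func bestTileIndex(scores: [Float]) -> Int? {
    scores.indices.max(by: { scores[$0] < scores[$1] })
}

// Example: a 6x4 grid (24 tiles) where tile 9 scores highest.
var tileScores = [Float](repeating: 0.1, count: 24)
tileScores[9] = 0.9
let best = bestTileIndex(scores: tileScores)
let rect = tileRect(index: best ?? 0, rows: 6, cols: 4)
```

Tile 9 in a 6×4 grid is row 2, column 1, so the highlight starts a quarter of the way across and a third of the way down the page.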

Parity (Core AI engine vs. PyTorch reference, M4 Max GPU)

Per-token cosine of the 128-d multi-vectors against the colpali_engine PyTorch model:

encoder | float32 | float16
query | min/mean 1.000000 | min 0.999997 / mean 0.999999
doc | min/mean 1.000000 | min 0.999994 / mean 0.999998

End-to-end retrieval: the host MaxSim reproduces processor.score exactly (max |Δ| = 0.0000), the engine ranking matches the PyTorch ranking on every clear-margin query, and the single-tile engine retrieves the intended page 3/3 on a rendered-text corpus.
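The per-token cosine metric reported above can be computed along these lines. This is a minimal sketch, assuming the engine and reference embeddings are available as equal-length [[Float]] arrays; cosine and parity are hypothetical names and the toy vectors are 2-d stand-ins for the 128-d outputs.

```swift
/// Cosine similarity of two vectors; returns 0 if either has zero norm.
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in 0..<min(a.count, b.count) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = (normA * normB).squareRoot()
    return denom > 0 ? dot / denom : 0
}

/// Min and mean of the token-by-token cosine between two embedding sets.
/// Tokens are paired by position; returns nil when there is nothing to compare.
func parity(engine: [[Float]], reference: [[Float]]) -> (min: Float, mean: Float)? {
    let sims = zip(engine, reference).map { pair in cosine(pair.0, pair.1) }
    guard let lowest = sims.min() else { return nil }
    let mean = sims.reduce(0, +) / Float(sims.count)
    return (min: lowest, mean: mean)
}

// Toy example: the first token matches exactly, the second is slightly off.
let engineOut: [[Float]] = [[1, 0], [0, 1]]
let referenceOut: [[Float]] = [[1, 0], [0.6, 0.8]]
let result = parity(engine: engineOut, reference: referenceOut)
```

Reporting the minimum alongside the mean matters because a single badly drifting token can hide behind a near-perfect average.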

License

MIT, inherited from ModernVBERT/colmodernvbert. See the upstream model and paper ModernVBERT: Towards Smaller Visual Document Retrievers (arXiv:2510.01149).
