Qwen3-VL 2B: Core AI (.aimodel)
The first vision-language model on Apple's Core AI framework (iOS 27 /
macOS 27): Qwen/Qwen3-VL-2B-Instruct converted to .aimodel, running
image+text → text fully on the GPU via Apple's coreai-pipelined engine,
with zero custom kernels.
Part of the CoreAI-Model-Zoo; full card with the conversion design: zoo/qwen3-vl.md.
Qwen3-VL 2B on iPhone 17 Pro, in the zoo's CoreAIChat app, at real speed.
Use it
▶️ Run it (source) with the VLChat runner (GUI + CLI, one app for every vision-language model in the catalog):
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/VLChat/VLChat.xcodeproj
# Press Run, then pick "Qwen3-VL 2B" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/VLChat
swift run vlchat-cli --model qwen3-vl-2b --image sample.jpg --prompt "What is in this image?"
💻 Build with it. The snippet is complete: the glue is kit API, and it runs as copy-pasted:
import CoreAIKit
import FoundationModels
let vlm = try await KitVisionModel(catalog: "qwen3-vl-2b")
let session = LanguageModelSession(model: vlm)
// imageURL (a file URL) and prompt (your question text) come from the caller
let image = try ImageFile.load(imageURL) // any image file → CGImage + EXIF orientation
let reply = try await session.respond(to: Prompt {
    prompt
    Attachment(image.cgImage, orientation: image.orientation)
})
// reply.content: the answer about the image, generated fully on-device
The take-home is Examples/VLChat/Sources/QuickStart.swift:
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same KitVisionModel(catalog:) behind a LanguageModelSession.
Multi-turn about the same image? Hold the LanguageModelSession and call respond(to:)
per turn. The photo picker / file chooser is your app's own chrome; ImageFile.load
(kit API) turns any image file into model input.
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit (product CoreAIKit)
- Info.plist: NSPhotoLibraryUsageDescription, only if you use PhotosPicker
- Entitlements (iOS): com.apple.developer.kernel.increased-memory-limit
- First run downloads the model (3.3 GB on Mac, 3.3 GB on iPhone), then it loads from the local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
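For the SPM item, a Package.swift along these lines should be enough to pull in the kit. This is a sketch under assumptions: the target name MyVisionApp is a placeholder, the main branch pin is a guess (use a tagged release if the repo publishes one), and the platform floor is taken from the iOS 27 / macOS 27 requirement stated at the top of this card.

```swift
// swift-tools-version:5.9
// Sketch only: "MyVisionApp" and the branch pin are placeholders, not taken from the kit's docs.
import PackageDescription

let package = Package(
    name: "MyVisionApp",
    // Assumed from the card's iOS 27 / macOS 27 requirement.
    platforms: [.iOS("27.0"), .macOS("27.0")],
    dependencies: [
        // Assumption: tracking main; prefer a version requirement if tags exist.
        .package(url: "https://github.com/john-rocky/coreai-kit", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyVisionApp",
            dependencies: [
                // Product name as listed in the checklist above.
                .product(name: "CoreAIKit", package: "coreai-kit"),
            ]
        ),
    ]
)
```

In an Xcode app project the same dependency is added through File > Add Package Dependencies with the URL above; the Info.plist key and the entitlement are then set on the app target itself.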
Measured
| platform | prefill tok/s | decode tok/s | numerics |
|---|---|---|---|
| M4 Max (macOS 27 beta) | 191.0 | 187.6 | full multimodal oracle gates vs fp32-HF PASS |
| iPhone 17 Pro (iOS 27 beta, settled) | 33.9 | 33.3 | text + image prompts 24/24 × 8 runs, token-identical to Mac (~92% of the naive BW ceiling) |
Vision encode: ~60-80 ms/image (Mac GPU). Device cold load 12.3 s (on-device GPU specialization, no AOT), warm 0.6–5 s. The 2.3 GB decoder wants the increased-memory entitlement on iPhone.
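To read the "naive BW ceiling" figure, it can help to turn decode rates into implied memory traffic. The sketch below assumes the simplest possible model: every decoded token streams the whole 2.3 GB decoder once. Which byte count and which bandwidth figure the ~92% was actually measured against is not stated here, so treat the outputs as a sanity check, not as the card's own calculation.

```swift
// Back-of-envelope: weight traffic implied by a decode rate.
// Assumption: each decoded token reads the full decoder bundle exactly once.
let decoderGB = 2.3   // decoder size from the Files table

/// GB/s of weight traffic implied by a decode rate under the assumption above.
func impliedBandwidthGBps(decodeTokPerSec: Double, weightsGB: Double) -> Double {
    decodeTokPerSec * weightsGB
}

/// Round to one decimal place without pulling in Foundation.
func round1(_ x: Double) -> Double {
    (x * 10).rounded() / 10
}

// Decode rates from the Measured table.
let iphone = impliedBandwidthGBps(decodeTokPerSec: 33.3, weightsGB: decoderGB)
let mac = impliedBandwidthGBps(decodeTokPerSec: 187.6, weightsGB: decoderGB)

// If 33.3 tok/s is ~92% of the ceiling, the ceiling under this model is:
let iphoneCeilingTokPerSec = 33.3 / 0.92
let iphoneCeilingGBps = iphoneCeilingTokPerSec * decoderGB

print("iPhone implied traffic: \(round1(iphone)) GB/s")
print("Mac implied traffic: \(round1(mac)) GB/s")
print("iPhone ceiling: \(round1(iphoneCeilingTokPerSec)) tok/s, \(round1(iphoneCeilingGBps)) GB/s")
```

Under this model, decode speed scales inversely with the bytes read per token, which is why the smaller alternative decoder in the Files table is not automatically faster: its head is a different trade-off, not just fewer bytes.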
Files
| path | what | size |
|---|---|---|
| gpu-pipelined/qwen3_vl_2b_instruct_decode_int8hu_s1/ | text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) | 2.3 GB |
| gpu-pipelined/qwen3_vl_2b_instruct_vision/ | fixed-grid vision encoder (448×448 → 196 tokens + DeepStack), fp16 | 0.77 GB |
| gpu-pipelined/qwen3_vl_2b_instruct_decode_int8lin_s1/ | decoder alt: tied fp16 head (slower, smaller-RAM-spike option) | 2.0 GB |
How it works (short version)
The text-only pipelined engine carries the VLM through an id-space trick, with no engine code changes beyond the published static-inputs patch:
- the vision encoder runs once per image; its embeddings ride 4 static graph inputs (rewritable owned MTLBuffers, ~3 MB),
- the prompt's <|image_pad|> ids become extension ids vocab + slot; the graph selects text-table vs image-embed rows per token and applies the three DeepStack adds the same way,
- interleaved M-RoPE is derived in-graph from (ids, position) alone: image tokens self-locate, text tokens use a host-set shift; with zero embeds the same bundle is a plain Qwen3 text LLM.
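The id-space trick and the position rule can be sketched on toy sizes in plain Swift. Everything here is illustrative: the function names, the 8-entry vocabulary, and the 1-wide embedding rows are made up for readability, and the real logic lives inside the converted graph, not in host code. The position rule follows the Qwen2-VL-style get_rope_index layout as I understand it (one shared t index per image, h and w offset by row and column) and assumes the image tokens are contiguous and in slot order; the DeepStack adds are not shown.

```swift
// Toy sizes; the real vocabulary is far larger.
let vocab = 8
let gridSide = 14   // 14 x 14 = 196 image tokens, matching the vision encoder's output

/// Pick the embedding row for one token id: ids below `vocab` read the text table,
/// ids at or above it are extension ids and read image slot `id - vocab`.
func embeddingRow(for id: Int, textTable: [[Float]], imageEmbeds: [[Float]]) -> [Float] {
    id < vocab ? textTable[id] : imageEmbeds[id - vocab]
}

/// 3-axis (t, h, w) position from (id, position) alone.
/// An image token with slot `s` at sequence index `position` knows its image
/// started at `position - s`, so it needs no extra side input.
func mropePosition(id: Int, position: Int, textShift: Int) -> (t: Int, h: Int, w: Int) {
    guard id >= vocab else {
        let p = position + textShift   // text: all three axes share one shifted index
        return (p, p, p)
    }
    let slot = id - vocab
    let start = position - slot
    return (start, start + slot / gridSide, start + slot % gridSide)
}

// 1-wide rows so the selected row is easy to read.
let textTable: [[Float]] = (0..<vocab).map { [Float($0)] }
let imageEmbeds: [[Float]] = (0..<gridSide * gridSide).map { [100 + Float($0)] }

print(embeddingRow(for: 3, textTable: textTable, imageEmbeds: imageEmbeds))
print(embeddingRow(for: vocab + 5, textTable: textTable, imageEmbeds: imageEmbeds))

// Layout: 2 text tokens, 196 image tokens at indices 2...197, then text at index 198.
// After the image, text resumes at max(h, w) + 1, so the shift is 14 - 196.
let shiftAfterImage = gridSide - gridSide * gridSide
let lastImage = mropePosition(id: vocab + 195, position: 197, textShift: 0)
let nextText = mropePosition(id: 3, position: 198, textShift: shiftAfterImage)
print(lastImage, nextText)
```

With an all-zero imageEmbeds and no extension ids in the prompt, only the first branch of each function is ever taken, which is the sense in which the same bundle behaves as a plain text LLM.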
Numerics are gated the zoo way: fp32-HF oracle → torch ladder (position
formula exact vs get_rope_index, 28/28 layers) → .aimodel GPU gates →
engine ≡ python 24/24 → device 24/24.
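A token-identical gate of the kind counted as "24/24" above can be as simple as an exact comparison of token id streams per prompt. The sketch below is a hypothetical illustration in plain Swift, not the zoo's actual gate code; GateResult, tokenIdentityGate, and passCount are names invented here.

```swift
/// Result of comparing two token streams for one prompt.
struct GateResult {
    let passed: Bool
    let firstMismatch: Int?   // index of the first differing token, if any
}

/// Passes only if both runs emit exactly the same token ids, in the same order.
func tokenIdentityGate(reference: [Int], candidate: [Int]) -> GateResult {
    let mismatch = zip(reference, candidate)
        .enumerated()
        .first(where: { $0.element.0 != $0.element.1 })
    if let mismatch = mismatch {
        return GateResult(passed: false, firstMismatch: mismatch.offset)
    }
    // Identical common prefix: a length difference is still a failure.
    if reference.count != candidate.count {
        return GateResult(passed: false, firstMismatch: min(reference.count, candidate.count))
    }
    return GateResult(passed: true, firstMismatch: nil)
}

/// Number of prompts whose candidate stream matches its reference exactly.
func passCount(_ pairs: [(reference: [Int], candidate: [Int])]) -> Int {
    pairs.filter { tokenIdentityGate(reference: $0.reference, candidate: $0.candidate).passed }.count
}

// Made-up token ids for three prompts: one match, one divergence, one truncation.
let pairs: [(reference: [Int], candidate: [Int])] = [
    (reference: [5, 9, 2], candidate: [5, 9, 2]),
    (reference: [5, 9, 2], candidate: [5, 7, 2]),
    (reference: [5, 9, 2], candidate: [5, 9]),
]
print("\(passCount(pairs))/\(pairs.count) prompts token-identical")
```

Reporting the first mismatch index makes a failing gate easier to localize: an early divergence usually points at prefill or position handling, a late one at accumulated numeric drift.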
Run it
The zoo's apps/CoreAIChat (iOS) has a Qwen3-VL mode with a photo picker
and downloads this repo in-app. For the run contract (S=1 prefill,
COREAI_CHUNK_THRESHOLD=1, never engine.warmup()), see
knowledge/pipelined-engine.md.
Conversion is reproducible from the zoo:
conversion/export_qwen3_vl_pipelined.py int8hu.
License
Apache-2.0 (inherited from Qwen3-VL-2B-Instruct). Conversion code BSD-3-Clause (zoo repo).