Qwen3-VL 2B — Core AI (`.aimodel`)

The first vision-language model on Apple's Core AI framework (iOS 27 / macOS 27): Qwen/Qwen3-VL-2B-Instruct converted to .aimodel, running image+text → text fully on the GPU via Apple's coreai-pipelined engine — zero custom kernels.

Part of the CoreAI-Model-Zoo; full card with the conversion design: zoo/qwen3-vl.md.

Qwen3-VL 2B on iPhone 17 Pro — in the zoo's CoreAIChat app, real speed.

Use it

▶️ Run it (source) — the VLChat runner (GUI + CLI, one app for every vision-language model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/VLChat/VLChat.xcodeproj
# → Run, then pick "Qwen3-VL 2B" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/VLChat
swift run vlchat-cli --model qwen3-vl-2b --image sample.jpg --prompt "What is in this image?"

💻 Build with it. The snippet is complete: all of the glue is kit API, so it runs when pasted as-is:

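import CoreAIKit
import Foundation
import FoundationModels

// Inputs the snippet leaves to the caller (values borrowed from the CLI example above).
let imageURL = URL(fileURLWithPath: "sample.jpg")
let prompt = "What is in this image?"

// Load the catalog model and wrap it in a session.
let vlm = try await KitVisionModel(catalog: "qwen3-vl-2b")
let session = LanguageModelSession(model: vlm)
let image = try ImageFile.load(imageURL)  // any image file → CGImage + EXIF orientation
// One turn: the text prompt plus the image attachment.
let reply = try await session.respond(to: Prompt {
    prompt
    Attachment(image.cgImage, orientation: image.orientation)
})
// reply.content: the answer about the image, generated fully on-device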

The take-home is Examples/VLChat/Sources/QuickStart.swift — this exact code as one typed function, no UI; the CLI is an argument shell over it, and the GUI drives the same KitVisionModel(catalog:) behind a LanguageModelSession. Multi-turn about the same image? Hold the LanguageModelSession and call respond(to:) per turn. The photo picker / file chooser is your app's own chrome — ImageFile.load (kit API) turns any image file into model input.

Integration checklist

SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
Info.plist: NSPhotoLibraryUsageDescription — only if you use PhotosPicker
Entitlements (iOS): com.apple.developer.kernel.increased-memory-limit
First run downloads the model (3.3 GB on both Mac and iPhone); later runs load it from the local cache in Application Support. Download progress is reported through the downloadProgress callback.
Measure in Release — Debug is ~3× slower on per-token host work
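
The SPM line in the checklist could translate into a manifest along these lines. Only the repository URL and the product name `CoreAIKit` come from this card; the package and target name `MyVLMApp`, the `branch: "main"` requirement, the tools version, and the platform strings are assumptions to adjust for your project.

```swift
// swift-tools-version: 6.0
import PackageDescription

let package = Package(
    name: "MyVLMApp",  // hypothetical package name
    // Assumption: deployment targets matching the OS versions named in this card.
    platforms: [.iOS("27.0"), .macOS("27.0")],
    dependencies: [
        // URL from the checklist; the branch name is an assumption.
        .package(url: "https://github.com/john-rocky/coreai-kit", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyVLMApp",  // hypothetical target name
            dependencies: [
                // Product name from the checklist; the package identity
                // is derived from the last path component of the URL.
                .product(name: "CoreAIKit", package: "coreai-kit"),
            ]
        ),
    ]
)
```

In an Xcode app project the same dependency would normally be added through the package UI rather than a hand-written manifest.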

Measured

platform	prefill tok/s	decode tok/s	numerics
M4 Max (macOS 27 beta)	191.0	187.6	full multimodal oracle gates vs fp32-HF PASS
iPhone 17 Pro (iOS 27 beta, settled)	33.9	33.3	text + image prompts 24/24 × 8 runs, token-identical to Mac (~92% of the naive BW ceiling)

Vision encode: ~60–80 ms/image (Mac GPU). On-device cold load takes 12.3 s (on-device GPU specialization, no AOT); warm load takes 0.6–5 s. The 2.3 GB decoder needs the increased-memory entitlement on iPhone.
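
The "~92% of the naive BW ceiling" figure can be unpacked with a little arithmetic. The sketch below assumes the usual roofline reading, in which a memory-bound decoder emits at most (memory bandwidth) / (bytes read per token) tokens per second. That reading, the helper name `decodeCeiling`, and the 100 GB/s and 2.0 GB inputs are assumptions for illustration; only 33.3 tok/s and 92% come from the table.

```swift
import Foundation

/// Naive bandwidth-bound decode ceiling in tok/s: each generated token is
/// assumed to stream `bytesPerToken` bytes of weights from memory once.
/// Hypothetical helper, not part of any kit API.
func decodeCeiling(bandwidthBytesPerSec: Double, bytesPerToken: Double) -> Double {
    bandwidthBytesPerSec / bytesPerToken
}

// From the table: 33.3 tok/s measured at roughly 92% of the ceiling.
let measuredTokPerSec = 33.3
let fractionOfCeiling = 0.92
let impliedCeiling = measuredTokPerSec / fractionOfCeiling
print(String(format: "implied ceiling: %.1f tok/s", impliedCeiling))

// Illustrative inputs only (not device specs): 100 GB/s over 2.0 GB per token.
let example = decodeCeiling(bandwidthBytesPerSec: 100e9, bytesPerToken: 2.0e9)
print(String(format: "example ceiling: %.1f tok/s", example))
```

The measured iPhone number implies a ceiling of about 36.2 tok/s, so the remaining headroom from bandwidth alone is small.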

Files

path	what	size
`gpu-pipelined/qwen3_vl_2b_instruct_decode_int8hu_s1/`	text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included)	2.3 GB
`gpu-pipelined/qwen3_vl_2b_instruct_vision/`	fixed-grid vision encoder (448×448 → 196 tokens + DeepStack), fp16	0.77 GB
`gpu-pipelined/qwen3_vl_2b_instruct_decode_int8lin_s1/`	decoder alt: tied fp16 head (slower, smaller-RAM-spike option)	2.0 GB

How it works (short version)

The text-only pipelined engine carries the VLM through an id-space trick — no engine code changes beyond the published static-inputs patch:

the vision encoder runs once per image; its embeddings ride 4 static graph inputs (rewritable owned MTLBuffers, ~3 MB),
the prompt's <|image_pad|> ids become extension ids vocab + slot; the graph selects text-table vs image-embed rows per token and applies the three DeepStack adds the same way,
interleaved M-RoPE is derived in-graph from (ids, position) alone — image tokens self-locate, text tokens use a host-set shift; with zero embeds the same bundle is a plain Qwen3 text LLM.
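
The id-space trick in the first two points can be illustrated on toy arrays. This is a minimal host-side sketch, not the graph itself: the function names, the toy vocabulary of 8, and the `imagePadID` value are hypothetical, and the real graph presumably does the selection with branch-free tensor ops rather than an `if`.

```swift
/// Rewrite each image-pad id to an extension id `vocab + slot`,
/// where `slot` counts image tokens in order of appearance.
func toExtensionIDs(_ ids: [Int], imagePadID: Int, vocab: Int) -> [Int] {
    var slot = 0
    var out: [Int] = []
    out.reserveCapacity(ids.count)
    for id in ids {
        if id == imagePadID {
            out.append(vocab + slot)
            slot += 1
        } else {
            out.append(id)
        }
    }
    return out
}

/// Pick the input embedding row for one id: the text table below `vocab`,
/// image-embed row `id - vocab` at or above it.
func embeddingRow(for id: Int, vocab: Int,
                  textTable: [[Float]], imageEmbeds: [[Float]]) -> [Float] {
    id < vocab ? textTable[id] : imageEmbeds[id - vocab]
}

/// DeepStack-style add for one layer: image tokens receive that layer's
/// extra visual feature row, text tokens receive zeros.
func deepStackAdd(for id: Int, vocab: Int,
                  layerFeatures: [[Float]], width: Int) -> [Float] {
    id < vocab ? [Float](repeating: 0, count: width) : layerFeatures[id - vocab]
}

// Toy sizes: vocabulary of 8, embedding width 2, two image tokens.
let vocab = 8
let imagePadID = 5  // hypothetical stand-in for the <|image_pad|> id
let textTable: [[Float]] = (0..<vocab).map { [Float($0), 0] }
let imageEmbeds: [[Float]] = [[100, 1], [200, 2]]

let ids = toExtensionIDs([1, imagePadID, imagePadID, 2],
                         imagePadID: imagePadID, vocab: vocab)
let rows = ids.map {
    embeddingRow(for: $0, vocab: vocab, textTable: textTable, imageEmbeds: imageEmbeds)
}
print(ids)   // [1, 8, 9, 2]
print(rows)  // [[1.0, 0.0], [100.0, 1.0], [200.0, 2.0], [2.0, 0.0]]
```

Because every id below `vocab` takes the text path, a prompt with no image tokens never touches the image-embed inputs, which is consistent with the same bundle doubling as a plain text LLM.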

Numerics are gated the zoo way: fp32-HF oracle → torch ladder (position formula exact vs get_rope_index, 28/28 layers) → .aimodel GPU gates → engine ≡ python 24/24 → device 24/24.
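
As a rough illustration of how positions can follow from (ids, position) alone, the sketch below works through a single image preceded only by text. It assumes the position layout used by the Hugging Face `get_rope_index` convention as I understand it: text tokens carry the same value on all three axes, image tokens share the image's start position on the temporal axis and add their row and column on the other two, and the text after an image resumes at the largest image position plus one. The type and function names are hypothetical, and the interleaving of frequency channels across the axes is not modeled.

```swift
/// The three M-RoPE position components for one token.
struct MRoPEPosition: Equatable {
    let t: Int
    let h: Int
    let w: Int
}

/// Position for the token at sequence `index`.
/// Image tokens (id >= vocab) recover their slot from the id and locate
/// themselves; text tokens apply a shift supplied by the host.
/// Assumes a single image with only text before it.
func mropePosition(id: Int, index: Int, vocab: Int,
                   gridWidth: Int, textShift: Int) -> MRoPEPosition {
    if id >= vocab {
        let slot = id - vocab
        let start = index - slot  // sequence index of the image's first token
        return MRoPEPosition(t: start,
                             h: start + slot / gridWidth,
                             w: start + slot % gridWidth)
    }
    let p = index + textShift
    return MRoPEPosition(t: p, h: p, w: p)
}

/// Shift for text that follows one image: the image occupies
/// gridHeight * gridWidth tokens but advances positions by only
/// max(gridHeight, gridWidth).
func textShiftAfterImage(gridHeight: Int, gridWidth: Int) -> Int {
    max(gridHeight, gridWidth) - gridHeight * gridWidth
}

// Toy case: vocabulary of 100, one 2×2 image between two text tokens.
let toyVocab = 100
let toyIDs = [7, 100, 101, 102, 103, 9]
let shiftAfter = textShiftAfterImage(gridHeight: 2, gridWidth: 2)  // -2
let positions = toyIDs.enumerated().map { index, id in
    mropePosition(id: id, index: index, vocab: toyVocab, gridWidth: 2,
                  textShift: index == 0 ? 0 : shiftAfter)
}
for p in positions { print(p.t, p.h, p.w) }
```

Under the same assumptions, the 14×14 grid behind the encoder's 196 tokens would give a text shift of 14 − 196 = −182 after the image.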

Run it

The zoo's apps/CoreAIChat (iOS) has a Qwen3-VL mode with a photo picker and downloads this repo in-app. For the run contract (S=1 prefill, COREAI_CHUNK_THRESHOLD=1, never engine.warmup()), see knowledge/pipelined-engine.md.

Conversion is reproducible from the zoo: conversion/export_qwen3_vl_pipelined.py int8hu.

License

Apache-2.0 (inherited from Qwen3-VL-2B-Instruct). Conversion code BSD-3-Clause (zoo repo).

Model tree for mlboydaisuke/Qwen3-VL-2B-CoreAI: base model Qwen/Qwen3-VL-2B-Instruct.