Gemma 4 E4B (text) — Apple Core AI (`.aimodel`)

Gemma 4 E4B's text decoder, converted to Apple's Core AI (the Core ML successor announced at WWDC26). It runs on iOS 27 / macOS 27 through Apple's coreai-pipelined GPU engine with zero custom kernels. The greedy oracle is 8/8 exact against the fp32 Hugging Face reference on both the Mac GPU and the iPhone GPU, and on the determinism probe the iPhone is 24/24 token-identical to the Mac.
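
A check like the oracle comes down to comparing greedy token IDs against a reference run, prompt by prompt. A minimal sketch of that comparison, assuming both runs are available as arrays of token IDs; the function names and the sample IDs are hypothetical, and this is not the zoo's actual harness:

```swift
/// Index of the first position where two token sequences differ,
/// or nil when they are identical (same length, same IDs).
func firstDivergence(_ reference: [Int], _ candidate: [Int]) -> Int? {
    let shared = min(reference.count, candidate.count)
    for i in 0..<shared where reference[i] != candidate[i] {
        return i
    }
    // One sequence is a strict prefix of the other: they diverge at its end.
    return reference.count == candidate.count ? nil : shared
}

/// Counts prompts whose candidate tokens match the reference exactly.
func exactMatches(reference: [[Int]], candidate: [[Int]]) -> (exact: Int, total: Int) {
    precondition(reference.count == candidate.count, "one candidate run per reference run")
    let exact = zip(reference, candidate)
        .filter { firstDivergence($0.0, $0.1) == nil }
        .count
    return (exact, reference.count)
}

// Made-up token IDs, for illustration only.
let referenceRuns = [[2, 10, 11, 12], [2, 40, 41]]
let candidateRuns = [[2, 10, 11, 12], [2, 40, 99]]
let score = exactMatches(reference: referenceRuns, candidate: candidateRuns)
print("oracle: \(score.exact)/\(score.total) exact")
```

Reporting the first divergence index, rather than only pass/fail, makes it easier to tell an early numerical drift from a late one.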

Converted directly from Google's official QAT release google/gemma-4-E4B-it-qat-q4_0-unquantized: bf16 weights trained for q4_0 rounding. q4_0 is also this bundle's quantization class (per-block-32 absmax linear int4). Google publishes these checkpoints as "preserving similar quality to bfloat16", so this int4 conversion inherits that quality claim by design rather than through post-hoc gating.
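
To make "per-block absmax linear int4" concrete, here is a minimal sketch of a generic symmetric variant: each block shares one scale derived from its largest magnitude, and values round to signed 4-bit codes. The type and function names are hypothetical, and the exact q4_0 storage layout and rounding convention used by the bundle may differ from this illustration:

```swift
/// One quantized block: a shared scale plus signed 4-bit codes in -8...7.
struct Int4Block {
    let scale: Float
    let codes: [Int8]
}

/// Symmetric absmax quantization of one block (32 values in this bundle's class).
func quantizeBlock(_ values: [Float]) -> Int4Block {
    let absmax = values.map { abs($0) }.max() ?? 0
    // Map the largest magnitude onto code 7; an all-zero block keeps scale 0.
    let scale = absmax / 7
    let codes = values.map { value -> Int8 in
        guard scale > 0 else { return 0 }
        let rounded = (value / scale).rounded()
        return Int8(max(-8, min(7, rounded)))
    }
    return Int4Block(scale: scale, codes: codes)
}

/// Linear dequantization: value ≈ code * scale.
func dequantize(_ block: Int4Block) -> [Float] {
    block.codes.map { Float($0) * block.scale }
}

// Toy block of 32 weights spanning -1.0 ... 0.9375.
let weights: [Float] = (0..<32).map { Float($0 - 16) / 16 }
let block = quantizeBlock(weights)
let restored = dequantize(block)
let worst = zip(weights, restored).map { abs($0 - $1) }.max() ?? 0
print("scale \(block.scale), worst round-trip error \(worst)")
```

With this scheme the round-trip error of any value is bounded by half the block's scale, which is the rounding a QAT checkpoint is trained to tolerate.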

Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, engine patch stack: coreai-model-zoo — model card: zoo/gemma4-e4b.md.

Use it

▶️ Run it (source) — the ChatDemo runner (GUI + CLI, one app for every chat model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
# → Run, then pick "Gemma 4 E4B" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/ChatDemo
swift run chat-cli --model gemma-4-e4b --prompt "What can you do, offline?"

💻 Build with it: a complete example. The glue is all kit API, so it runs as pasted:

import CoreAIKit

let prompt = "What can you do, offline?"  // any user text
let chat = try await ChatSession(catalog: "gemma-4-e4b")
let reply = try await chat.respond(to: prompt)
// reply: the answer, generated fully on-device

The take-home is Examples/ChatDemo/Sources/QuickStart.swift — this exact code as one typed function, no UI; the CLI is an argument shell over it, and the GUI drives the same ChatSession across turns for its transcript. Multi-turn? Hold the ChatSession and call respond(to:) per turn — it keeps the conversation history; streamResponse(to:) yields tokens as they decode.

Integration checklist

SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
Info.plist: none needed
Entitlements: none needed (macOS)
First run downloads the model — 7.6 GB (Mac) — then it loads from the local cache (Application Support; progress via the downloadProgress callback)
Measure in Release — Debug is ~3× slower on per-token host work
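
For a package-based project, the SPM line above might translate into a manifest like the following. This is a sketch under assumptions: the package and target name `MyChatApp` is hypothetical, tracking the `main` branch is a guess (pin a tag or revision if the repo publishes one), and the tools version may need to be newer for the beta toolchain.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyChatApp", // hypothetical package name
    platforms: [
        // String-based versions, since the model requires the iOS 27 / macOS 27 beta.
        .iOS("27.0"),
        .macOS("27.0"),
    ],
    dependencies: [
        // Assumption: tracking `main`; the branch name is not confirmed by this card.
        .package(url: "https://github.com/john-rocky/coreai-kit", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyChatApp",
            dependencies: [
                // Product name as given in the checklist.
                .product(name: "CoreAIKit", package: "coreai-kit"),
            ]
        ),
    ]
)
```

In an Xcode app project the same dependency is added through File > Add Package Dependencies with the URL above, then selecting the CoreAIKit product.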

Measured (greedy; M4 Max / iPhone 17 Pro, settled device)

config	files	size	M4 Max decode / prefill	iPhone decode / prefill
★ provider (runs BOTH platforms)	`gpu-pipelined/gemma4_e4b_qat_decode_int4lin/` + `ios-frontend/gemma4_e4b_qat_gather_raw/`	3.7 + 3.4 GB	53.2 / 62.6	15.1 / 21.3
★ provider, iPhone-ready AOT	`gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/` (precompiled `.aimodelc`, h18p = iPhone 17 Pro class only) + the same tables	3.7 + 3.4 GB	—	same as the row above; precompiled, so you skip the AOT step
tbl (Mac-fastest)	`gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/` + the two `embed_per_layer.*` table files	3.7 + 2.7 GB	55.8 / 61.0	not viable (3.7 GB graph + 2.7 GB owned tables = 6.4 GB, which leaves no room under the ~6.4 GB entitled limit)

On iPhone the working set stays small: measured peak footprint is 2.2 GB, leaving 4.2 GB of headroom under the ~6.4 GB entitled limit. The per-layer-embedding (PLE) table rides as a clean mmap and the AOT executable pages are evictable. Decode and prefill both match the bandwidth model (~2.1 GB of int4 data read per token).
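
The memory and bandwidth figures above can be cross-checked with a few lines of arithmetic. This sketch only recombines numbers quoted in this card; it assumes the table's decode figures are tokens per second, and the implied bandwidth is an effective read rate under the card's model, not a hardware specification:

```swift
import Foundation

// Figures quoted in this card, in GB.
let entitledLimit = 6.4    // approximate entitled memory limit on iPhone
let peakFootprint = 2.2    // measured peak footprint, provider bundle
let tblGraph = 3.7         // tbl bundle graph
let tblOwnedTables = 2.7   // tbl bundle owned tables

let headroom = entitledLimit - peakFootprint
let tblResident = tblGraph + tblOwnedTables

/// Effective read rate (GB/s) implied by a decode rate under the bandwidth model.
func impliedBandwidth(tokensPerSecond: Double, gbPerToken: Double = 2.1) -> Double {
    tokensPerSecond * gbPerToken
}

print(String(format: "provider headroom: %.1f GB", headroom))
print(String(format: "tbl resident: %.1f GB of a ~%.1f GB limit", tblResident, entitledLimit))
print(String(format: "M4 Max decode implies %.1f GB/s", impliedBandwidth(tokensPerSecond: 53.2)))
print(String(format: "iPhone decode implies %.1f GB/s", impliedBandwidth(tokensPerSecond: 15.1)))
```

The tbl bundle's owned tables count fully against the limit, while the provider bundle's mmap-backed table does not, which is why only the latter fits on iPhone.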

What E4B is (config + checkpoint verified)

Clean dense model — no MoE. 42 layers (full attention every 6th), hidden 2560, intermediate 10240 uniform, 8 query heads / 2 KV heads, dual head_dim 256/512, 18 KV-shared layers (the engine bundle stacks the 24 non-shared layers into ONE unified padded KV pair), per-layer embeddings (the [262144, 10752] int8 table ships in ios-frontend/gemma4_e4b_qat_gather_raw/), final-logit softcap 30. The QAT checkpoint prunes the never-used KV projections on the shared layers — the zoo's loader handles both layouts.
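
Several of these numbers are tied together, and a small sketch makes the relationships explicit. The type and field names are hypothetical; the per-layer width of 256 is taken from the `ple_tokens [1,1,42,256]` shape in the run contract, and the full-attention count assumes exactly every 6th layer is a full-attention layer:

```swift
/// Shape facts quoted in this card; the type and field names are hypothetical.
struct E4BShape {
    let layers = 42
    let fullAttentionEvery = 6
    let kvSharedLayers = 18
    let vocabulary = 262_144
    let perLayerWidth = 256  // from ple_tokens [1,1,42,256]

    /// Layers that own KV cache storage (not KV-shared).
    var kvOwningLayers: Int { layers - kvSharedLayers }
    /// Full-attention layer count, assuming exactly every 6th layer.
    var fullAttentionLayers: Int { layers / fullAttentionEvery }
    /// Columns of the per-layer embedding table: one slice per layer.
    var tableColumns: Int { layers * perLayerWidth }
    /// int8 table size in bytes (one byte per entry, scales excluded).
    var tableBytes: Int { vocabulary * tableColumns }
}

let shape = E4BShape()
print("KV-owning layers: \(shape.kvOwningLayers)")
print("full-attention layers: \(shape.fullAttentionLayers)")
print("PLE table: [\(shape.vocabulary), \(shape.tableColumns)], \(shape.tableBytes) bytes")
```

The 10752 columns of the table are simply 42 layers times 256, so one table row holds the per-layer embedding slices for every layer of a single token.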

Run contract (each item is load-bearing)

Full story + traps: pipelined-engine page.

Swift stack = apple/coreai-models + the zoo's patch stack (apps/*.patch, in order). The ★ provider bundle needs EngineOptions.perTokenInputProvider (coreai-pipelined-per-token-inputs.patch); the tbl bundle needs EngineOptions.staticInputBuffers (coreai-pipelined-static-inputs.patch).
Provider mode: per token, fill ple_tokens [1,1,42,256] fp16 from the table dump — row = i8[id] * scale[id] * sqrt(256), mmap-gathered (~0.1 ms). tbl mode: bind ple_table ← embed_per_layer.i8 and ple_scale ← embed_per_layer.scale.f32 as OWNED storageModeShared MTLBuffers (buffer-backing traps in the knowledge page).
COREAI_CHUNK_THRESHOLD=1 before engine creation; never call engine.warmup() (S=1 graph; a 1-token generate after load is the warmup).
iPhone: AOT is mandatory (the 3.7 GB-constants graph crashes the on-device specializer) — use the precompiled _aotc_h18p/ bundle, or xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu --architecture h18p --expect-frequent-reshapes and point metadata.json's assets.main at the .aimodelc. Ship the com.apple.developer.kernel.increased-memory-limit entitlement as headroom insurance, and bench a settled device (a just-unlocked iPhone under-reads ~35%).
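
Two of the items above can be sketched without the engine: setting the chunk threshold before anything else, and the provider-mode row formula. This is an illustration under assumptions: `pleRow` is a hypothetical helper, the table is a plain in-memory array rather than an mmap of the shipped dump, and the output is `Float` for portability where the engine input is fp16.

```swift
import Foundation

// Set before the engine is created, per the run contract.
setenv("COREAI_CHUNK_THRESHOLD", "1", 1)

/// Dequantizes one token's per-layer-embedding row:
/// row = i8[id] * scale[id] * sqrt(dim), laid out as [layers, dim].
func pleRow(table: [Int8], scales: [Float], tokenID: Int,
            layers: Int = 42, dim: Int = 256) -> [Float] {
    let width = layers * dim
    precondition(table.count % width == 0, "table must hold whole rows")
    precondition(scales.count == table.count / width, "one scale per token row")
    precondition(tokenID >= 0 && tokenID < scales.count, "token ID out of range")
    // One scale per token row, times the constant sqrt(dim) factor.
    let factor = scales[tokenID] * Float(dim).squareRoot()
    let start = tokenID * width
    return table[start..<(start + width)].map { Float($0) * factor }
}

// Toy table: 3 token rows, 2 layers, dim 4 (so sqrt(dim) = 2).
let toyTable: [Int8] = (0..<24).map { Int8($0) }
let toyScales: [Float] = [0.5, 1.0, 2.0]
let row = pleRow(table: toyTable, scales: toyScales, tokenID: 1, layers: 2, dim: 4)
print(row)
```

With the default `layers: 42, dim: 256` the returned row has 10752 values, matching one row of the `[262144, 10752]` table reshaped to `[42, 256]`.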

Reproduce from scratch (oracle + tables are checkpoint-derived — regenerate for any new weights): conversion/export_gemma4_decode_pipelined.py with --hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized.

License

Gemma is provided under and subject to the Gemma Terms of Use (https://ai.google.dev/gemma/terms). These .aimodel bundles are Model Derivatives of google/gemma-4-E4B-it-qat-q4_0-unquantized; by downloading or using them you agree to those terms, including the Gemma Prohibited Use Policy.

Sibling repo (E2B, incl. its own official-QAT bundles): gemma-4-E2B-CoreAI.