Gemma 4 E4B (text) - Apple Core AI (.aimodel)
Gemma 4 E4B's text decoder converted to Apple's Core AI (the Core ML successor announced
at WWDC26), running on iOS 27 / macOS 27 via Apple's coreai-pipelined GPU engine with zero
custom kernels. The greedy oracle is 8/8 exact vs the fp32 Hugging Face reference on the Mac GPU and
the iPhone GPU (iPhone is 24/24 token-identical to the Mac on the determinism probe).
Converted directly from Google's official QAT release google/gemma-4-E4B-it-qat-q4_0-unquantized: bf16 weights trained for q4_0 rounding. q4_0 is also this bundle's quantization class (per-block-32 absmax linear int4). Google publishes these checkpoints as "preserving similar quality to bfloat16", so this int4 conversion inherits that quality claim by design rather than through post-hoc gating.
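To make "per-block-32 absmax linear int4" concrete, here is a minimal sketch of that quantization class in plain Python. It is an illustration, not the zoo's conversion code: the symmetric code range [-7, 7], the `scale = absmax / 7` rule, and the function names are all assumptions, since the card does not state the bundle's exact code range or rounding convention.

```python
BLOCK = 32  # weights are grouped into blocks of 32 (per the card)

def quantize_block(values):
    """Quantize one block to (scale, int4 codes) with absmax linear scaling.

    Assumption: symmetric codes in [-7, 7] with scale = absmax / 7. The
    bundle's real code range and rounding rule are not stated in the card.
    """
    assert len(values) == BLOCK
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return 0.0, [0] * BLOCK          # all-zero block: avoid dividing by zero
    scale = absmax / 7.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Linear reconstruction: one multiply per weight."""
    return [scale * c for c in codes]

def quantize(weights):
    """Quantize a flat list whose length is a multiple of BLOCK."""
    assert len(weights) % BLOCK == 0
    return [quantize_block(weights[i:i + BLOCK])
            for i in range(0, len(weights), BLOCK)]

# Example: one block of evenly spaced weights in [-1, 1].
weights = [-1.0 + 2.0 * i / (BLOCK - 1) for i in range(BLOCK)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f} max_err={max_err:.4f} bound={scale / 2:.4f}")
```

The worst-case rounding error per weight is half a scale step. QAT training targets exactly this kind of rounding, which is why the card treats the int4 conversion as quality-preserving by construction.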
Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, and engine patch stack: coreai-model-zoo (model card:
zoo/gemma4-e4b.md).
Use it
▶️ Run it (source) with the ChatDemo runner (GUI + CLI, one app for every chat model in the catalog):
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
# Run, then pick "Gemma 4 E4B" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/ChatDemo
swift run chat-cli --model gemma-4-e4b --prompt "What can you do, offline?"
💻 Build with it (complete; the glue is kit API, so copy-paste runs):
import CoreAIKit
let chat = try await ChatSession(catalog: "gemma-4-e4b")
let reply = try await chat.respond(to: prompt)
// reply: the answer, generated fully on-device
The take-home is Examples/ChatDemo/Sources/QuickStart.swift:
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same ChatSession across turns for its transcript.
Multi-turn? Hold the ChatSession and call respond(to:) per turn; it keeps the
conversation history, and streamResponse(to:) yields tokens as they decode.
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit (product CoreAIKit)
- Info.plist: none needed
- Entitlements: none needed (macOS)
- First run downloads the model (7.6 GB on Mac); after that it loads from the local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
Measured (greedy; M4 Max / iPhone 17 Pro, settled device)
| config | files | size | M4 Max decode / prefill | iPhone decode / prefill |
|---|---|---|---|---|
| provider (runs BOTH platforms) | gpu-pipelined/gemma4_e4b_qat_decode_int4lin/ + ios-frontend/gemma4_e4b_qat_gather_raw/ | 3.7 + 3.4 GB | 53.2 / 62.6 | 15.1 / 21.3 |
| provider, iPhone-ready AOT | gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/ (precompiled .aimodelc, h18p = iPhone 17 Pro class only) + the same tables | 3.7 + 3.4 GB | n/a | same as above; skip the AOT step |
| tbl (Mac-fastest) | gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/ + the two embed_per_layer.* table files | 3.7 + 2.7 GB | 55.8 / 61.0 | not viable (3.7 GB graph + 2.7 GB owned tables > the ~6.4 GB entitled limit) |
On iPhone the working set stays small: measured peak footprint is 2.2 GB (4.2 GB headroom), because the PLE table rides as a clean mmap and the AOT executable pages are evictable. Both phases land exactly on the bandwidth model (~2.1 GB int4/token).
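The memory figures above can be checked with a few lines of arithmetic. The sketch below only recombines numbers printed in this card; the helper names are hypothetical. The implied-bandwidth helper assumes the decode / prefill columns are tokens per second and that each token streams the full ~2.1 GB int4 working set once, which the card does not spell out.

```python
# Figures copied from the card (GB, as printed there).
ENTITLED_LIMIT_GB = 6.4     # "~6.4 GB entitled limit"
PEAK_FOOTPRINT_GB = 2.2     # measured iPhone peak, provider bundle
TBL_GRAPH_GB = 3.7          # tbl bundle: graph constants
TBL_TABLES_GB = 2.7         # tbl bundle: owned PLE tables
GB_PER_TOKEN = 2.1          # "~2.1 GB int4/token"

def headroom_gb(limit_gb, footprint_gb):
    """Memory left under the limit, rounded to the card's precision."""
    return round(limit_gb - footprint_gb, 1)

def implied_bandwidth_gbps(tokens_per_s, gb_per_token=GB_PER_TOKEN):
    """Weight traffic implied by a token rate.

    Assumption: the table columns are tokens/s and every token reads the
    whole int4 working set once.
    """
    return tokens_per_s * gb_per_token

provider_headroom = headroom_gb(ENTITLED_LIMIT_GB, PEAK_FOOTPRINT_GB)
tbl_resident = round(TBL_GRAPH_GB + TBL_TABLES_GB, 1)

print("provider headroom on iPhone:", provider_headroom, "GB")
print("tbl resident set:", tbl_resident, "GB against a limit of", ENTITLED_LIMIT_GB, "GB")
print(f"iPhone decode at 15.1 tok/s implies {implied_bandwidth_gbps(15.1):.1f} GB/s")
```

The tbl bundle's owned tables cannot be paged out the way a clean mmap can, so its resident set alone already reaches the entitled limit. This is the reasoning behind the "not viable" cell in the table.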
What E4B is (config + checkpoint verified)
Clean dense model, no MoE. 42 layers (full attention every 6th), hidden 2560,
intermediate 10240 uniform, 8 query heads / 2 KV heads, dual head_dim 256/512, 18
KV-shared layers (the engine bundle stacks the 24 non-shared layers into ONE unified padded
KV pair), per-layer embeddings (the [262144, 10752] int8 table ships in
ios-frontend/gemma4_e4b_qat_gather_raw/), final-logit softcap 30. The QAT checkpoint prunes
the never-used KV projections on the shared layers; the zoo's loader handles both layouts.
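The numbers in this section are tied together by simple relations, sketched below as a small Python config object. The class and field names are hypothetical, not the Hugging Face config keys. The 0-based reading of "every 6th" and the tanh form of the soft-cap (the form used for final-logit soft-capping in earlier Gemma releases) are assumptions.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class E4BTextConfig:
    # Values as listed in the card; field names are illustrative only.
    num_layers: int = 42
    full_attention_every: int = 6
    hidden_size: int = 2560
    intermediate_size: int = 10240
    num_query_heads: int = 8
    num_kv_heads: int = 2
    num_kv_shared_layers: int = 18
    vocab_size: int = 262144
    ple_dim: int = 256
    final_logit_softcap: float = 30.0

    @property
    def num_kv_owning_layers(self):
        """Layers that keep their own KV cache (stacked into one KV pair)."""
        return self.num_layers - self.num_kv_shared_layers

    @property
    def ple_row_width(self):
        """Width of one per-layer-embedding table row."""
        return self.num_layers * self.ple_dim

    def full_attention_layers(self):
        # Assumption: "every 6th" means 0-based layers 5, 11, ..., 41.
        n = self.full_attention_every
        return [i for i in range(self.num_layers) if (i + 1) % n == 0]

def softcap(logit, cap):
    """Tanh soft-capping: squashes a logit smoothly into [-cap, cap]."""
    return cap * math.tanh(logit / cap)

cfg = E4BTextConfig()
print(cfg.num_kv_owning_layers, cfg.ple_row_width, cfg.full_attention_layers())
print(softcap(100.0, cfg.final_logit_softcap))
```

The derived values line up with the card: 42 - 18 gives the 24 non-shared layers, and 42 * 256 gives the 10752-wide table row.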
Run contract (each item is load-bearing)
Full story + traps: pipelined-engine page.
- Swift stack = apple/coreai-models + the zoo's patch stack (apps/*.patch, in order). The provider bundle needs EngineOptions.perTokenInputProvider (coreai-pipelined-per-token-inputs.patch); the tbl bundle needs EngineOptions.staticInputBuffers (coreai-pipelined-static-inputs.patch).
- Provider mode: per token, fill ple_tokens [1,1,42,256] fp16 from the table dump: row = i8[id] * scale[id] * sqrt(256), mmap-gathered (~0.1 ms). tbl mode: bind ple_table to embed_per_layer.i8 and ple_scale to embed_per_layer.scale.f32 as OWNED storageModeShared MTLBuffers (buffer-backing traps in the knowledge page).
- COREAI_CHUNK_THRESHOLD=1 before engine creation; never call engine.warmup() (S=1 graph; a 1-token generate after load is the warmup).
- iPhone: AOT is mandatory (the 3.7 GB-constants graph crashes the on-device specializer). Use the precompiled _aotc_h18p/ bundle, or xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu --architecture h18p --expect-frequent-reshapes and point metadata.json's assets.main at the .aimodelc. Ship the com.apple.developer.kernel.increased-memory-limit entitlement as headroom insurance, and bench a settled device (a just-unlocked iPhone under-reads ~35%).
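The provider-mode formula, row = i8[id] * scale[id] * sqrt(256), can be sketched as a reference gather in Python. This is not the Swift provider from the patch stack. The on-disk layout (row-major int8 table plus a little-endian float32 scale per token id), the file names, and the tiny 4-token demo table are assumptions made for illustration.

```python
import math
import mmap
import os
import struct
import tempfile

N_LAYERS, PLE_DIM = 42, 256       # ple_tokens is [1, 1, 42, 256] per the card
ROW = N_LAYERS * PLE_DIM          # 10752 int8 values per token id
GAIN = math.sqrt(PLE_DIM)         # sqrt(256) = 16.0

def gather_ple_row(table, scales, token_id):
    """Return the [42][256] per-layer embedding for one token id.

    table:  buffer of int8, row-major [vocab, ROW]    (layout is an assumption)
    scales: buffer of little-endian float32 [vocab]   (layout is an assumption)
    """
    raw = table[token_id * ROW:(token_id + 1) * ROW]
    codes = struct.unpack(f"{ROW}b", raw)
    (scale,) = struct.unpack_from("<f", scales, token_id * 4)
    # Round-trip through IEEE half ("e") to mimic the fp16 input tensor.
    halves = struct.pack(f"<{ROW}e", *(c * scale * GAIN for c in codes))
    flat = struct.unpack(f"<{ROW}e", halves)
    return [list(flat[l * PLE_DIM:(l + 1) * PLE_DIM]) for l in range(N_LAYERS)]

def demo():
    """Write a tiny 4-token table to disk, mmap it, and gather token 2."""
    vocab = 4
    with tempfile.TemporaryDirectory() as tmp:
        table_path = os.path.join(tmp, "table.i8")
        scale_path = os.path.join(tmp, "scale.f32")
        with open(table_path, "wb") as f:
            for t in range(vocab):
                row = (((t + i) % 256) - 128 for i in range(ROW))
                f.write(struct.pack(f"{ROW}b", *row))
        with open(scale_path, "wb") as f:
            f.write(struct.pack(f"<{vocab}f", *([2.0 ** -7] * vocab)))
        with open(table_path, "rb") as tf, open(scale_path, "rb") as sf:
            table = mmap.mmap(tf.fileno(), 0, access=mmap.ACCESS_READ)
            scales = mmap.mmap(sf.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                return gather_ple_row(table, scales, token_id=2)
            finally:
                table.close()
                scales.close()

ple = demo()
print(len(ple), len(ple[0]), ple[0][:4])
```

Because only one 10752-byte row is touched per token, the mmap stays clean and the OS can evict it freely. This is consistent with the small measured footprint on iPhone.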
Reproduce from scratch (oracle + tables are checkpoint-derived, so regenerate them for any new
weights): conversion/export_gemma4_decode_pipelined.py
with --hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized.
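After regenerating the oracle, the check it supports is a plain token-by-token comparison. The sketch below assumes that "8/8 exact" means eight prompts whose greedy token ids match the fp32 reference exactly. The card does not define the oracle's file format, so the function names and the in-memory lists of token ids are hypothetical.

```python
def first_mismatch(reference, candidate):
    """Index of the first differing token id, or None if identical."""
    for i, (r, c) in enumerate(zip(reference, candidate)):
        if r != c:
            return i
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))  # one run is a strict prefix
    return None

def oracle_score(reference_runs, candidate_runs):
    """Count prompts whose greedy token ids match the reference exactly."""
    assert len(reference_runs) == len(candidate_runs)
    exact = sum(first_mismatch(r, c) is None
                for r, c in zip(reference_runs, candidate_runs))
    return exact, len(reference_runs)

# Made-up token ids, for illustration only.
reference = [[2, 106, 1645], [2, 9259, 235]]
candidate = [[2, 106, 1645], [2, 9259, 236]]
exact, total = oracle_score(reference, candidate)
print(f"{exact}/{total} exact; run 1 diverges at index",
      first_mismatch(reference[1], candidate[1]))
```

Reporting the first divergent index, not just pass or fail, helps tell an early numerical drift from a late tie-break flip under greedy decoding.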
License
Gemma is provided under and subject to the Gemma Terms of Use
(https://ai.google.dev/gemma/terms). These .aimodel bundles are Model Derivatives of
google/gemma-4-E4B-it-qat-q4_0-unquantized;
by downloading or using them you agree to those terms, including the
Gemma Prohibited Use Policy.
Sibling repo (E2B, incl. its own official-QAT bundles): gemma-4-E2B-CoreAI.
Model tree for mlboydaisuke/gemma-4-E4B-CoreAI
Base model
google/gemma-4-E4B