Gemma 4 31B (dense) - Core AI
Apple Core AI (.aimodel) conversion of Google's Gemma 4 31B dense text decoder, ported
directly from the QAT release
google/gemma-4-31B-it-qat-q4_0-unquantized.
Decode-only, runs on the stock pipelined engine on Apple Silicon (Mac-class, ~16 GB).
Frontier dense, unblocked by a custom Metal kernel. Gemma 4 31B's full (global) attention layers have a 32-head × 512 Q tensor that overflows MPSGraph's GPU decode scratch heap, so the stock SDPA crashes at the first token (apple/coreai-models#27, the same bug as the 12B). This bundle ships a custom flash-decode SDPA kernel on the full layers (block-GQA over the 31B's 4 global KV heads) that removes the offending op, so the model runs.
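To make those shapes concrete, here is a minimal Swift sketch of the head arithmetic on one full (global) layer. The numbers come from this card; the names `FullLayerShape` and `kvHead(forQueryHead:)` are illustrative and not part of any Core AI API, and the contiguous head-to-group mapping is the usual GQA convention, assumed here rather than confirmed for this kernel.

```swift
// Illustrative only: shape arithmetic for one full (global) attention layer.
struct FullLayerShape {
    let queryHeads = 32   // from the card
    let headDim = 512     // global_head_dim, from the card
    let kvHeads = 4       // num_global_key_value_heads, from the card

    // Elements in the per-token Q tensor: 32 x 512.
    var queryElements: Int { queryHeads * headDim }

    // Query heads that share one KV head (the GQA group size).
    var groupSize: Int { queryHeads / kvHeads }

    // Assumed contiguous grouping: heads 0..<8 -> KV 0, 8..<16 -> KV 1, ...
    func kvHead(forQueryHead q: Int) -> Int { q / groupSize }
}

let shape = FullLayerShape()
print(shape.queryElements)             // 16384
print(shape.groupSize)                 // 8
print(shape.kvHead(forQueryHead: 13))  // 1
```

With 8 query heads per KV head, a block-GQA kernel can load each KV block once and reuse it for the whole group.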
Use it
▶️ Run it (source): the ChatDemo runner (GUI + CLI, one app for every chat model in the catalog):
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
# Run, then pick "Gemma 4 31B" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/ChatDemo
swift run chat-cli --model gemma-4-31b --prompt "What can you do, offline?"
💻 Build with it (complete; the glue is kit API, so copy-paste runs):
import CoreAIKit
let chat = try await ChatSession(catalog: "gemma-4-31b")
let reply = try await chat.respond(to: prompt)
// reply: the answer, generated fully on-device
The take-home is Examples/ChatDemo/Sources/QuickStart.swift:
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same ChatSession across turns for its transcript.
Multi-turn? Hold the ChatSession and call respond(to:) per turn; it keeps the
conversation history. streamResponse(to:) yields tokens as they decode.
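A two-turn sketch of that pattern, using only the calls named above. It makes two assumptions that are not confirmed by this card: that `streamResponse(to:)` returns an `AsyncSequence` of text chunks that can be iterated with `for try await`, and that the reply from `respond(to:)` is printable as text. The function name `twoTurnDemo` and the prompts are made up for illustration; check the kit source for the real signatures.

```swift
import CoreAIKit

// Hypothetical helper; only ChatSession(catalog:), respond(to:) and
// streamResponse(to:) are taken from the card.
func twoTurnDemo() async throws {
    // One session for the whole conversation, so history carries over.
    let chat = try await ChatSession(catalog: "gemma-4-31b")

    // Turn 1: blocking call, returns the full reply.
    let first = try await chat.respond(to: "Name three uses for an offline LLM.")
    print(first)

    // Turn 2: same session, so "the second one" can refer back to turn 1.
    // Assumption: the stream is an AsyncSequence of text chunks.
    for try await chunk in chat.streamResponse(to: "Expand on the second one.") {
        print(chunk, terminator: "")
    }
    print()
}
```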
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit, product CoreAIKit
- Info.plist: none needed
- Entitlements: none needed (macOS)
- First run downloads the model (18.0 GB, Mac); after that it loads from the
  local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
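For the SPM item, a manifest might look like the sketch below. The repository URL and product name come from the checklist; the app name `MyChatApp`, the `branch: "main"` requirement, and the package identity `coreai-kit` (derived from the URL's last path component) are assumptions to adjust against the actual repository.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyChatApp",  // hypothetical app name
    // Add a `platforms:` entry matching the kit's minimum deployment target.
    dependencies: [
        // Assumption: tracking `main`; pin a tagged version if the repo publishes one.
        .package(url: "https://github.com/john-rocky/coreai-kit", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyChatApp",
            dependencies: [
                .product(name: "CoreAIKit", package: "coreai-kit"),
            ]
        ),
    ]
)
```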
Bundle (gpu-pipelined/)
| bundle | quant | size | decode (M4 Max) |
|---|---|---|---|
| gemma4_31b_qat_decode_int4linsym_msdpa_g8 | int4 (q4_0-aligned absmax) | 19 GB | 17.2 tok/s (prefill 22.1) |
int4 from Google's QAT checkpoint (q4_0 grid). A frontier 31B at int4 is bandwidth-bound, so decode
is in the MLX-parity range; the value is "Core AI runs a frontier dense model the stock engine
cannot." Mac-only (exceeds the iPhone memory budget). The _g8 suffix is the higher-occupancy
flash-decode kernel (8 SIMD-groups per head split the global layers' KV scan; same numerics).
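A back-of-the-envelope check of the bandwidth-bound claim, using only the table's numbers. It assumes each decoded token reads the whole 19 GB of weights once and ignores KV-cache and activation traffic, so the result is a rough estimate of weight traffic, not a measurement.

```swift
// Rough estimate: a dense decoder streams all of its weights once per decoded token.
let bundleGB = 19.0          // bundle size from the table
let decodeTokPerSec = 17.2   // decode rate from the table (M4 Max)

// Implied weight traffic in GB/s (KV cache and activations ignored).
let impliedGBPerSec = bundleGB * decodeTokPerSec
print(impliedGBPerSec.rounded())  // 327.0
```

Roughly 327 GB/s of weight reads is a large share of what a Mac-class memory system can sustain, which is why a faster kernel alone cannot raise decode speed much further.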
Architecture
Clean dense gemma4 text decoder: no PLE / AltUp / Laurel / MoE / KV-sharing. 60 layers,
hidden 5376, 32 heads, vocab 262144, softcap 30, tied embeddings. 5:1 sliding:full; dual head_dim
(sliding 256 / full global_head_dim 512); full layers use num_global_key_value_heads 4 with
attention_k_eq_v (value = raw k_proj). Both attention shapes ride one growing KV pair, so the
bundle loads on the stock CoreAIPipelinedEngine (2 states, no engine patch); the full layers' SDPA
runs as a custom Metal flash-decode kernel.
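For readers unfamiliar with flash-decode, the idea can be sketched on the CPU: split the KV scan for one query into slices, run an online softmax over each slice, then merge the partial results exactly. The Swift below is a reference sketch of that general technique, not the bundle's Metal kernel. All names are hypothetical, the 1/sqrt(d) score scale is the textbook choice and an assumption here, and it handles one query head (in block-GQA, the query heads of a group would share the same keys and values).

```swift
import Foundation

// Partial result of scanning one slice of the KV cache for one query head.
struct Partial {
    var maxScore = -Double.infinity  // running max of scaled scores
    var denom = 0.0                  // sum of exp(score - maxScore)
    var acc: [Double]                // sum of exp(score - maxScore) * value
}

func dot(_ a: [Double], _ b: [Double]) -> Double {
    zip(a, b).reduce(0.0) { $0 + $1.0 * $1.1 }
}

// Online-softmax scan over the keys/values in `range`.
func scan(q: [Double], keys: [[Double]], values: [[Double]],
          range: Range<Int>, scale: Double) -> Partial {
    var p = Partial(acc: [Double](repeating: 0, count: values[0].count))
    for t in range {
        let s = dot(q, keys[t]) * scale
        let newMax = max(p.maxScore, s)
        let rescale = exp(p.maxScore - newMax)  // exp(-inf) == 0 on the first step
        let w = exp(s - newMax)
        p.denom = p.denom * rescale + w
        for i in p.acc.indices { p.acc[i] = p.acc[i] * rescale + w * values[t][i] }
        p.maxScore = newMax
    }
    return p
}

// Split the KV scan into at most `splits` slices, then merge the partials.
func flashDecode(q: [Double], keys: [[Double]], values: [[Double]],
                 splits: Int) -> [Double] {
    let n = keys.count
    let scale = 1.0 / Double(q.count).squareRoot()  // assumed textbook scale
    let chunk = (n + splits - 1) / splits
    let partials = stride(from: 0, to: n, by: chunk).map {
        scan(q: q, keys: keys, values: values, range: $0..<min($0 + chunk, n), scale: scale)
    }
    let globalMax = partials.map(\.maxScore).max()!
    var denom = 0.0
    var out = [Double](repeating: 0, count: values[0].count)
    for p in partials {
        let w = exp(p.maxScore - globalMax)  // bring each slice onto a common max
        denom += p.denom * w
        for i in out.indices { out[i] += p.acc[i] * w }
    }
    return out.map { $0 / denom }
}

// Plain softmax attention for comparison.
func naiveAttention(q: [Double], keys: [[Double]], values: [[Double]]) -> [Double] {
    let scale = 1.0 / Double(q.count).squareRoot()
    let scores = keys.map { dot(q, $0) * scale }
    let m = scores.max()!
    let weights = scores.map { exp($0 - m) }
    let z = weights.reduce(0.0, +)
    var out = [Double](repeating: 0, count: values[0].count)
    for (w, v) in zip(weights, values) {
        for i in out.indices { out[i] += w * v[i] / z }
    }
    return out
}

// Tiny made-up example: 5 cached positions, head_dim 2.
let q = [1.0, -0.5]
let keys: [[Double]] = [[0.2, 0.1], [1.5, -0.3], [-0.7, 0.9], [0.0, 2.0], [3.0, 1.0]]
let values: [[Double]] = [[1, 0], [0, 1], [2, 2], [-1, 3], [0.5, 0.5]]
let split = flashDecode(q: q, keys: keys, values: values, splits: 8)
let reference = naiveAttention(q: q, keys: keys, values: values)
let maxDiff = zip(split, reference).map { abs($0 - $1) }.max()!
print(maxDiff)  // tiny: the two paths differ only by floating-point rounding
```

Because the merge rescales every slice to a common maximum, the split count changes only the order of floating-point operations, not the mathematical result.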
Usage
huggingface-cli download mlboydaisuke/Gemma-4-31B-CoreAI \
--include "gpu-pipelined/gemma4_31b_qat_decode_int4linsym_msdpa_g8/*" \
--local-dir ./gemma4-31b-coreai
COREAI_CHUNK_THRESHOLD=1 llm-runner \
--model ./gemma4-31b-coreai/gpu-pipelined/gemma4_31b_qat_decode_int4linsym_msdpa_g8 \
--prompt "What is the capital of France?" --max-tokens 64 --chunk-size 1
Conversion
Community zoo:
github.com/john-rocky/coreai-model-zoo (zoo/gemma4-31b.md).
License
Gemma: governed by the Gemma Terms of Use.
Model tree for mlboydaisuke/Gemma-4-31B-CoreAI
Base model
google/gemma-4-31B