Llama 3.2 1B Instruct (Stateful KV) — CoreML Stateful KV-Cache (ctx≤512)

On-device CoreML .mlpackage converted from meta-llama/Llama-3.2-1B-Instruct with stateful KV cache (ct.StateType) for efficient token-by-token generation.

Designed for the coremlkv: runner in GrioKit. CoreMLTranslationProvider handles stateful generation via makeState() + prediction(from:using:state:).
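
The stateful flow can be sketched in Python (the runner itself is Swift): one prefill call over the whole prompt, then one call per new token while the KV cache carries the history. This is a minimal sketch, not the GrioKit implementation; `step`, `toy_step`, and `greedy_generate` are hypothetical names, and `toy_step` is a stand-in for the real model.

```python
def greedy_generate(step, prompt_ids, stop_ids, max_context=512, max_new_tokens=64):
    """Greedy decode against a stateful model.

    step(token_ids, end_step) is a hypothetical callable that feeds
    token_ids to the model, lets it append to its KV cache, and returns
    one row of logits per input token. end_step is the total number of
    cached positions after the call.
    """
    if not 0 < len(prompt_ids) <= max_context:
        raise ValueError("prompt must contain 1..max_context tokens")
    generated = []
    end_step = len(prompt_ids)
    logits = step(prompt_ids, end_step)        # prefill: whole prompt at once
    while len(generated) < max_new_tokens:
        last = logits[-1]
        next_id = max(range(len(last)), key=last.__getitem__)  # argmax
        if next_id in stop_ids:
            break
        generated.append(next_id)
        if end_step >= max_context:            # cache is full, cannot feed more
            break
        end_step += 1
        logits = step([next_id], end_step)     # decode: one token, cache reused
    return generated


def toy_step(token_ids, end_step, vocab=5):
    """Stand-in model: always predicts (token + 1) % vocab."""
    return [[1.0 if v == (t + 1) % vocab else 0.0 for v in range(vocab)]
            for t in token_ids]


print(greedy_generate(toy_step, [0, 1], stop_ids={4}))  # prints [2, 3]
```

Each decode call passes a single token; the prompt is never re-sent, which is the point of keeping the cache in model state.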

Model details

| Spec | Value |
| --- | --- |
| Base | meta-llama/Llama-3.2-1B-Instruct |
| Precision | FP16 mlprogram |
| Context | ≤512 tokens (prefill + decode combined) |
| Inputs | inputIds Int32 [1, ≤512], causalMask Float16 [1, 1, 1, ≤512] |
| Output | logits Float16 [1, q_len, 128256] |
| States | keyCache, valueCache (Float16, ct.StateType) |
| Min OS | iOS 18 / macOS 15 |
| Compute | .cpuAndGPU recommended; do NOT use .cpuAndNE for this model (ANE produces incorrect tokens; see perf table) |
| Format | .mlpackage + .mlmodelc (pre-compiled; avoids ANE recompilation on first load) |
| Toolchain | coremltools 9.0 + torch 2.7 + transformers 5.8.1 |
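
Because the 512-token context covers prefill and decode together, a caller has to budget new tokens against the prompt length. A minimal sketch of that check follows; `clamp_new_tokens` is a hypothetical helper, and rejecting a prompt that leaves no room at all is a conservative choice made here, not documented runner behavior.

```python
MAX_CONTEXT = 512  # from the Context row: prefill + decode combined


def clamp_new_tokens(prompt_tokens, requested_new_tokens, max_context=MAX_CONTEXT):
    """Return how many tokens may be generated without exceeding the context."""
    remaining = max_context - prompt_tokens
    if prompt_tokens <= 0 or remaining <= 0:
        raise ValueError("prompt leaves no room for generation")
    return min(requested_new_tokens, remaining)


print(clamp_new_tokens(80, 136))   # prints 136
print(clamp_new_tokens(400, 136))  # prints 112
```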

Verified output (greedy, deterministic)

Smoke harness: scripts/coreml/path_b_convert.py --model llama-3.2-1b-stateful --smoke-only against the pre-compiled .mlmodelc on .cpuAndGPU.

Basic capability ("The capital of France is"): "The capital of France is Paris." — 8 decode tokens. Matches PyTorch FP16 reference.

Translation pack (10 cases, system prompt: "You are a French translator. Translate the user message to French. Output only the translation."):

| Input | Output |
| --- | --- |
| The cat is on the table. | Le chat est sur la table. |
| The quick brown fox jumps over the fence. | Le renard rapide saute sur le fermette. (missing "brown"; "fermette" = small farm, not "fence") |
| Such is Life | La vie est une chose (incomplete; Llama 1B weak on idioms; should be C'est la vie) |
| To have a strike of lightning | Pour avoir un coup de tonnerre (thunder, not lightning) |
| Life in pink | La vie en rose (better than Qwen 1.5B: got the article right) |
| This is a dangerous sport. | C'est un sport dangereux. |
| You need sincere focus, and the work is not easy. | Je suis déterminé, et le travail est difficile. (paraphrased to first person) |
| Bill France Sr. was the founder of NASCAR. And, and, you know, I traveled in that garage for 20 plus weekends, for years and years and years and years. | Bill France Sr. était le fondateur de NASCAR. Et, et, on sait, j'ai passé plus de 20 weekends, pendant des années et des années et des années. (faithful: kept "weekends" in English vs Qwen's invented "20 à 30 semaines") |
| Wordsworth stanza 1 (6 lines) | Full stanza in 69 tokens; uses Je me suis senti seul ("I felt lonely") + groupe de fleurs d'or instead of explicit "daffodils" |
| Wordsworth stanzas 1–2 (12 lines) | Both stanzas in 136 tokens; uses daffodilles d'or, lune étoilée ("starry moon") for Milky Way (closer than Qwen's brume) |

Postprocess pack (8 cases, light-touch transcript copy-edit): Llama 1B beats Qwen 1.5B on the headline sight → site diagnostic (case 3: we need to make sure the site is stable before launch — Qwen kept sight). It also preserves fillers more faithfully on case 1 (well, i think we should probably start with the budget review and then move into hiring. — Qwen dropped well, probably and rephrased). Weaker on capitalization (such, life in pink stay lowercase; Qwen capitalized). Net: better at the substantive fix-vs-leave-alone judgment, worse on surface formatting.

Headline takeaway: Llama 3.2 1B is a usable on-device French translator for short literal sentences and a better light-touch copy-editor than Qwen 1.5B. Idiom handling and "translator vs. chatter" instruction-following are weaker than Qwen's, so pick by use case.

Observed performance

Device: Apple M1 Pro, 16 GB, macOS 26.5 Tahoe. Prompt: "The capital of France is" (cold prefill row) and translation-pack averages (decode row).

| Compute units | Cold prefill | Steady decode | Result |
| --- | --- | --- | --- |
| .cpuAndGPU | 3286 ms | 126–162 ms/tok (translation pack, warm) | ✅ recommended; reliable output across all packs |
| .cpuAndNE | 5818 ms | 96 ms/tok | incorrect output: ANE numerical drift produces "Sainte Anne" for "The capital of France is" (expected "Paris."). Llama 1B is NOT ANE-stable in this build, unlike the Qwen 2.5 sibling. Do not use. |
| .all | 13179 ms | 165 ms/tok | ✅ correct output; cold prefill 4× worse than .cpuAndGPU and decode no faster, so no advantage over .cpuAndGPU |
| .cpuOnly | | | ❌ not tested; known BNNS FP16 NaN bug on transformers; do not use |

Translation-pack warm prefill on .cpuAndGPU ranges from 168 to 588 ms, depending on prompt length (30–80 tokens after the chat template). Decode latency is stable at ~130 ms/tok across all 10 translation cases and 8 postprocess cases.
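
For rough capacity planning, end-to-end latency can be approximated as prefill time plus decode time per generated token. The sketch below plugs in the .cpuAndGPU figures quoted above; it assumes the ~130 ms/tok decode figure also holds for the cold run and that the 69-token Wordsworth count refers to generated tokens. `estimate_latency_ms` is a hypothetical helper.

```python
def estimate_latency_ms(prefill_ms, decode_ms_per_token, new_tokens):
    """Approximate wall time: one prefill pass plus one decode step per token."""
    return prefill_ms + decode_ms_per_token * new_tokens


# Cold start, "The capital of France is" (8 decode tokens).
print(estimate_latency_ms(3286, 130, 8))    # prints 4326

# Warm run, Wordsworth stanza 1 (69 tokens), best and worst warm prefill.
print(estimate_latency_ms(168, 130, 69))    # prints 9138
print(estimate_latency_ms(588, 130, 69))    # prints 9558

# Cold prefill ratio of .all to .cpuAndGPU from the table.
print(round(13179 / 3286, 1))               # prints 4.0
```

For outputs longer than a few tokens the decode term dominates, so the compute-unit choice matters mostly through ms/tok once the model is warm.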

Recommendation: ship with .cpuAndGPU for production. The 1.8× ANE decode speedup that helps the Qwen 1.5B sibling does NOT apply here — Llama 1B produces semantically wrong output on ANE in this build, likely due to the FP32 SDPA islands interacting with ANE's Llama-specific op scheduling. If you need ANE-class throughput on a 1–2B Apple-Silicon model, use qwen2.5-1.5b-stateful instead.

Runtime gotchas (please read before integrating)

  1. Llama 3.2 uses <|eot_id|> (id 128009) as the chat-turn terminator in addition to <|end_of_text|> (id 128001). Read generation_config.json for the complete stop list.
  2. Do not load on .cpuAndNE. ANE produces incorrect tokens for this model in this conversion (see perf table). Pin computeUnits: .cpuAndGPU when configuring the runner.
  3. causalMask is a length signal, not the actual causal mask — shape [1, 1, 1, end_step], all-zero float16. The stateful wrapper inside the model reads only end_step from its shape and builds the proper (1, 1, q_len, end_step) lower-triangular mask internally. Do not try to send a "real" causal mask through the public interface — its query dimension is pinned to 1 in the export.
  4. Reset state between independent generations. Call CoreMLTranslationProvider.shared.resetSession() (or model.resetState() on the stateful LanguageModel) between unrelated requests. Stale KV state from a previous sequence corrupts subsequent generations.
  5. .mlmodelc is included alongside .mlpackage. Use the .mlmodelc path for loading — it skips the multi-minute ANE compilation that happens on first .mlpackage load.
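
Gotcha 1 can be sketched as follows: build the stop set from generation_config.json and cut the output at the first stop token. The `eos_token_id` field is the standard Hugging Face key and may hold a single id or a list; `stop_ids_from_config` and `truncate_at_stop` are hypothetical helper names, and the two hard-coded ids are the ones stated above.

```python
DOCUMENTED_STOP_IDS = {128001, 128009}  # <|end_of_text|>, <|eot_id|>


def stop_ids_from_config(generation_config):
    """Merge eos_token_id (int, list, or missing) with the documented ids."""
    eos = generation_config.get("eos_token_id")
    if eos is None:
        ids = set()
    elif isinstance(eos, int):
        ids = {eos}
    else:
        ids = set(eos)
    return ids | DOCUMENTED_STOP_IDS


def truncate_at_stop(token_ids, stop_ids):
    """Drop the first stop token and everything after it."""
    for i, token in enumerate(token_ids):
        if token in stop_ids:
            return token_ids[:i]
    return token_ids


stops = stop_ids_from_config({"eos_token_id": 128009})
print(truncate_at_stop([10, 11, 128009, 12], stops))  # prints [10, 11]
```

In practice the dictionary would come from `json.load` on the bundled generation_config.json rather than from a literal.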

Conversion notes (for the CoreML community)

  1. Use attn_implementation="sdpa". Llama 3.2 export is cleaner with SDPA; eager produces a brittler op graph.
  2. Use torch.export.default_decompositions() for run_decompositions. {} (all decompositions) can SIGSEGV in coremltools at this scale on some macOS / coremltools combinations.
  3. create_causal_mask in transformers ≥ 5 returns 4-D masks unchanged. Sending a [1, 1, 1, end_step] length-signal mask straight into attention layers therefore broadcasts across the query axis and gives BIDIRECTIONAL attention during batch prefill. Llama 3.2 1B (4:1 GQA, 8 KV heads) is robust to this contamination — earlier conversions without the proper internal mask still produced sensible output, which initially masked the bug. The current build's wrapper constructs the correct (1, 1, q_len, end_step) lower-triangular mask internally so every prefill is causally correct; Llama output is now strictly more correct than before, not just "good enough by accident".
  4. FP32 SDPA islands — the wrapper runs attention in FP32 inside an otherwise-FP16 model. For Llama 1B this is a no-op safety net (its activations fit FP16 comfortably) but it does interact with ANE op scheduling and is the most likely cause of the .cpuAndNE divergence noted above. Removing the FP32 island would require Qwen-specific handling — a tradeoff against ANE correctness for Llama. Current build prioritizes correctness across model families over ANE compatibility for Llama specifically.
  5. FP16 greedy decoding is not byte-deterministic across backends. Semantically equivalent to PyTorch FP16 reference for every prompt we have tested, except on .cpuAndNE for this model (see perf table).
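
The mask behaviour described in note 3 can be illustrated in plain Python. This is a sketch of the idea, not the wrapper's code: it assumes an additive mask where 0 means "attend" and negative infinity means "blocked", omits the two leading singleton dimensions, and uses hypothetical function names.

```python
NEG_INF = float("-inf")


def length_signal_mask(end_step):
    """Public input: shape [1, 1, 1, end_step], all zeros. Only its length matters."""
    return [[[[0.0] * end_step]]]


def broadcast_length_signal(q_len, end_step):
    """What the bug did: the zero row is broadcast to every query row,
    so every query can see every key (bidirectional attention)."""
    return [[0.0] * end_step for _ in range(q_len)]


def internal_causal_mask(q_len, end_step):
    """Causal mask of shape (q_len, end_step): query row i sits at absolute
    position end_step - q_len + i and may attend to keys up to that position."""
    if not 0 < q_len <= end_step:
        raise ValueError("need 0 < q_len <= end_step")
    start = end_step - q_len
    return [[0.0 if k <= start + i else NEG_INF for k in range(end_step)]
            for i in range(q_len)]


for row in internal_causal_mask(3, 3):   # batch prefill of 3 tokens
    print(row)
# prints [0.0, -inf, -inf], then [0.0, 0.0, -inf], then [0.0, 0.0, 0.0]

print(internal_causal_mask(1, 5))        # decode step: prints [[0.0, 0.0, 0.0, 0.0, 0.0]]
```

With q_len = 1 the causal mask and the broadcast zero mask coincide, which is why the contamination only appears during batch prefill.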

Usage (Swift — GrioKit coremlkv: runner)

import GrioKit

// Load via CoreMLTranslationProvider using the coremlkv: prefix
let provider = CoreMLTranslationProvider.shared
try await provider.configure(modelID: "coremlkv:llama-3.2-1b-stateful", computeUnits: .cpuAndGPU)

// Translate a sentence (stateful: KV cache persists across tokens)
let result = try await provider.translate(
    "The quick brown fox jumps over the lazy dog.",
    sourceLanguage: "English",
    targetLanguage: "French"
)
print(result) // e.g. "Le rapide renard brun saute par-dessus le chien paresseux."

// Reset KV state between unrelated requests
await provider.resetSession()

Keep tokenizer.json, tokenizer_config.json, config.json, generation_config.json (and chat_template.jinja for Gemma) as siblings of the .mlpackage on disk. All are bundled in this repo.

Reproducibility

Conversion done with coremltools==9.0, torch==2.7.0, transformers==5.8.1 via path_b_convert.py --model llama-3.2-1b-stateful --stateful.

Stateful KV weights are ct.StateType buffers (keyCache, valueCache), shaped (num_layers, 1, num_kv_heads, 512, head_dim). Post-convert quantization (--quantize int8) applies linear_quantize_weights with linear_symmetric mode to the saved .mlpackage; the state buffers are runtime tensors and are unaffected by weight quantization.
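
The state shape above gives the runtime memory held by the KV cache. The arithmetic below assumes 16 layers and head_dim 64, taken from the public Llama 3.2 1B configuration (check the bundled config.json), together with the 8 KV heads and 512-token context stated in this card. `kv_state_bytes` is a hypothetical helper.

```python
def kv_state_bytes(num_layers, num_kv_heads, max_ctx, head_dim, bytes_per_elem=2):
    """Total bytes for keyCache + valueCache, each shaped
    (num_layers, 1, num_kv_heads, max_ctx, head_dim) in FP16."""
    per_cache = num_layers * 1 * num_kv_heads * max_ctx * head_dim * bytes_per_elem
    return 2 * per_cache


total = kv_state_bytes(num_layers=16, num_kv_heads=8, max_ctx=512, head_dim=64)
print(total)                  # prints 16777216
print(total / (1024 * 1024))  # prints 16.0
```

Under those assumptions the two state buffers take about 16 MiB regardless of weight quantization, since they are stored as Float16 runtime tensors.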

License

Llama 3.2 Community License. Weights from meta-llama/Llama-3.2-1B-Instruct by Meta. Re-uploaded as a CoreML port; original model card terms apply.
