Ornith-1.0-9B β€” Core AI (agentic coder, 48–59 tok/s on M4 Max)

Apple Core AI (.aimodel) conversion of deepreinforce-ai/Ornith-1.0-9B (text decoder): DeepReinforce's self-scaffolding agentic-coding model β€” trained to jointly solve coding tasks and construct the orchestration scaffold that guides the solution. Architecturally a Qwen3.5 hybrid decoder (model_type qwen3_5): 32 layers on a 3:1 interleave of GatedDeltaNet linear-attention mixers and gated full attention, untied 248320-vocab head. Runs fully on-device on Apple silicon via Apple's coreai-pipelined GPU engine.

Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo (full card: zoo/ornith-1.0-9b.md).

Bundles

bundle size M4 Max decode / prefill quality
gpu-pipelined/ornith_1_0_9b_decode_int8hu_block32_sym/ (ship) 9.8 GB 48.3 / 48.5 tok/s teacher-forced eager gate 24/24 exact vs fp32 HF oracle; engine greedy 12/12 token-exact
gpu-pipelined/ornith_1_0_9b_decode_int4lin/ (speed option) 7.5 GB 58.9 / 59.0 tok/s (+22%) ALSO gates 24/24 exact + engine 12/12 β€” the first clean int4 PTQ pass in the Qwen3.5 family (verified at short context; int8 carries the quality claim at long context)

Both are decode-only loop-free S=1 LanguageBundles (max_ctx 8192, tokenizer + chat template embedded), gated against a margin-validated fp32 HF oracle (min top-2 margin 0.205; the eager fp16 baseline is also 24/24, so int8/int4 add zero flips).

Recipe: linear int8/int4 per-block-32 body + absmax symmetric per-block-32 int8 on the untied big-vocab head (int8hu). Conversion is the zoo's stock Qwen3.5 script with nothing but an --hf-id swap β€” see conversion/README.md.

Run it (macOS)

Easiest: CoreAIChatMac (the zoo's Mac chat app β€” apps/CoreAIChatMac) downloads this repo in-app: pick Ornith-1.0-9B in the Downloads panel.

CLI (needs the zoo's coreai-models checkout + the pipelined extra-states patch):

COREAI_CHUNK_THRESHOLD=1 llm-benchmark \
    --model ornith_1_0_9b_decode_int8hu_block32_sym -p 128 -g 256 -n 3
COREAI_CHUNK_THRESHOLD=1 llm-runner \
    --model ornith_1_0_9b_decode_int8hu_block32_sym \
    --prompt "Write a rate limiter in Swift." --sampling-strategy greedy \
    --warmup exact --warmup-length 1

Notes: COREAI_CHUNK_THRESHOLD=1 before engine creation; never engine.warmup() on an S=1 bundle; Release builds only. Details + every trap: knowledge/pipelined-engine.md.

iPhone

Not this one (yet): int8 at 9.8 GB exceeds the entitled jetsam ceiling (~6.4 GB on an iPhone 17 Pro). The arithmetic route is int4 body + int8 head (β‰ˆ6.5 GB) or an in-graph int8 embed table (β‰ˆ5.5 GB) β€” tracked in the zoo card.

License

MIT β€” as declared by the upstream model card (deepreinforce-ai/Ornith-1.0-9B; the upstream repo ships no LICENSE file, so none is mirrored here). Conversion scripts and harness: see the zoo repo.

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for mlboydaisuke/Ornith-1.0-9B-CoreAI

Finetuned
(21)
this model