MiMo-V2.5 — text-only, packaged for MLX

This is a convenience repackaging of XiaomiMiMo/MiMo-V2.5 for inference on Apple Silicon via MLX.

Every individual weight value is bit-identical to the upstream release. This repository repacks the 32 source safetensors shards into a single file, stacks per-expert MoE weights along a new leading axis (required by MLX MoE loaders), and excludes multimodal weights (vision, audio, MTP heads) since this targets the text-only inference path.

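Validated on an M3 Ultra (512 GB): 30.0 tok/s decode (warm) with the optional SDPA patch from step 5, 29.0 tok/s without it, and ~535 tok/s prefill at L=2048. Apart from that optional patch, everything runs on stock upstream MLX (no fork) plus the mlx-block-fp8 extension.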

Setup

MiMo-V2.5 uses the block_fp8 quantization scheme (DeepSeek-V3-style: unpacked E4M3 codes + 2D FP32 scales), which is not built into MLX. The kernels for it ship as a standalone MLX extension that builds against stock upstream MLX — no MLX fork required.
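
To make the storage layout concrete, here is a minimal pure-Python sketch of dequantizing one block_fp8 weight matrix. It is not the extension's kernel. It assumes the OCP "E4M3FN" encoding (bias 7, no infinities, maximum 448), that each FP32 scale multiplies its block of decoded codes, and a 128x128 block size as in DeepSeek-V3; check the checkpoint's config before relying on any of these. The function names are illustrative only.

```python
import math


def decode_e4m3(code: int) -> float:
    """Decode one 8-bit E4M3FN code (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 3) & 0x0F
    man = code & 0x07
    if exp == 0x0F and man == 0x07:
        return math.nan  # the only NaN pattern; E4M3FN has no infinities
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -6.
        return sign * math.ldexp(man / 8.0, -6)
    # Normal: implicit leading 1, exponent bias 7.
    return sign * math.ldexp(1.0 + man / 8.0, exp - 7)


def dequantize_block_fp8(codes, scales, block=128):
    """Multiply each decoded code by the scale of the 2D block it falls in.

    codes  : list of rows of ints (0..255), shape [rows][cols]
    scales : list of rows of floats, shape [ceil(rows/block)][ceil(cols/block)]
    block  : assumed block edge length (128 is an assumption, not from the source)
    """
    return [
        [decode_e4m3(c) * scales[i // block][j // block] for j, c in enumerate(row)]
        for i, row in enumerate(codes)
    ]
```

For example, `decode_e4m3(0x38)` is 1.0 and `decode_e4m3(0x7E)` is 448.0, the largest finite E4M3FN value.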

1. MLX (built shared)

Build MLX from upstream with BUILD_SHARED_LIBS=ON. The shared build is required: with a static build, the extension gets a duplicate copy of the MLX runtime and silently produces zeros.

git clone https://github.com/ml-explore/mlx.git
cd mlx
CMAKE_ARGS="-DBUILD_SHARED_LIBS=ON" \
  python3 -m pip install -e . --no-build-isolation

2. The `mlx-block-fp8` extension

Install the extension that provides the block_fp8 matmul and MoE kernels:

git clone https://github.com/yohann-bearzi/mlx-block-fp8.git
cd mlx-block-fp8
python3 -m pip install -e . --no-build-isolation

3. `mlx_lm` with `mimo_v2` base support

Base MiMo-V2 model support lives in ml-explore/mlx-lm#1219 until it is merged upstream. Install mlx_lm from that branch, or copy mimo_v2.py and switch_layers.py from it into your mlx_lm/models/ directory.

4. The block_fp8 model class (shipped in this repo)

Copy mimo_v2_block_fp8.py into your mlx_lm models directory:

cp mimo_v2_block_fp8.py $(python3 -c 'import mlx_lm, os; print(os.path.dirname(mlx_lm.__file__))')/models/
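
Before loading the weights, it can help to confirm that the files from steps 1, 3, and 4 are visible to Python. The sketch below only checks that the modules can be located. It assumes the module names follow from the file names mentioned above, and it leaves out the extension from step 2 because its import name is not stated here.

```python
import importlib.util

# Module names inferred from the file names in the setup steps (an assumption).
REQUIRED = [
    "mlx.core",
    "mlx_lm.models.mimo_v2",
    "mlx_lm.models.switch_layers",
    "mlx_lm.models.mimo_v2_block_fp8",
]


def module_available(name: str) -> bool:
    """Return True if `name` can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec imports parent packages; a missing parent lands here.
        return False


def missing_modules(names=REQUIRED):
    """List the modules from `names` that cannot be found."""
    return [n for n in names if not module_available(n)]


if __name__ == "__main__":
    for name in missing_modules():
        print(f"missing: {name}")
```

An empty result means the Python files are in place. It does not prove that MLX was built shared or that the kernels produce correct output.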

5. (Optional, +~3%) the SDPA fix

For full decode throughput, apply ml-explore/mlx#3637, which enables MLX’s fused attention kernel for MiMo’s asymmetric Q/V head dims (192/128). Without it, attention uses a slower compiled-graph fallback.

| Setup | Decode tok/s (M3 Ultra, warm) |
| --- | --- |
| stock MLX + extension | 29.0 |
| stock MLX + extension + SDPA #3637 | 30.0 |
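
The arithmetic behind these figures can be sketched as follows. The helper names are made up for illustration. The latency estimate assumes prefill and decode throughput stay constant regardless of context length, which is a simplification: the ~535 tok/s prefill figure was only reported at L=2048.

```python
def decode_speedup_pct(base_tps: float, patched_tps: float) -> float:
    """Relative decode speedup in percent, e.g. from 29.0 to 30.0 tok/s."""
    return (patched_tps / base_tps - 1.0) * 100.0


def estimate_seconds(
    prompt_tokens: int,
    new_tokens: int,
    prefill_tps: float = 535.0,  # reported at L=2048 only
    decode_tps: float = 30.0,    # warm decode with the SDPA patch
) -> float:
    """Rough end-to-end time: prefill the prompt, then decode new tokens."""
    return prompt_tokens / prefill_tps + new_tokens / decode_tps


if __name__ == "__main__":
    print(f"SDPA patch speedup: {decode_speedup_pct(29.0, 30.0):.1f}%")
    print(f"2048-token prompt + 300 new tokens: {estimate_seconds(2048, 300):.1f} s")
```

The speedup works out to roughly 3.4%, which matches the "+~3%" label on this step.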

Usage

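import gc
import json

import mlx.core as mx
import mlx_lm.utils as U

# Raise the wired-memory limit to 300 GiB (sized for an M3 Ultra 512 GB).
mx.set_wired_limit(300 * 1024**3)

# Load the config and select the block_fp8 model class.
with open("config.json") as f:
    cfg = json.load(f)
cfg["model_type"] = "mimo_v2_block_fp8"
mc, ac = U._get_classes(cfg)
m = mc(ac.from_dict(cfg))

# Load the repacked weights and attach them to the model.
w = mx.load("mimo_v2.5_block_fp8.safetensors")
w = m.sanitize_block_fp8(w)
m.apply_block_fp8(w)

# Drop the temporary dict, then force evaluation of the parameters.
del w
gc.collect()
mx.eval(m.parameters())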
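
The wired limit above is 300 GiB, while the weights file is listed at 290 GB. A small sketch of how the two relate, assuming the 290 GB figure means decimal gigabytes (this document does not say which unit is meant). The helper names are illustrative, and the headroom rule is a heuristic, not something this repo or MLX prescribes.

```python
GB = 10**9      # decimal gigabyte
GIB = 1024**3   # binary gibibyte


def headroom_bytes(wired_limit: int, weights_bytes: int) -> int:
    """Bytes left under the wired limit once the weights are resident."""
    return wired_limit - weights_bytes


def wired_limit_for(weights_bytes: int, headroom_gib: float = 10.0) -> int:
    """Heuristic: weights size plus a fixed headroom for caches and activations."""
    return int(weights_bytes + headroom_gib * GIB)


if __name__ == "__main__":
    limit = 300 * GIB         # value used in the usage snippet
    weights = 290 * GB        # assumed decimal; use os.path.getsize(...) for the real size
    print(f"limit    : {limit / GB:.1f} GB")
    print(f"headroom : {headroom_bytes(limit, weights) / GB:.1f} GB")
```

On a machine with less memory, the limit would need to shrink accordingly; whether the model still fits is a separate question this sketch does not answer.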

Reproducing from source

If you’d rather convert from XiaomiMiMo’s release yourself, this repo ships the converter script. Download Xiaomi’s shards, then:

python3 convert_mimo.py --src /path/to/XiaomiMiMo/MiMo-V2.5 \
                        --out /path/to/output/mimo_v2.5_block_fp8.safetensors

The converter does two things only:

- Concatenates the 32 source safetensors shards into a single file.
- Stacks the 256 per-expert MoE weights per layer into one tensor with shape [256, …] per projection (required by MLX MoE loaders).

No quantization, permutation, scale manipulation, or padding is applied. Conversion takes ~10 minutes on a fast SSD.
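
The stacking step can be illustrated with a short sketch. This is not the code in convert_mimo.py: it uses NumPy in place of MLX so that it runs anywhere, and the key pattern `...experts.<index>.<rest>` is a hypothetical naming scheme that the real checkpoint may not follow.

```python
import re

import numpy as np

# Hypothetical key layout: "<prefix>.experts.<index>.<rest>".
EXPERT_KEY = re.compile(r"^(?P<prefix>.+\.experts)\.(?P<idx>\d+)\.(?P<rest>.+)$")


def stack_experts(tensors: dict) -> dict:
    """Stack per-expert tensors along a new leading axis; pass others through."""
    out = {}
    groups = {}
    for name, arr in tensors.items():
        m = EXPERT_KEY.match(name)
        if m is None:
            out[name] = arr  # non-expert tensor, copied unchanged
            continue
        key = f"{m['prefix']}.{m['rest']}"
        groups.setdefault(key, {})[int(m["idx"])] = arr
    for key, by_idx in groups.items():
        # Assumes expert indices are contiguous from 0; index e on axis 0 is expert e.
        out[key] = np.stack([by_idx[i] for i in range(len(by_idx))], axis=0)
    return out
```

Because stacking only copies bytes into a larger array, each slice `stacked[e]` stays byte-for-byte equal to the original tensor for expert `e`, which is what makes a bit-identical check against upstream possible.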

What’s in this repo

| File | Contents |
| --- | --- |
| `mimo_v2.5_block_fp8.safetensors` | 290 GB of repacked weights (bit-identical to upstream) |
| `config.json` | upstream config with `model_type` set to `mimo_v2_block_fp8` |
| `tokenizer.json`, `tokenizer_config.json` | upstream, verbatim |
| `generation_config.json` | upstream, verbatim |
| `mimo_v2_block_fp8.py` | MLX model class (drop into `mlx_lm/models/`) |
| `convert_mimo.py` | converter script; reproduces the weights file from XiaomiMiMo’s release |
| `LICENSE` | MIT (matches upstream) |
| `NOTICE` | derivative-work statement |
