MiMo-V2.5: text-only, packaged for MLX
This is a convenience repackaging of XiaomiMiMo/MiMo-V2.5
for inference on Apple Silicon via MLX.
Every individual weight value is bit-identical to the upstream release. This repository repacks the 32 source safetensors shards into a single file, stacks per-expert MoE weights along a new leading axis (required by MLX MoE loaders), and excludes the vision, audio, and MTP-head weights, since this package targets the text-only inference path.
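Validated on M3 Ultra 512 GB: 30.0 tok/s decode (warm) and ~535 tok/s prefill
at L=2048, running on stock upstream MLX (no fork) plus the
mlx-block-fp8 extension. The 30.0 tok/s figure includes the optional SDPA fix
from step 5; without it, decode runs at 29.0 tok/s.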
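As a rough reading of these figures, the sketch below estimates end-to-end latency for a request by treating prefill and decode as constant-rate phases. This is a simplification: the prefill rate was measured at L=2048, and decode throughput generally falls as context grows. The function name and the 256-token output length are illustrative assumptions.

```python
def estimate_latency_s(prompt_tokens, new_tokens,
                       prefill_tps=535.0, decode_tps=30.0):
    """Estimate (time to first token, total time) in seconds.

    Assumes constant prefill and decode rates, which is only an
    approximation; the defaults are the M3 Ultra figures quoted above.
    """
    ttft = prompt_tokens / prefill_tps
    total = ttft + new_tokens / decode_tps
    return ttft, total


ttft, total = estimate_latency_s(prompt_tokens=2048, new_tokens=256)
print(f"time to first token: {ttft:.2f} s, total: {total:.2f} s")
```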
Setup
MiMo-V2.5 uses the block_fp8 quantization scheme (DeepSeek-V3-style: unpacked
E4M3 codes + 2D FP32 scales), which is not built into MLX. The kernels for it
ship as a standalone MLX extension that builds against stock upstream MLX, so no
MLX fork is required.
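To make the format concrete, here is a minimal pure-Python sketch of block-wise FP8 dequantization: each weight is stored as a one-byte E4M3 code, and every tile of the weight matrix shares one FP32 scale that the decoded value is multiplied by. It assumes the `e4m3fn` variant (bias 7, no infinities) and takes the tile size as a parameter. DeepSeek-V3 uses 128×128 tiles, but the tile size for this checkpoint is not stated here, so treat the default as an assumption. The function names are illustrative and are not part of mlx-block-fp8, whose kernels run on the GPU.

```python
import math


def e4m3_to_float(code: int) -> float:
    """Decode one FP8 E4M3 byte (assumed e4m3fn: bias 7, no infinities)."""
    sign = -1.0 if code & 0x80 else 1.0
    exponent = (code >> 3) & 0xF
    mantissa = code & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan  # the only NaN encoding in e4m3fn
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 2**-6.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_block_fp8(codes, scales, block=128):
    """Dequantize a 2D grid of E4M3 codes using one scale per block x block tile."""
    return [
        [e4m3_to_float(c) * scales[i // block][j // block]
         for j, c in enumerate(row)]
        for i, row in enumerate(codes)
    ]


# Tiny example with 2x2 tiles so the indexing is easy to follow.
codes = [[0x38, 0x38, 0x40, 0x40],
         [0x38, 0x38, 0x40, 0x40]]   # 0x38 decodes to 1.0, 0x40 to 2.0
scales = [[0.5, 3.0]]                # one scale per 2x2 tile
weights = dequantize_block_fp8(codes, scales, block=2)
```

Because the codes are stored unpacked (one byte per weight) and the scales form a separate 2D grid, a loader only needs the tile index `(i // block, j // block)` to find the scale for any element.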
1. MLX (built shared)
Build MLX from upstream with BUILD_SHARED_LIBS=ON (required: a static build
causes the extension to get a duplicate runtime and silently produce zeros):
git clone https://github.com/ml-explore/mlx.git
cd mlx
CMAKE_ARGS="-DBUILD_SHARED_LIBS=ON" \
python3 -m pip install -e . --no-build-isolation
2. The mlx-block-fp8 extension
The block_fp8 matmul + MoE kernels:
git clone https://github.com/yohann-bearzi/mlx-block-fp8.git
cd mlx-block-fp8
python3 -m pip install -e . --no-build-isolation
3. mlx_lm with mimo_v2 base support
Until it is merged upstream, MiMo-V2 base model support is available from
ml-explore/mlx-lm#1219.
Install mlx_lm from that branch, or copy mimo_v2.py and switch_layers.py
from it into your mlx_lm/models/ directory.
4. The block_fp8 model class (shipped in this repo)
Copy mimo_v2_block_fp8.py into your mlx_lm models directory:
cp mimo_v2_block_fp8.py "$(python3 -c 'import mlx_lm, os; print(os.path.dirname(mlx_lm.__file__))')/models/"
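The same step can also be done from Python, which sidesteps shell quoting when the install path contains spaces. This is a sketch: `package_models_dir` and `install_model_file` are hypothetical helper names, and it assumes `mlx_lm` is installed as a regular package in the active environment.

```python
import importlib.util
import shutil
from pathlib import Path


def package_models_dir(package: str = "mlx_lm") -> Path:
    """Return <package dir>/models for an installed package, without importing it."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package} is not installed in this environment")
    return Path(spec.origin).parent / "models"


def install_model_file(src: str, package: str = "mlx_lm") -> Path:
    """Copy a model definition file into the package's models directory."""
    dest_dir = package_models_dir(package)
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"expected a models directory at {dest_dir}")
    return Path(shutil.copy2(src, dest_dir))
```

Calling `install_model_file("mimo_v2_block_fp8.py")` from the repository directory copies the file and returns its new location.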
5. (Optional, +~3%) the SDPA fix
For full decode throughput, apply ml-explore/mlx#3637, which enables MLX's fused attention kernel for MiMo's asymmetric Q/V head dims (192/128). Without it, attention uses a slower compiled-graph fallback.
| Setup | decode tok/s (M3 Ultra, warm) |
|---|---|
| stock MLX + extension | 29.0 |
| stock MLX + extension + SDPA #3637 | 30.0 |
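The "+~3%" in the step title follows from the two rows above. A quick check, using only the table's numbers:

```python
baseline_tps = 29.0   # stock MLX + extension
patched_tps = 30.0    # with the SDPA fix applied

speedup_pct = (patched_tps - baseline_tps) / baseline_tps * 100
saved_ms_per_token = 1000 / baseline_tps - 1000 / patched_tps

print(f"speedup: {speedup_pct:.1f}%, saved per token: {saved_ms_per_token:.2f} ms")
```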
Usage
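import gc
import json

import mlx.core as mx
import mlx_lm.utils as U

mx.set_wired_limit(300 * 1024**3)  # 300 GiB; sized for an M3 Ultra with 512 GB

# Load the config and select the block_fp8 model class.
with open("config.json") as f:
    cfg = json.load(f)
cfg["model_type"] = "mimo_v2_block_fp8"
mc, ac = U._get_classes(cfg)
m = mc(ac.from_dict(cfg))

# Load the repacked weights and apply them to the model.
w = mx.load("mimo_v2.5_block_fp8.safetensors")
w = m.sanitize_block_fp8(w)
m.apply_block_fp8(w)

# Drop the temporary dict, then force evaluation of the lazy parameters.
del w
gc.collect()
mx.eval(m.parameters())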
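The 300 GiB wired limit above is sized for a 512 GB machine. For a different configuration, one way to pick a value is to start from the size of the weights file and add headroom for the KV cache and activations. The sketch below is only a heuristic: the helper name, the 10 GiB default headroom, and the 75% cap on physical memory are illustrative assumptions, not values taken from MLX or from this repository.

```python
GIB = 1024 ** 3


def suggest_wired_limit(weights_bytes, ram_bytes,
                        headroom_bytes=10 * GIB, max_fraction=0.75):
    """Suggest a wired-memory limit in bytes (heuristic, see assumptions above).

    Returns None when weights plus headroom exceed the allowed share of RAM.
    """
    needed = weights_bytes + headroom_bytes
    cap = int(ram_bytes * max_fraction)
    return needed if needed <= cap else None


# In practice weights_bytes could come from
# os.path.getsize("mimo_v2.5_block_fp8.safetensors"); 290e9 is the size
# listed for the file in this repo, taken here as decimal gigabytes.
limit = suggest_wired_limit(weights_bytes=int(290e9), ram_bytes=512 * GIB)
print(limit, "bytes" if limit is not None else "(does not fit)")
```

A non-None result can be passed to `mx.set_wired_limit` in place of the hard-coded value.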
Reproducing from source
If you'd rather convert from XiaomiMiMo's release yourself, this repo ships the converter script. Download Xiaomi's shards, then:
python3 convert_mimo.py --src /path/to/XiaomiMiMo/MiMo-V2.5 \
--out /path/to/output/mimo_v2.5_block_fp8.safetensors
The converter does two things only:
- Concatenates the 32 source safetensors shards into a single file.
- For each layer, stacks the 256 per-expert MoE weights into one tensor of shape [256, …] per projection (required by MLX MoE loaders).
No quantization, permutation, scale manipulation, or padding is applied. Conversion takes ~10 minutes on a fast SSD.
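The stacking step can be pictured as follows. This is a toy sketch in plain Python, not the logic of convert_mimo.py: the tensor key pattern (`<prefix>.experts.<i>.<proj>.weight`) and the output key name are assumptions chosen for illustration, and nested lists stand in for real tensors.

```python
import re
from collections import defaultdict

# Hypothetical per-expert key layout; the real checkpoint's names may differ.
EXPERT_KEY = re.compile(
    r"^(?P<prefix>.+)\.experts\.(?P<idx>\d+)\.(?P<proj>\w+)\.weight$"
)


def stack_experts(tensors):
    """Group per-expert tensors and stack them along a new leading axis.

    `tensors` maps names to nested lists. Non-expert entries pass through
    unchanged; expert entries are collected in expert-index order, so the
    result for each projection has shape [n_experts, ...].
    """
    out = {}
    groups = defaultdict(dict)
    for name, value in tensors.items():
        m = EXPERT_KEY.match(name)
        if m is None:
            out[name] = value
            continue
        stacked_name = f"{m['prefix']}.experts.{m['proj']}.weight"
        groups[stacked_name][int(m["idx"])] = value
    for stacked_name, by_idx in groups.items():
        # Sort numerically so expert 10 does not land before expert 2.
        out[stacked_name] = [by_idx[i] for i in sorted(by_idx)]
    return out


source = {
    "layers.0.mlp.experts.0.up_proj.weight": [[1, 2], [3, 4]],
    "layers.0.mlp.experts.1.up_proj.weight": [[5, 6], [7, 8]],
    "layers.0.attn.q_proj.weight": [[9]],
}
stacked = stack_experts(source)
```

With an array library the final step is a single stack call along axis 0. Slice `i` of the stacked tensor is expert `i`'s original tensor, which is why the values stay bit-identical.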
What's in this repo
| file | what |
|---|---|
| mimo_v2.5_block_fp8.safetensors | 290 GB; repacked weights (bit-identical to upstream) |
| config.json | upstream config + model_type set to mimo_v2_block_fp8 |
| tokenizer.json, tokenizer_config.json | upstream, verbatim |
| generation_config.json | upstream, verbatim |
| mimo_v2_block_fp8.py | MLX model class (drop into mlx_lm/models/) |
| convert_mimo.py | converter script (reproduces the weights file from XiaomiMiMo's release) |
| LICENSE | MIT (matches upstream) |
| NOTICE | derivative-work statement |
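To check what the weights file contains without loading it into memory, the safetensors header can be read directly: the format begins with an 8-byte little-endian length, followed by a JSON header that maps each tensor name to its dtype, shape, and byte offsets. The sketch below uses only the standard library; the helper names are illustrative.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict.

    Only the header is read, so this stays fast even for very large files.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(header):
    """Count tensors per dtype, e.g. to see FP8 weights next to FP32 scales."""
    counts = {}
    for info in header.values():
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
    return counts
```

For example, `summarize(read_safetensors_header("mimo_v2.5_block_fp8.safetensors"))` reports how many tensors use each dtype, and the header's keys show which tensor names are present.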