Leanstral-2603 MLX 5-bit

MLX 5-bit quantization of mistralai/Leanstral-2603.

  • Architecture: Mistral-Small-4 (119B total / 6.5B active per token), MoE (128 experts, 4 active), MLA attention, YaRN + Llama-4 RoPE scaling, Pixtral vision encoder
  • Quantization: 5-bit affine, group_size=64, 5.585 bits/weight average
    • mlp.gate (router) kept at 8-bit per layer
    • lm_head and the vision tower / multimodal projector kept at full precision
  • Size: ~78 GB
  • Format: MLX safetensors (Apple Silicon)
  • Recommended hardware: Apple Silicon with >= 96 GB unified memory (M3 Ultra, M5 with 128 GB, etc.)
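The bits/weight and size figures above can be sanity-checked with a little arithmetic. The sketch below assumes that affine quantization stores one scale and one bias per group, each as a 16-bit value; that storage detail is an assumption about the format, not something stated in this card. The function name is made up for illustration.

```python
def affine_bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective bits per weight for affine group quantization.

    Assumes each group carries one scale and one bias of `meta_bits` bits each
    (16-bit metadata is an assumption, not taken from the model card).
    """
    return bits + 2 * meta_bits / group_size


# 5-bit weights in groups of 64, and the 8-bit router layers.
base = affine_bits_per_weight(5, 64)
router = affine_bits_per_weight(8, 64)

# Rough on-disk size from the figures quoted in the card.
total_params = 119e9   # total parameter count from the card
avg_bits = 5.585       # average bits/weight from the card
size_bytes = total_params * avg_bits / 8

print(f"5-bit groups: {base} bits/weight, 8-bit groups: {router} bits/weight")
print(f"estimated size: {size_bytes / 1e9:.1f} GB ({size_bytes / 2**30:.1f} GiB)")
```

Under these assumptions the 5-bit layers cost 5.5 bits per weight, and the 8-bit router plus the full-precision lm_head and vision components would account for the gap up to the 5.585 average. The size estimate comes out at about 83 GB in decimal units, or about 77 GiB, which is in the same range as the ~78 GB listed.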

Usage

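from mlx_vlm import load, generate

# Downloads the weights from the Hugging Face Hub on first use (~78 GB).
model, processor = load("mvid/Leanstral-2603-MLX-5bit")

# Text-only prompt; max_tokens leaves room for a long proof.
output = generate(
    model,
    processor,
    "Prove that the sum of two even numbers is even in Lean 4.",
    max_tokens=4096,
)
print(output)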
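For comparison when checking the model's answer to the prompt above, here is one hand-written way to state and prove the claim in core Lean 4 without Mathlib. This is not model output. The theorem name and the choice to spell out evenness as `∃ k, a = 2 * k` are illustrative assumptions; the model may well use Mathlib's `Even` instead.

```lean
-- Evenness is written out as "there exists k with a = 2 * k".
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ m, b = 2 * m) :
    ∃ n, a + b = 2 * n :=
  match ha, hb with
  | ⟨k, hk⟩, ⟨m, hm⟩ =>
    -- The witness is k + m; rewriting turns both sides into 2 * k + 2 * m.
    ⟨k + m, by rw [hk, hm, Nat.mul_add]⟩
```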

Conversion

Produced from the Mistral consolidated format via:

  1. A streaming variant of transformers/models/mistral4/convert_mistral4_weight_to_hf.py that fuses MoE experts layer-by-layer instead of holding the full state dict in RAM, keeping FP8 storage in the HF intermediate (peak memory ~14 GB).
  2. mlx_vlm convert -q --q-bits 5 --q-group-size 64 --dtype bfloat16, with the save step patched to call mx.eval and mx.synchronize per tensor, which avoids macOS Metal command-buffer watchdog timeouts on the 119B model.
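The layer-by-layer idea in step 1 can be sketched as follows. This is not the actual conversion script: the tensor key pattern, the function name, and the `load_tensor` callback are hypothetical, and plain Python lists stand in for tensors. The point is only that keys are indexed up front, which is cheap, while tensor data is loaded and fused one layer at a time.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical key layout; the real checkpoint's tensor names may differ.
EXPERT_KEY = re.compile(r"^layers\.(\d+)\.experts\.(\d+)\.(\w+)\.weight$")


def stream_fused_experts(
    keys: Iterable[str],
    load_tensor: Callable[[str], object],
) -> Iterator[Tuple[str, List[object]]]:
    """Yield (fused_name, [expert_0, expert_1, ...]) one projection at a time."""
    # Index the key names only; no tensor data is loaded here.
    by_layer: Dict[int, Dict[str, Dict[int, str]]] = defaultdict(
        lambda: defaultdict(dict)
    )
    for key in keys:
        m = EXPERT_KEY.match(key)
        if m:
            layer, expert, proj = int(m.group(1)), int(m.group(2)), m.group(3)
            by_layer[layer][proj][expert] = key

    for layer in sorted(by_layer):
        for proj in sorted(by_layer[layer]):
            experts = by_layer[layer][proj]
            # Load just this projection's experts, ordered by expert index.
            stacked = [load_tensor(experts[i]) for i in sorted(experts)]
            yield f"layers.{layer}.experts.{proj}.weight", stacked
            # Drop our reference so the data can be freed once the caller
            # has written it out and released its own reference.
            del stacked


# Tiny stand-in for a checkpoint: 2 layers, 3 experts, one projection each.
fake_store = {
    f"layers.{l}.experts.{e}.w1.weight": [l, e]
    for l in range(2)
    for e in range(3)
}
results = list(stream_fused_experts(fake_store.keys(), fake_store.__getitem__))
for name, stacked in results:
    print(name, stacked)
```

In a real converter the caller would write each fused tensor to disk before requesting the next one, so peak memory is bounded by a single layer's experts rather than the whole state dict.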

License: Apache 2.0 (inherited from the base model).
