Leanstral-2603 MLX 3-bit

MLX 3-bit quantization of mistralai/Leanstral-2603.

  • Architecture: Mistral-Small-4 — 119B total / 6.5B active per token, MoE (128 experts, 4 active), MLA attention, YARN + Llama-4 RoPE scaling, Pixtral vision encoder
  • Quantization: 3-bit affine, group_size=64, 3.602 bits/weight average
    • mlp.gate (router) kept at 8-bit per layer
    • lm_head and the vision tower / multimodal projector kept at full precision
  • Size: ~50 GB
  • Format: MLX safetensors (Apple Silicon)
  • Recommended hardware: Apple Silicon with >= 64 GB unified memory
  • Note: 3-bit is the most aggressive quantization in this family. Expect some quality degradation relative to the 4-bit and higher variants, and prefer those for fidelity-sensitive work.
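
The bits-per-weight and size figures above can be sanity-checked with a little arithmetic. This is a rough sketch, not the converter's own accounting. It assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias (the usual layout for affine quantization), and it applies the rounded 119B parameter count uniformly. The helper names are hypothetical.

```python
def affine_bits_per_weight(bits: int, group_size: int,
                           scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective bits per weight for affine group quantization.

    Assumption: one scale and one bias are stored per group, so their
    cost is amortized over group_size weights.
    """
    return bits + (scale_bits + bias_bits) / group_size


def checkpoint_size_gib(n_params: float, avg_bits: float) -> float:
    """Approximate on-disk size in GiB for n_params weights at avg_bits each."""
    return n_params * avg_bits / 8 / 2**30


# Pure 3-bit, group_size=64 layers: 3 + 32/64 bits per weight.
base_bits = affine_bits_per_weight(bits=3, group_size=64)

# Whole-model estimate using the reported 3.602 bits/weight average.
size_gib = checkpoint_size_gib(n_params=119e9, avg_bits=3.602)

print(f"3-bit layers: {base_bits} bits/weight")
print(f"estimated size: {size_gib:.1f} GiB")
```

Under these assumptions the 3-bit layers alone come to 3.5 bits/weight. The gap up to the reported 3.602 average is consistent with the 8-bit router and the full-precision lm_head and vision components. The size estimate lands close to the ~50 GB figure if that figure is read as GiB, which leaves limited headroom on a 64 GB machine.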

Usage

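from mlx_vlm import load, generate

# Load the quantized weights and the processor (fetched from the Hugging Face Hub on first use).
model, processor = load("mvid/Leanstral-2603-MLX-3bit")

# Text-only prompt; max_tokens leaves room for a long Lean 4 proof.
output = generate(
    model,
    processor,
    "Prove that the sum of two even numbers is even in Lean 4.",
    max_tokens=4096,
)
print(output)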

Conversion

Produced from the Mistral consolidated format via:

  1. A streaming variant of transformers/models/mistral4/convert_mistral4_weight_to_hf.py that fuses MoE experts layer-by-layer instead of holding the full state dict in RAM, keeping FP8 storage in the HF intermediate (peak memory ~14 GB).
  2. mlx_vlm convert -q --q-bits 3 --q-group-size 64 --dtype bfloat16, with the save step patched to call mx.eval and mx.synchronize per tensor. This avoids macOS Metal command-buffer watchdog timeouts on the 119B model.
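
The layer-by-layer fusion idea in step 1 can be illustrated with a small sketch. This is not the actual conversion script. The checkpoint key layout (`layers.<L>.experts.<E>.<proj>.weight`), the fused output names, and the `load_tensor` callback are all hypothetical, and plain Python lists stand in for tensors. The point is only that tensors are loaded one layer at a time, so peak memory is bounded by a single layer's experts rather than the full state dict.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical per-expert key layout, used only for illustration.
EXPERT_KEY = re.compile(r"layers\.(\d+)\.experts\.(\d+)\.(\w+)\.weight$")


def fuse_experts_streaming(
    keys: Iterable[str],
    load_tensor: Callable[[str], object],
) -> Iterator[Tuple[str, List[object]]]:
    """Yield (fused_name, stacked_expert_tensors) one layer at a time.

    Only key names are indexed up front; tensors are loaded lazily when
    their layer is reached, so earlier layers can be freed by the caller.
    """
    # layer -> projection -> expert index -> original key
    index: Dict[int, Dict[str, Dict[int, str]]] = defaultdict(
        lambda: defaultdict(dict)
    )
    for key in keys:
        match = EXPERT_KEY.search(key)
        if match:
            layer, expert, proj = int(match[1]), int(match[2]), match[3]
            index[layer][proj][expert] = key

    for layer in sorted(index):
        for proj in sorted(index[layer]):
            experts = index[layer][proj]
            # Load this projection's experts in expert order; a real
            # converter would stack them into a single tensor here.
            stacked = [load_tensor(experts[e]) for e in sorted(experts)]
            yield f"layers.{layer}.experts.{proj}.weight", stacked


# Tiny fake checkpoint: 2 layers, 3 experts, 2 projections per expert.
fake_store = {
    f"layers.{l}.experts.{e}.{p}.weight": [l, e]
    for l in range(2) for e in range(3) for p in ("w1", "w2")
}
fake_store["layers.0.attention.wq.weight"] = [0]  # non-expert key, skipped

fused = list(fuse_experts_streaming(fake_store.keys(), fake_store.__getitem__))
for name, stacked in fused:
    print(name, len(stacked))
```

In a real converter, each yielded item would be written to the output shard before the next layer is loaded, which is what keeps peak memory low.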

License: Apache 2.0 (inherited from the base model).
