Leanstral-2603 MLX 3-bit

MLX 3-bit quantization of mistralai/Leanstral-2603.

Architecture: Mistral-Small-4 — 119B total / 6.5B active per token, MoE (128 experts, 4 active), MLA attention, YARN + Llama-4 RoPE scaling, Pixtral vision encoder
Quantization: 3-bit affine, group_size=64, 3.602 bits/weight average
- mlp.gate (router) kept at 8-bit per layer
- lm_head and the vision tower / multimodal projector kept at full precision
Size: ~50 GB
Format: MLX safetensors (Apple Silicon)
Recommended hardware: Apple Silicon with >= 64 GB unified memory
Note: 3-bit is the most aggressive quant in this family. Expect some quality degradation vs the 4-bit and higher variants; use 4-bit or higher for fidelity-sensitive work.
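
The quantization and size figures above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not a measurement. It assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias next to the 3-bit codes (this card does not state the storage layout), and it applies the quoted 3.602 bits/weight average to all 119B parameters. The function names are illustrative only.

```python
# Back-of-the-envelope check of the quantization figures quoted in this card.
# Assumption (not stated in the card): every group of 64 weights stores one
# 16-bit scale and one 16-bit bias in addition to the 3-bit codes.

Q_BITS = 3
GROUP_SIZE = 64
SCALE_BITS = 16  # assumed
BIAS_BITS = 16   # assumed


def affine_bits_per_weight(q_bits, group_size,
                           scale_bits=SCALE_BITS, bias_bits=BIAS_BITS):
    """Effective bits per weight for group-wise affine quantization."""
    return q_bits + (scale_bits + bias_bits) / group_size


def size_gib(total_params, bits_per_weight):
    """Approximate on-disk size in GiB for a given average bit width."""
    return total_params * bits_per_weight / 8 / 2**30


base_bpw = affine_bits_per_weight(Q_BITS, GROUP_SIZE)
est_gib = size_gib(119e9, 3.602)  # 119B total params, 3.602 bits/weight average

print(f"3-bit layers alone: {base_bpw:.3f} bits/weight")
print(f"estimated size:     {est_gib:.1f} GiB")
```

Under these assumptions the 3-bit layers alone come to 3.5 bits/weight, so the remaining gap up to the 3.602 average is plausibly the 8-bit routers and the full-precision lm_head and vision components. The size estimate lands close to the ~50 GB quoted above.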

Usage

from mlx_vlm import load, generate

model, processor = load("mvid/Leanstral-2603-MLX-3bit")
output = generate(
    model,
    processor,
    "Prove that the sum of two even numbers is even in Lean 4.",
    max_tokens=4096,
)
print(output)

Conversion

Produced from the Mistral consolidated format via:

1. A streaming variant of transformers/models/mistral4/convert_mistral4_weight_to_hf.py that fuses MoE experts layer-by-layer instead of holding the full state dict in RAM, keeping FP8 storage in the HF intermediate (peak memory ~14 GB).
2. mlx_vlm convert -q --q-bits 3 --q-group-size 64 --dtype bfloat16, with a per-tensor mx.eval + mx.synchronize save patch to avoid macOS Metal command-buffer watchdog timeouts on the 119B model.
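
The layer-by-layer fusing idea in the first step can be illustrated with a small generator. This is a toy sketch of the general streaming pattern, not the actual conversion script: the key layout ("layers.<layer>.experts.<expert>.w"), the function names, and the use of plain Python lists in place of real weight arrays are all hypothetical. It also assumes that all tensors of one layer arrive contiguously.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Tensor = List[float]  # stand-in for a real weight array


def stream_fused_experts(
    tensors: Iterable[Tuple[str, Tensor]],
) -> Iterator[Tuple[str, List[Tensor]]]:
    """Yield one stacked expert tensor per layer, buffering one layer at a time.

    Hypothetical key layout: "layers.<layer>.experts.<expert>.w", with all
    tensors of a layer arriving contiguously.
    """
    current_layer = None
    bucket: Dict[int, Tensor] = {}
    for name, tensor in tensors:
        parts = name.split(".")
        layer, expert = int(parts[1]), int(parts[3])
        if current_layer is not None and layer != current_layer:
            # Layer boundary: emit the finished layer and drop its buffer.
            yield (f"layers.{current_layer}.experts.fused",
                   [bucket[e] for e in sorted(bucket)])
            bucket = {}
        current_layer = layer
        bucket[expert] = tensor
    if bucket:
        # Flush the last layer.
        yield (f"layers.{current_layer}.experts.fused",
               [bucket[e] for e in sorted(bucket)])


def fake_checkpoint() -> Iterator[Tuple[str, Tensor]]:
    """Lazily produce toy weights: 2 layers x 3 experts."""
    for layer in range(2):
        for expert in (2, 0, 1):  # deliberately out of order within a layer
            yield f"layers.{layer}.experts.{expert}.w", [float(layer), float(expert)]


fused = list(stream_fused_experts(fake_checkpoint()))
for name, stack in fused:
    print(name, stack)
```

Because the generator only ever buffers the experts of the layer currently being read, peak memory scales with one layer rather than with the whole state dict. A real converter would write each fused tensor to disk as soon as it is yielded instead of collecting the results in a list.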

License: Apache 2.0 (inherited from the base model).
