Leanstral-2603 MLX 4-bit

MLX 4-bit quantization of mistralai/Leanstral-2603.

Architecture: Mistral-Small-4, 119B total / 6.5B active parameters per token; MoE (128 experts, 4 active per token), MLA attention, YaRN + Llama-4 RoPE scaling, Pixtral vision encoder
Quantization: 4-bit affine, group_size=64, 4.594 bits/weight average
- mlp.gate (router) kept at 8-bit per layer
- lm_head and the vision tower / multimodal projector kept at full precision
Size: ~64 GB
Format: MLX safetensors (Apple Silicon)
Recommended hardware: Apple Silicon with >= 80 GB unified memory (M3 Ultra, M5 with 128 GB, etc.)
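
The bits-per-weight and size figures above can be sanity-checked with a little arithmetic. The sketch below assumes that each quantization group stores one 16-bit scale and one 16-bit bias alongside its packed weights. That is the usual layout for affine group quantization, but it is an assumption here, not something stated on this card. Under that assumption a plain 4-bit, group_size=64 tensor costs 4.5 bits/weight; the reported 4.594 average is higher, presumably because the 8-bit routers and the unquantized tensors pull it up. The function name is illustrative only.

```python
def affine_bits_per_weight(bits: int, group_size: int,
                           scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective storage cost per weight for affine group quantization.

    Assumption: one scale and one bias per group, each stored in 16 bits.
    """
    return bits + (scale_bits + bias_bits) / group_size


# Per-tensor cost for the two quantized settings named on this card.
expert_bits = affine_bits_per_weight(4, 64)   # 4-bit tensors -> 4.5
router_bits = affine_bits_per_weight(8, 64)   # 8-bit mlp.gate -> 8.5

# Rough checkpoint size from the reported average (figures from this card).
total_params = 119e9
avg_bits = 4.594
size_bytes = total_params * avg_bits / 8

print(f"4-bit tensors: {expert_bits} bits/weight")
print(f"8-bit routers: {router_bits} bits/weight")
print(f"Estimated size: {size_bytes / 1e9:.1f} GB ({size_bytes / 2**30:.1f} GiB)")
```

The estimate comes out at roughly 68 GB in decimal units, or about 64 GiB, which is consistent with the listed size if "~64 GB" is read as binary gigabytes. Runtime memory also has to hold the KV cache and activations, which is one reason the recommended unified memory is well above the checkpoint size.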

Usage

from mlx_vlm import load, generate

model, processor = load("mvid/Leanstral-2603-MLX-4bit")
output = generate(
    model,
    processor,
    "Prove that the sum of two even numbers is even in Lean 4.",
    max_tokens=4096,
)
print(output)

Conversion

Produced from the Mistral consolidated format via:

1. Conversion to the HF format: a streaming variant of transformers/models/mistral4/convert_mistral4_weight_to_hf.py that fuses MoE experts layer by layer instead of holding the full state dict in RAM, and keeps FP8 storage in the HF intermediate (peak memory ~14 GB).
2. Quantization: mlx_vlm convert -q --q-bits 4 --q-group-size 64 --dtype bfloat16, with a save patch that calls mx.eval + mx.synchronize per tensor to avoid macOS Metal command-buffer watchdog timeouts on the 119B model.
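
The layer-by-layer fusion described above can be illustrated with a minimal, standard-library-only sketch. It is not the actual conversion script: the key layout ("layers.N.experts.E.PROJ.weight"), the function names, and the use of plain lists in place of tensors are all hypothetical. The point it shows is the control flow: group expert keys by layer, then load and fuse one layer at a time through a generator so that only a single layer's experts are resident at once.

```python
import re
from collections import defaultdict

# Hypothetical key layout, e.g. "layers.3.experts.17.w1.weight".
EXPERT_KEY = re.compile(r"^layers\.(\d+)\.experts\.(\d+)\.(\w+)\.weight$")


def group_expert_keys(keys):
    """Map layer index -> projection name -> expert keys in numeric expert order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for key in keys:
        match = EXPERT_KEY.match(key)
        if match:
            layer, expert, proj = int(match.group(1)), int(match.group(2)), match.group(3)
            grouped[layer][proj].append((expert, key))
    # Sort numerically so expert 10 comes after expert 2, not before it.
    return {
        layer: {proj: [k for _, k in sorted(pairs)] for proj, pairs in projs.items()}
        for layer, projs in grouped.items()
    }


def iter_fused_layers(keys, load_tensor):
    """Yield (layer, fused_tensors) lazily, one layer at a time.

    `load_tensor` is a caller-supplied function that reads a single tensor
    from disk. A real script would stack the loaded experts into one array;
    here a list stands in for the stacked tensor.
    """
    grouped = group_expert_keys(keys)
    for layer in sorted(grouped):
        fused = {
            f"layers.{layer}.experts.{proj}": [load_tensor(k) for k in expert_keys]
            for proj, expert_keys in grouped[layer].items()
        }
        yield layer, fused  # caller writes a shard, then drops the reference


# Toy "checkpoint": 2 layers x 3 experts, with small lists standing in for tensors.
store = {
    f"layers.{layer}.experts.{expert}.w1.weight": [layer, expert]
    for layer in range(2)
    for expert in range(3)
}
loads = []


def load_tensor(key):
    loads.append(key)  # record access order to show that loading is lazy
    return store[key]


for layer, fused in iter_fused_layers(store.keys(), load_tensor):
    print(layer, sorted(fused), "tensors loaded so far:", len(loads))
```

Because the generator only touches a layer's tensors when that layer is requested, peak memory is bounded by one layer's experts rather than by the whole state dict, provided the caller writes each fused layer out before asking for the next.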

License: Apache 2.0 (inherited from the base model).
