leonsarmiento/gemma-4-31B-it-4bit-mlx

This model, leonsarmiento/gemma-4-31B-it-4bit-mlx, was converted to MLX format from google/gemma-4-31B-it using mlx-lm version 0.31.2.

Quantization Details

| Layer | Bits | Group size |
| --- | --- | --- |
| embed_tokens | 5 | 64 |
| All other quantizable layers | 4 | 64 |

  • Quantization type: Mixed (4.5-bit)
  • Total output size: ~13.8 GB
  • Method: mlx_lm.convert with custom quant_predicate
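
The conversion script itself is not included here. As a rough sketch of how the per-layer settings listed above could be expressed, the predicate below returns quantization parameters for each layer. The function name `mixed_quant_predicate`, the `embed_tokens` suffix match, and the tolerant `*_unused` argument (the predicate signature has varied across mlx-lm releases) are assumptions for illustration, not the author's code.

```python
def mixed_quant_predicate(path, module, *_unused):
    """Hypothetical predicate: 5-bit embeddings, 4-bit everything else."""
    # Skip modules that cannot be quantized (for example, norm layers).
    if not hasattr(module, "to_quantized"):
        return False
    # Assumption: the embedding layer's path ends with "embed_tokens".
    if path.endswith("embed_tokens"):
        return {"bits": 5, "group_size": 64}
    # All other quantizable layers use 4 bits with the same group size.
    return {"bits": 4, "group_size": 64}


# Possible usage (assumes mlx-lm is installed; the output path is illustrative):
# from mlx_lm import convert
# convert(
#     "google/gemma-4-31B-it",
#     mlx_path="gemma-4-31B-it-4bit-mlx",
#     quantize=True,
#     quant_predicate=mixed_quant_predicate,
# )
```

Returning a dictionary rather than `True` is what lets a single pass assign different bit widths to different layers.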

Recommended Inference Parameters

For the best performance, use the following standardized sampling configuration across all use cases:

| Parameter | Value |
| --- | --- |
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
| min_p | 0.05 |
| repeat_penalty | 1.05 |
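
When running the model through mlx-lm directly, these values can be passed via `make_sampler` and `make_logits_processors` from `mlx_lm.sample_utils`. The sketch below assumes a recent mlx-lm release in which `make_sampler` accepts `top_k`, and it assumes that `repeat_penalty` corresponds to mlx-lm's `repetition_penalty`. The helper name `build_generation_kwargs` is made up for illustration.

```python
# Recommended values from this model card.
SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
    "min_p": 0.05,
    "repeat_penalty": 1.05,
}


def build_generation_kwargs(params=SAMPLING):
    """Hypothetical helper: turn the recommended values into generate() kwargs."""
    # Imported lazily so the settings can be inspected without mlx-lm installed.
    from mlx_lm.sample_utils import make_logits_processors, make_sampler

    sampler = make_sampler(
        temp=params["temperature"],
        top_p=params["top_p"],
        min_p=params["min_p"],
        top_k=params["top_k"],
    )
    # Assumption: repeat_penalty maps onto mlx-lm's repetition_penalty.
    processors = make_logits_processors(
        repetition_penalty=params["repeat_penalty"],
    )
    return {"sampler": sampler, "logits_processors": processors}


# Possible usage, given a loaded model, tokenizer, and prompt:
# response = generate(model, tokenizer, prompt=prompt, **build_generation_kwargs())
```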

LM Studio — Reasoning Section Parsing

To enable thinking/reasoning output parsing:

  • Start string: <|channel>thought
  • End string: <channel|>

Add to the Jinja template:

{%- set enable_thinking = true %}
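
Outside LM Studio, the same start and end strings can be used to separate the reasoning from the final answer in raw generated text. This is a minimal sketch, assuming the markers appear verbatim in the decoded output; the function name `split_reasoning` is hypothetical.

```python
START = "<|channel>thought"
END = "<channel|>"


def split_reasoning(text, start=START, end=END):
    """Return (reasoning, answer) extracted from raw model output."""
    s = text.find(start)
    if s == -1:
        # No reasoning section: the whole text is the answer.
        return "", text.strip()
    e = text.find(end, s + len(start))
    if e == -1:
        # Unterminated section: treat everything after the start marker as reasoning.
        return text[s + len(start):].strip(), text[:s].strip()
    reasoning = text[s + len(start):e].strip()
    # The answer is whatever surrounds the reasoning section.
    answer = (text[:s] + text[e + len(end):]).strip()
    return reasoning, answer
```

Only the first reasoning section is extracted; output with several sections would need a loop or a regular expression instead.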

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("leonsarmiento/gemma-4-31B-it-4bit-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)