Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

# Downloads the GGUF from the Hugging Face Hub on first use and caches it
# locally; extra Llama() kwargs such as n_gpu_layers or n_ctx pass through here.
llm = Llama.from_pretrained(
	repo_id="deucebucket/Gemma-4-26B-A4B-it-Cerebellum-v6-GGUF",
	filename="gemma-4-26B-A4B-it-cerebellum-v6.gguf",
)

llm.create_chat_completion(
	messages=[
		{
			"role": "user",
			"content": "What is the capital of France?",
		}
	]
)

Gemma 4 26B-A4B-it — Cerebellum v6 GGUF

Numbers under audit (2026-05-08) — an internal review found the v6 benchmark numbers below need to be re-measured against the same protocol used for v1–v4. A clean re-run with audited wrong-answers and per-question JSONLs is underway. The GGUF file itself is unchanged — this is a measurement issue, not a model issue. Treat the table as preliminary until corrected numbers replace it.

Cerebellum v6 is an ablation-guided mixed-precision GGUF quantization of google/gemma-4-26B-A4B-it.

This is a 26B-parameter MoE model with 4B active parameters per token, 128 experts per layer, and 30 layers. This release uses tensor-level precision overrides selected from 140+ ablation experiments across six internal iterations, including per-layer MoE router surgery.

At a Glance

File          | gemma-4-26B-A4B-it-cerebellum-v6.gguf
Size          | 11.7 GB
Base model    | google/gemma-4-26B-A4B-it
Base quant    | Q3_K_M with bartowski's imatrix
Format        | GGUF, mixed precision
Test hardware | RTX 3090, llama.cpp

Benchmarks

Benchmark        | Result
WikiText PPL     | 12,054
HumanEval pass@1 | 72.0%
ARC-Challenge    | 95.6%
HellaSwag        | 84.7%
MMLU-Redux       | 71.2%

All results measured locally on an RTX 3090 with llama.cpp. PPL was measured on the WikiText-2 test set with 2048 context and 128 chunks.

PPL is high in absolute terms for this model. This appears consistent across Gemma 4 26B quant levels tested locally and may reflect the model's MoE routing behavior on WikiText specifically.
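
For reference, that protocol maps onto llama.cpp's perplexity tool roughly as follows; the binary and dataset paths are assumptions, not taken from this card:

# WikiText-2 PPL: 2048 context, 128 chunks, full GPU offload
./llama-perplexity -m gemma-4-26B-A4B-it-cerebellum-v6.gguf \
	-f wikitext-2-raw/wiki.test.raw -c 2048 --chunks 128 -ngl 99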

What Changed: v1 Through v6

Each version added a new layer of ablation data. The method is always the same: change one thing, measure PPL, keep it only if it helps.

Version | PPL    | HumanEval | What Changed
v1      | 20,614 | 65.2%     | Group-level ablation: 5 tensor groups tested at Q2_K
v2      | 19,826 | 65.9%     | + attn_q per-layer ablation (30 layers tested, 9 promoted to Q5_K)
v3      | 19,826 | 67.1%     | + PLE protection (norms/scales forced to F32)
v4      | 12,614 | 69.5%     | + ffn_up per-layer ablation + precision rebalance
v5      | 12,356 | 71.3%     | + attn_k reverse ablation (30 layers tested, 7 promoted to Q3_K)
v6      | 12,054 | 72.0%     | + MoE router surgery: layer 8 ffn_gate_inp F32→Q8_0

How Cerebellum Works

Cerebellum assigns quantization precision per tensor based on measured impact. Each tensor group and individual layer is tested by changing its precision and measuring perplexity. Only changes that improve or maintain quality are kept.
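
The ablation harness itself isn't published with this card; as a sketch of the decision loop it describes, with quantize_with_overrides and measure_ppl as hypothetical stand-ins for llama-quantize and llama-perplexity runs:

# Sketch of the one-change-at-a-time ablation loop described above.
# quantize_with_overrides() and measure_ppl() are hypothetical helpers
# standing in for llama-quantize / llama-perplexity invocations.

def ablate(candidates, baseline_ppl, quantize_with_overrides, measure_ppl):
    """Greedily keep only precision changes that improve or hold PPL.

    candidates: list of (tensor_pattern, precision) trials, e.g.
    ("ffn_up", "Q2_K") at group level, or ("blk.23.attn_k.weight", "Q3_K")
    when a group with mixed results is escalated to per-layer testing.
    """
    kept = {}
    for pattern, precision in candidates:
        trial = {**kept, pattern: precision}      # change exactly one thing
        ppl = measure_ppl(quantize_with_overrides(trial))
        if ppl <= baseline_ppl:                   # improved or maintained: keep
            kept, baseline_ppl = trial, ppl
    return kept, baseline_ppl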

Group Ablation

Each tensor category was tested at Q2_K and measured by PPL impact:

Group          | Tensors | PPL Delta | Action
attn_q         | 30      | +13.4%    | Per-layer testing (9 layers need Q5_K)
ffn_gate       | 30      | -1.2%     | Left at Q3_K
expert_gate_up | 30      | -5.5%     | Set to Q2_K
attn_k         | 30      | -12.1%    | Per-layer testing (7 layers benefit from Q3_K)
ffn_up         | 30      | -18.2%    | Set to Q2_K

Three of the five tested groups had meaningfully lower PPL at Q2_K (ffn_gate's -1.2% was treated as neutral and left at Q3_K): Q3_K_M was spending bits on tensors that don't need them.

Layer Ablation

Groups with mixed results were tested per layer:

  • attn_q: All 30 layers tested individually at Q2_K. 9 layers exceeded the sensitivity threshold and stay at Q5_K. The other 21 tolerate Q2_K.
  • attn_k: All 30 layers tested individually. 7 layers showed PPL improvement when promoted from Q2_K to Q3_K (layer 23: -3.8%, layer 18: -2.8%). 4 layers (5, 11, 16, 29) were confirmed better at Q2_K.

MoE Router Surgery (New in v6)

llama-quantize ignores --tensor-type-file overrides for ffn_gate_inp.weight (MoE router) tensors. We built gguf_tensor_surgery.py to recast individual tensors directly in the GGUF file.
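
gguf_tensor_surgery.py isn't reproduced in this card; as a rough illustration of the core recast step, here is a minimal sketch using the gguf Python package that ships with llama.cpp. The input filename is hypothetical (the released v6 file already carries the recast), and writing the result back is only sketched in comments:

# Illustrative only; not the actual gguf_tensor_surgery.py.
import numpy as np
from gguf import GGUFReader
from gguf.constants import GGMLQuantizationType
from gguf.quants import quantize, dequantize

TARGET = "blk.8.ffn_gate_inp.weight"   # layer 8 MoE router

# Hypothetical pre-surgery file (the published v6 is already recast).
reader = GGUFReader("gemma-4-26B-A4B-it-cerebellum-v5.gguf")
tensor = next(t for t in reader.tensors if t.name == TARGET)

# Round-trip the F32 router weights through Q8_0 (32-element blocks,
# one fp16 scale each) to inspect what the recast does to them.
f32 = np.asarray(tensor.data, dtype=np.float32).reshape(-1)
q8 = quantize(f32, GGMLQuantizationType.Q8_0)
restored = dequantize(q8, GGMLQuantizationType.Q8_0)
print(f"{TARGET}: max |error| = {np.abs(f32 - restored).max():.6f}")

# Writing the recast tensor back means rebuilding the file with GGUFWriter,
# copying the metadata and every other tensor unchanged; omitted here.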

All 30 router layers were tested individually at Q8_0 (F32→Q8_0):

Layer | PPL    | Delta | Category
8     | 12,054 | -2.4% | Best universal candidate
10    | 11,872 | -3.9% | Best PPL but regresses HumanEval (-9.7%)
6     | 11,988 | -3.0% | Win (not stacked; routing compensation)
9     | 12,044 | -2.5% | Win (not stacked)
12    | 12,041 | -2.5% | Win (not stacked)
23    | 12,052 | -2.5% | Win (not stacked)
0     | 12,974 | +5.0% | Sensitive
1     | 13,525 | +9.5% | Very sensitive
2     | 13,239 | +7.1% | Sensitive
4     | 13,047 | +5.6% | Sensitive

Why layer 8 and not layer 10: Layer 10 had the best PPL improvement (-3.9%), but full HumanEval testing showed it regresses code generation from 71.3% to 61.6%. Layer 10's router controls routing to code-relevant experts — degrading it hurts coding while helping general perplexity. Layer 8 improves PPL (-2.4%) AND HumanEval (+0.7%) with no regressions on any benchmark.

Router stacking doesn't work: Combined demotion of even the top 3 layers worsens PPL vs baseline. The model compensates for one degraded router but not multiple simultaneously. This is a routing compensation effect specific to MoE architectures.

Precision curve for layer 8's router:

Precision     | PPL    | Delta
F32 (default) | 12,356 | baseline
Q8_0          | 12,054 | -2.4%
Q4_0          | 12,355 | ~0%
Q6_K          | 14,317 | +15.9%
Q2_K          | 14,482 | +17.2%

Q8_0 is the only precision that improves PPL. K-quant formats (Q6_K, Q2_K) use 256-element super-blocks with sub-block scales — this structure disrupts the router's fine-grained expert selection. Q8_0's simpler per-block rounding acts as beneficial regularization.
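
For concreteness, the block layouts behind that argument can be summarized in a few lines. The element and byte counts below follow ggml's block definitions in llama.cpp; the snippet just turns them into bits-per-weight as a sanity check:

# Elements and bytes per block for the formats discussed above,
# per the block structs in llama.cpp's ggml quant definitions.
BLOCKS = {
    # name: (elements/block, bytes/block)
    "Q8_0": (32,  2 + 32),             # fp16 scale + 32 int8 values
    "Q4_0": (32,  2 + 16),             # fp16 scale + 32 packed 4-bit values
    "Q6_K": (256, 128 + 64 + 16 + 2),  # 6-bit values, 16 sub-block scales, fp16 super-scale
    "Q2_K": (256, 64 + 16 + 2 + 2),    # 2-bit values, packed sub-block scales/mins
}
for name, (elems, nbytes) in BLOCKS.items():
    print(f"{name}: {elems:3d} elems/block, {nbytes:3d} B, {8 * nbytes / elems:.3f} bpw")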

Final Precision Map (v6)

Tensor Type                         | Precision | Count | Rationale
attn_q (9 sensitive layers)         | Q5_K      | 9     | Layer-validated critical
attn_q (remaining)                  | Q2_K      | 21    | Group-level demotable
attn_k (7 promoted layers)          | Q3_K      | 7     | Reverse ablation: improve when promoted
attn_k (remaining)                  | Q2_K      | 23    | Group-level demotable
ffn_up                              | Q2_K      | 30    | Group PPL delta: -18.2%
expert_gate_up                      | Q2_K      | 30    | Group PPL delta: -5.5%
ffn_gate                            | Q3_K      | 30    | Tolerant (-1.2%)
ffn_gate_inp layer 8 (router)       | Q8_0      | 1     | Per-layer surgery: -2.4% PPL, +0.7% HumanEval
ffn_gate_inp (router, other layers) | F32       | 29    | Group PPL delta: +30.7% when demoted
Norms, scales                       | F32       | 392   | Structural: always full precision

91 tensor-level overrides + 1 surgical router recast on top of Q3_K_M base.
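
As a quick check of the resulting precision mix, the gguf Python package can count tensors per quantization type; this is an inspection sketch, not part of the original tooling:

# Count tensors per quantization type in the released file.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("gemma-4-26B-A4B-it-cerebellum-v6.gguf")
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(qtype, n)
# blk.8.ffn_gate_inp.weight should report Q8_0; the other 29 routers
# and the 392 norm/scale tensors should report F32.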

Usage

# llama.cpp
./llama-server -m gemma-4-26B-A4B-it-cerebellum-v6.gguf -ngl 99 -c 4096

# ollama
ollama create gemma4-cerebellum -f Modelfile
ollama run gemma4-cerebellum
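
The ollama command assumes a Modelfile next to the GGUF; the card doesn't ship one, but a minimal sketch (chat template and sampling left at ollama defaults) would be:

# Modelfile (minimal sketch)
FROM ./gemma-4-26B-A4B-it-cerebellum-v6.gguf
PARAMETER num_ctx 4096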

Fits in 24 GB VRAM at full GPU offload with room for 4K context.

Technical Details

  • Architecture: Gemma 4 26B — 26B total params, 4B active per token, 128 experts/layer, 30 layers
  • Base quant: Q3_K_M with bartowski imatrix
  • Ablation experiments: 140+ across 6 iterations (including 30-layer router surgery)
  • Quantizer: llama.cpp llama-quantize with --tensor-type-file overrides + gguf_tensor_surgery.py for router recast
  • Hardware: RTX 3090 (24 GB VRAM)

Credits

bartowski for the imatrix used in the Q3_K_M base quant; Google for the base model google/gemma-4-26B-A4B-it.
