Granite 4.0-H-Small — Cerebellum GGUF (14.2 GB)
Ablation-informed mixed-precision quantization of ibm-granite/granite-4.0-h-small. File size: 14.2 GB. Measured WikiText-2 perplexity: 6.4580 (+1.90% vs. the Q3_K_M baseline of 6.3376).
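For reference, the headline delta is simply the relative change against that baseline:

$$
\Delta_{\text{PPL}} = \frac{6.4580 - 6.3376}{6.3376} \approx +1.90\%
$$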
This is the first Cerebellum build for a hybrid Mamba-2 + Transformer MoE architecture. The ablation revealed that routed expert weights are sensitive in this model while shared expert weights tolerate aggressive demotion — the opposite of what we expected from dense transformer MoE patterns.
Benchmarks
| Benchmark | Cerebellum (14.2 GB) |
|---|---|
| WikiText-2 PPL | 6.4580 |
| HellaSwag | 87.1% |
| ARC-Challenge | 90.7% |
| MMLU-Redux | 68.6% |
All benchmarks measured directly on this file.
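The perplexity number should be reproducible with llama.cpp's llama-perplexity tool. A minimal sketch, assuming a local llama.cpp build and the raw WikiText-2 test split (file paths are illustrative):

```bash
# Perplexity over WikiText-2 (raw test split); -ngl 99 offloads all layers to the GPU.
llama-perplexity \
  --model Granite-4.0-H-Small-Cerebellum.gguf \
  --file wikitext-2-raw/wiki.test.raw \
  -ngl 99
```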
What Changed
Three shared expert tensor groups (120 tensors) demoted from Q3_K to Q2_K:
| Group | Layers | PPL Delta | Size Saved |
|---|---|---|---|
| shared_mlp.input_linear (gate) | 40 | — | — |
| shared_mlp.input_linear (up) | 40 | — | — |
| shared_mlp.output_linear (down) | 40 | — | — |
| Combined | 120 | +1.90% | 0.1 GB |
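A rough sketch of how such a demotion can be expressed with llama-quantize, assuming a recent llama.cpp build that supports per-tensor overrides via --tensor-type. The imatrix file name and the shared-expert tensor patterns (ffn_*_shexp is the usual GGUF naming) are assumptions, not the exact recipe used for this file:

```bash
# Requantize from the F16 source using an importance matrix.
# Only the shared-expert projections are forced to Q2_K; everything else
# follows the standard Q3_K_M recipe.
llama-quantize \
  --imatrix granite-4.0-h-small.imatrix \
  --tensor-type ffn_gate_shexp=q2_k \
  --tensor-type ffn_up_shexp=q2_k \
  --tensor-type ffn_down_shexp=q2_k \
  Granite-4.0-H-Small-F16.gguf \
  Granite-4.0-H-Small-Cerebellum.gguf \
  Q3_K_M
```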
What We Tested (and Kept at Q3_K)
| Group | PPL Delta | Verdict |
|---|---|---|
| ffn_gate_exps (routed experts) | +4.95% | Keep at Q3_K |
| ssm_out (Mamba output) | +5.01% | Keep at Q3_K |
| ffn_up_exps (routed experts) | +6.04% | Keep at Q3_K |
| ffn_down_exps (routed experts) | +10.63% | Keep at Q3_K |
| ssm_in (Mamba input) | +12.85% | Keep at Q3_K |
Key Finding: Routed Experts Are Sensitive
In dense MoE models like Qwen 3.6 35B, expert gate/up/down weights tolerate Q2_K easily (+1-2%). In Granite 4.0-H-Small, the opposite is true:
- Shared experts (always active): Tolerant. Q2_K adds only +1.90% PPL.
- Routed experts (72 per layer, 10 active): Sensitive. Q2_K adds +5-13% PPL.
This is likely because Granite's expert FFN intermediate size is only 768 (vs 1536+ in larger models). Smaller weight matrices are more sensitive to quantization noise.
Architecture
| Parameter | Value |
|---|---|
| Total params | 32B |
| Active params | 9B per token |
| Layers | 40 |
| Experts | 72 per layer (10 active) |
| Full attention layers | 4 (positions 5, 15, 25, 35) |
| Mamba-2 layers | 36 |
| Context | 128K |
How to Run
```bash
llama-server --model Granite-4.0-H-Small-Cerebellum.gguf -ngl 99 --ctx-size 4096
```
Fits on a 24GB GPU with room for context.
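Once the server is running, it exposes an OpenAI-compatible API (port 8080 by default); a quick smoke test:

```bash
# Minimal chat completion request against the local llama-server instance.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain mixed-precision quantization in one sentence."}],
        "max_tokens": 128
      }'
```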
Method: Cerebellum
Cerebellum is sensitivity-guided mixed-precision quantization. We measure the PPL impact of demoting each tensor group individually, then only demote groups that stay under a threshold. Sacred tensors (routers, norms, embeddings) are never touched.
Steps:
- Start from high-quality imatrix Q3_K_M base
- Group tensors by function (120 shared experts, 120 routed experts, 72 Mamba projections, etc.)
- Demote each group to Q2_K individually and measure the PPL delta (see the sketch after this list)
- Only ship groups that pass the threshold (+3% max)
- Verify combined build doesn't compound beyond acceptable range
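A condensed sketch of the per-group ablation loop (steps 2-4), again assuming a llama-quantize build with --tensor-type overrides; the group patterns, file names, and threshold handling are illustrative:

```bash
#!/usr/bin/env bash
# For each candidate tensor group, build a one-off quant with only that group
# demoted to Q2_K, measure WikiText-2 perplexity, and record the delta.
BASE_PPL=6.3376   # measured on the Q3_K_M baseline
GROUPS=(
  "ffn_gate_shexp" "ffn_up_shexp" "ffn_down_shexp"
  "ffn_gate_exps" "ffn_up_exps" "ffn_down_exps"
  "ssm_in" "ssm_out"
)

for g in "${GROUPS[@]}"; do
  llama-quantize --imatrix granite-4.0-h-small.imatrix \
    --tensor-type "${g}=q2_k" \
    Granite-4.0-H-Small-F16.gguf "ablate-${g}.gguf" Q3_K_M

  llama-perplexity -m "ablate-${g}.gguf" \
    -f wikitext-2-raw/wiki.test.raw -ngl 99 2>&1 | tee "ppl-${g}.log"

  # Read the final PPL out of the log and compute (ppl - BASE_PPL) / BASE_PPL;
  # groups above the +3% threshold stay at Q3_K in the shipped build.
done
```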
Quantized by deucebucket using Cerebellum methodology.