Instructions to use eaddario/granite-4.1-30b-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use eaddario/granite-4.1-30b-GGUF with llama-cpp-python:

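# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the GGUF file from the Hugging Face Hub (cached after the first run) and load it
llm = Llama.from_pretrained(
    repo_id="eaddario/granite-4.1-30b-GGUF",
    filename="granite-4.1-30b-F16.gguf",
)

# Send a single-turn chat request; the response follows the OpenAI chat-completion schema
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
print(response["choices"][0]["message"]["content"])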

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use eaddario/granite-4.1-30b-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf eaddario/granite-4.1-30b-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf eaddario/granite-4.1-30b-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf eaddario/granite-4.1-30b-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf eaddario/granite-4.1-30b-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf eaddario/granite-4.1-30b-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf eaddario/granite-4.1-30b-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf eaddario/granite-4.1-30b-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf eaddario/granite-4.1-30b-GGUF:F16

Use Docker

docker model run hf.co/eaddario/granite-4.1-30b-GGUF:F16

LM Studio
Jan

vLLM

How to use eaddario/granite-4.1-30b-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "eaddario/granite-4.1-30b-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "eaddario/granite-4.1-30b-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/eaddario/granite-4.1-30b-GGUF:F16

Ollama
How to use eaddario/granite-4.1-30b-GGUF with Ollama:
```
ollama run hf.co/eaddario/granite-4.1-30b-GGUF:F16
```

Unsloth Studio

How to use eaddario/granite-4.1-30b-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for eaddario/granite-4.1-30b-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for eaddario/granite-4.1-30b-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for eaddario/granite-4.1-30b-GGUF to start chatting

Pi

How to use eaddario/granite-4.1-30b-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf eaddario/granite-4.1-30b-GGUF:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "eaddario/granite-4.1-30b-GGUF:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use eaddario/granite-4.1-30b-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf eaddario/granite-4.1-30b-GGUF:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default eaddario/granite-4.1-30b-GGUF:F16

Run Hermes

hermes

Docker Model Runner
How to use eaddario/granite-4.1-30b-GGUF with Docker Model Runner:
```
docker model run hf.co/eaddario/granite-4.1-30b-GGUF:F16
```

Lemonade

How to use eaddario/granite-4.1-30b-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull eaddario/granite-4.1-30b-GGUF:F16

Run and chat with the model

lemonade run user.granite-4.1-30b-GGUF-F16

List all available models

lemonade list

Experimental global target bits‑per‑weight quantization of ibm-granite/granite-4.1-30b

Using non-standard (forked) LLaMA C++ release b9358 for quantization.

Original model: ibm-granite/granite-4.1-30b

From the original model creators:

Granite-4.1-30B

Model Summary: Granite-4.1-30B is a 30B parameter long-context instruct model finetuned from Granite-4.1-30B-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. Granite 4.1 models have gone through an improved post-training pipeline, including supervised finetuning and reinforcement learning alignment, resulting in enhanced tool calling, instruction following, and chat capabilities.

Developers: Granite Team, IBM

HF Collection: Granite 4.1 Language Models HF Collection

Technical Blog: Granite-4.1 Blog

GitHub Repository: ibm-granite/granite-4.1-language-models

Website: Granite Docs

Release Date: April 29th, 2026

License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.1 models for languages beyond these languages.

Intended use: The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, as well as for LLM agents equipped with tool-use capabilities.

⚠️ PLEASE READ THIS BEFORE USING THESE EXPERIMENTAL VERSIONS! ⚠️

An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but my focus has been primarily on quantization and pruning.

The method to produce these experimental versions involves using a custom version of llama-imatrix to generate an imatrix that includes tensor statistics, and a custom version of llama-quantize, which computes a per-tensor quantization error, to automatically select the lowest error quantization recipe that achieves a global target bits‑per‑weight (bpw). More details on the implementation and test results here

There are two pull requests (#14891 & #15550) to merge these changes back into the core llama.cpp project. This may or may not ever happen so, until then, the modified versions will be available on GitHub.

For testing and comparison, I use models produced by Bartowski (see credits below) and Unsloth (Daniel and Michael Han do some really interesting stuff!) but when they don't provide versions of the required model, tests and comparisons are against standard quantization obtained by simply running llama-quantize with no further optimizations.

All experimental versions were generated using an appropriate imatrix created from datasets available at eaddario/imatrix-calibration. In llama.cpp, an imatrix is a calibration file derived from running representative text through the model and collecting activation statistics. It is used to weight quantization error so that error in more “important” directions (as estimated from activations) is penalized more heavily.
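To make the idea of importance-weighted error concrete, here is a minimal sketch in plain Python. It assumes one importance weight per element and a weighted squared error; the function name `weighted_sq_error` and the toy numbers are illustrative assumptions, not the llama.cpp implementation.

```python
def weighted_sq_error(original, quantized, importance):
    """Sum of squared errors, each scaled by its importance weight."""
    if not (len(original) == len(quantized) == len(importance)):
        raise ValueError("all inputs must have the same length")
    return sum(w * (x - q) ** 2 for x, q, w in zip(original, quantized, importance))


# Toy row of weights and hypothetical importance values (larger = more important).
row = [0.50, -1.20, 0.03, 2.00]
importance = [10.0, 1.0, 0.1, 1.0]

# Two candidate reconstructions with the same raw error (0.2), placed on different elements.
cand_a = [0.50, -1.00, 0.03, 2.00]  # error lands on a low-importance element
cand_b = [0.30, -1.20, 0.03, 2.00]  # error lands on a high-importance element

err_a = weighted_sq_error(row, cand_a, importance)
err_b = weighted_sq_error(row, cand_b, importance)
print(err_a, err_b)
```

Both candidates have identical unweighted error, but the weighted score penalizes `cand_b` roughly ten times more because its error falls on the element the calibration data marks as important.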

The process to generate these models is roughly as follows:

1. Convert the original model's safetensors to GGUF F16*
2. Estimate the Perplexity score for the F16 model (baseline) using the wikitext-2-raw-v1 dataset, and save the logits
3. Generate an imatrix from the most appropriate calibration dataset
4. Quantize the baseline model targeting a bpw average (e.g. llama-quantize --target-bpw 4.5678 --state-file --imatrix imatrix.gguf baseline-model-F16.gguf 12)
5. Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), GPQA-Diamond, HellaSwag, MMLU-Redux, Truthful QA and WinoGrande scores for each quantized model
6. Keep the version with the best 𝜌PPL and μKLD scores
7. Repeat until all desired quants are created

*BF16 would be preferred, but F16 performs better on Apple's GPUs

Advantages and disadvantages of the global target bits‑per‑weight quantization process

Advantages

Target arbitrary size models
- When specifying --target-bpw 4.5678, for instance, the algorithm will produce a model of (nearly) exactly that size, which is very useful for maximizing VRAM usage. In a system with 24GB of VRAM and a 70B model, standard quants might produce a 16.8GB file (too small, quality left on the table) or a 24.1GB file (won't fit). This approach can generate a 23.85GB file to utilize the hardware fully.
Data-driven mixed precision can often improve quality at a fixed size
- Instead of using hardcoded heuristics (e.g. make attn_v Q5_K for a 70B model), which may be sub‑optimal for a given architecture or size, the quantization mix is determined by the actual error sensitivity of the specific model's weights. In practice, this often yields a better quality/size trade-off, especially in aggressive quantization scenarios (1.5 to 3.5 bpw) or for unusual architectures.
- Please note: llama.cpp’s heuristics have been tuned across many models and are highly optimized; although the target bpw method often produces better quality (in more than 75% of cases, based on tests with 130 models from 11 different families), it can also lose in surprising cases.
Allows better like-for-like comparisons between models and families
- Standard llama.cpp quantization uses hardcoded rules like: "use Q4_K_M, except bump some tensors up/down, except fall back if incompatible, except keep some tensors unquantized..." and for that reason, two different models quantized with the same Q4_K_M type can end up with very different bpw (e.g. 4.75 and 4.30).
- All things being equal, the performance of a model is usually proportional to its overall bpw; models with a higher bpw tend to perform better than lower bpw models. A model that has simply been given more bits will typically perform better (lower perplexity, better eval scores, etc.) even if the underlying quantization method is identical. That means the comparison is not a controlled experiment, because it is between models with different effective compression ratios.
- --target-bpw tries to address that by making the experiment more controlled: each model gets quantized to land on (approximately) the same global byte budget, so that the models' performance differences are more attributable to architecture/training differences, quantization error behaviour at the same compression ratio, optimizer’s allocation decisions, etc.
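The sizing argument in the first advantage above comes down to simple arithmetic, sketched below. The helper names are illustrative assumptions, and the estimate ignores GGUF metadata overhead, so treat the results as approximations. The parameter count (28.87 B) is taken from the llama-bench table further down.

```python
def file_size_gb(n_params, bpw):
    """Approximate model size in decimal GB for an average bits-per-weight."""
    return n_params * bpw / 8 / 1e9


def target_bpw_for_budget(n_params, budget_gb):
    """Average bits-per-weight that would fill a given size budget (decimal GB)."""
    return budget_gb * 1e9 * 8 / n_params


N_PARAMS = 28.87e9  # granite-4.1-30b parameter count as reported by llama-bench

print(round(file_size_gb(N_PARAMS, 4.5), 1))           # 16.2, matching the 4.5 bpw row
print(round(target_bpw_for_budget(70e9, 23.85), 4))    # bpw for the 70B / 23.85GB example
```

Running the second helper on the 70B example gives the value one would pass to --target-bpw to land on a 23.85GB file, under the simplifying assumptions stated above.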

Disadvantages

Quantization process is significantly slower than standard
- This approach can take 5x-10x longer as it quantizes a sample of most tensors into 15 different formats, dequantizes them back to floats, computes error diffs, and selects the best size/error option that fits the global bpw budget.
- However, the --state-file option will save and reuse the above-mentioned computations so that future quantizations of the same model can be generated at normal speed. It also allows the computation to be interrupted and resumed at a later time.
The optimization target is only a proxy for the model's performance quality
- The process minimizes a per-tensor estimated error computed from sampled rows, not actual perplexity or divergence of output distributions (a future version may address this). Since errors interact nonlinearly across layers, there are no guarantees it will select the best possible quantization recipe subject to the bpw size constraint.
An imatrix with activations data is required for best results
- Activation data is required to compute the bias factor (i.e. the systematic error projected onto activation directions). If the imatrix file does not contain activation data, the --target-bpw option will refuse to run.
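The general shape of "pick a per-tensor format so that estimated error is low while the whole model fits a global bpw budget" can be illustrated with a simple greedy allocator. This is only a sketch under stated assumptions: the function name, the data layout, and the greedy "best error reduction per extra bit" rule are all hypothetical, and the actual selection logic in the modified llama-quantize may differ substantially.

```python
def choose_formats(tensors, target_bpw):
    """Greedy sketch of budget-constrained format selection.

    tensors: {name: (n_weights, [(bpw, est_error), ...])}, options sorted by ascending bpw.
    Returns ({name: chosen_bpw}, achieved_average_bpw).
    """
    total_weights = sum(n for n, _ in tensors.values())
    budget_bits = target_bpw * total_weights

    # Start every tensor at its smallest format.
    choice = {name: 0 for name in tensors}
    used_bits = sum(n * opts[0][0] for n, opts in tensors.values())
    if used_bits > budget_bits:
        raise ValueError("budget too small even for the smallest formats")

    while True:
        best = None  # (error reduction per extra bit, tensor name, extra bits)
        for name, (n, opts) in tensors.items():
            i = choice[name]
            if i + 1 >= len(opts):
                continue  # already at the largest format
            extra_bits = n * (opts[i + 1][0] - opts[i][0])
            gain = opts[i][1] - opts[i + 1][1]
            if extra_bits <= 0 or gain <= 0 or used_bits + extra_bits > budget_bits:
                continue  # no benefit, or the upgrade would exceed the budget
            ratio = gain / extra_bits
            if best is None or ratio > best[0]:
                best = (ratio, name, extra_bits)
        if best is None:
            break  # no affordable upgrade left
        _, name, extra_bits = best
        choice[name] += 1
        used_bits += extra_bits

    formats = {name: tensors[name][1][i][0] for name, i in choice.items()}
    return formats, used_bits / total_weights


# Hypothetical tensors: attn_v is small but error-sensitive, ffn_up is large but tolerant.
example = {
    "attn_v": (1000, [(2.0, 9.0), (4.0, 2.0), (6.0, 1.0)]),
    "ffn_up": (3000, [(2.0, 3.0), (4.0, 2.5), (6.0, 2.4)]),
}
formats, achieved = choose_formats(example, target_bpw=3.0)
print(formats, achieved)
```

In this toy case the budget is spent on the sensitive tensor while the tolerant one stays at its smallest format. A greedy rule like this also shows why the result is only a proxy: it treats per-tensor errors as independent and additive, which, as noted above, is not guaranteed to hold across layers.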

Models

Bits per weight, size, perplexity and KL Divergence scores

Model	BPW	Size (GB)	μPPL	𝜌PPL	μKLD	Same Top-P
granite-4.1-30b-F16	16.0003	57.7	8.691178 ±0.065443	100%	N/A	N/A
granite-4.1-30b-Q1_L	1.7500	6.32	21.045780 ±0.177675	68.02%	1.599866 ±0.005071	51.342 ±0.132
granite-4.1-30b-Q2_K	2.5000	9.02	10.559578 ±0.087728	84.49%	0.655759 ±0.003345	70.434 ±0.120
granite-4.1-30b-Q3_K	3.5000	12.6	8.337158 ±0.066691	94.00%	0.239979 ±0.001761	83.010 ±0.099
granite-4.1-30b-Q4_K	4.5000	16.2	7.776874 ±0.061960	98.06%	0.073560 ±0.000684	90.393 ±0.078
granite-4.1-30b-Q5_K	5.5000	19.8	7.592164 ±0.059921	99.02%	0.033568 ±0.000352	93.482 ±0.065
granite-4.1-30b-Q6_K	6.5000	23.5	7.548977 ±0.059618	99.48%	0.013554 ±0.000165	95.659 ±0.054
granite-4.1-30b-Q7_K	7.5000	27.1	7.514968 ±0.059300	99.67%	0.005559 ±0.000060	96.984 ±0.045
granite-4.1-30b-Q8_0	8.5000	30.7	7.496188 ±0.059053	99.73%	0.003085 ±0.000041	97.636 ±0.040

ARC, GPQA-Diamond, HellaSwag, MMLU-Redux, Truthful QA, and WinoGrande scores

Scores generated using llama-perplexity with 750 tasks per test, and a context size of 1024 tokens.

For the test data used in the generation of these scores, follow the appropriate links: ARC Challenge, Truthful QA, GPQA-Diamond, HellaSwag, MMLU-Redux, WinoGrande

Model	ARC Challenge	GPQA-Diamond	HellaSwag	MMLU-Redux	Truthful QA	WinoGrande	Avg Score
granite-4.1-30b-Q1_L	50.0000 ±1.8270	27.2727 ±3.1731	59.87	41.8667 ±1.8026	28.2667 ±1.6453	63.3333 ±1.7608	45.10
granite-4.1-30b-Q2_K	69.6000 ±1.6807	26.7677 ±3.1544	77.07	69.4667 ±1.6828	34.0000 ±1.7309	71.0667 ±1.6569	57.99
granite-4.1-30b-Q3_K	70.2667 ±1.6702	21.7172 ±2.9377	81.73	76.6667 ±1.5454	35.0667 ±1.7436	73.4667 ±1.6132	59.82
granite-4.1-30b-Q4_K	72.5333 ±1.6309	26.7677 ±3.1544	82.53	78.2667 ±1.5070	36.5333 ±1.7594	76.0000 ±1.5605	62.11
granite-4.1-30b-Q5_K	72.8000 ±1.6260	26.2626 ±3.1353	83.20	76.9333 ±1.5392	36.6667 ±1.7608	75.2000 ±1.5780	61.84
granite-4.1-30b-Q6_K	74.0000 ±1.6027	26.2626 ±3.1353	83.33	77.3333 ±1.5298	37.3333 ±1.7674	76.0000 ±1.5605	62.38
granite-4.1-30b-Q7_K	73.8667 ±1.6054	25.2525 ±3.0954	83.07	77.0667 ±1.5361	37.2000 ±1.7661	75.4667 ±1.5722	61.99
granite-4.1-30b-Q8_0	73.7333 ±1.6080	26.2626 ±3.1353	82.93	77.7333 ±1.5202	37.4667 ±1.7686	76.5333 ±1.5485	62.44
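The ± values in the table above appear consistent with a binomial standard error computed with n−1 in the denominator. That is an inference from the published numbers rather than something stated by the tooling, so the sketch below (with an illustrative helper name) should be read as a plausible reconstruction only. Note that the GPQA-Diamond figures match a task count of 198 rather than 750.

```python
import math


def accuracy_with_stderr(correct, total):
    """Return (accuracy %, standard error %) for `correct` answers out of `total` tasks."""
    p = correct / total
    stderr = math.sqrt(p * (1.0 - p) / (total - 1))
    return 100.0 * p, 100.0 * stderr


# 375 of 750 correct reproduces the Q1_L ARC Challenge entry (50.0000 ±1.8270).
acc, se = accuracy_with_stderr(375, 750)
print(f"{acc:.4f} ±{se:.4f}")

# 54 of 198 correct reproduces the Q1_L GPQA-Diamond entry (27.2727 ±3.1731).
acc_gpqa, se_gpqa = accuracy_with_stderr(54, 198)
print(f"{acc_gpqa:.4f} ±{se_gpqa:.4f}")
```

The wider interval on GPQA-Diamond follows directly from its smaller task count, which is worth keeping in mind when comparing small differences between quants on that benchmark.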

Tokens per second benchmarks

Scores generated using llama-bench. Standard (llama-quantize with no optimization) Q4_K_M quantization included for comparison.

model	size	params	backend	threads	test	t/s
granite-4.1-30b-Q1_L	5.88 GiB	28.87 B	BLAS,MTL	12	pp512	223.27 ±4.30
granite-4.1-30b-Q1_L	5.88 GiB	28.87 B	BLAS,MTL	12	tg128	23.55 ±0.10
granite-4.1-30b-Q1_L	5.88 GiB	28.87 B	BLAS,MTL	12	pp1024+tg1024	37.66 ±0.80
granite-4.1-30b-Q2_K	8.40 GiB	28.87 B	BLAS,MTL	12	pp512	216.00 ±10.54
granite-4.1-30b-Q2_K	8.40 GiB	28.87 B	BLAS,MTL	12	tg128	22.90 ±0.14
granite-4.1-30b-Q2_K	8.40 GiB	28.87 B	BLAS,MTL	12	pp1024+tg1024	37.39 ±0.51
granite-4.1-30b-Q3_K	11.76 GiB	28.87 B	BLAS,MTL	12	pp512	217.79 ±7.67
granite-4.1-30b-Q3_K	11.76 GiB	28.87 B	BLAS,MTL	12	tg128	20.47 ±0.22
granite-4.1-30b-Q3_K	11.76 GiB	28.87 B	BLAS,MTL	12	pp1024+tg1024	35.03 ±0.92
granite-4.1-30b-Q4_K	15.12 GiB	28.87 B	BLAS,MTL	12	pp512	206.88 ±3.08
granite-4.1-30b-Q4_K	15.12 GiB	28.87 B	BLAS,MTL	12	tg128	20.41 ±0.69
granite-4.1-30b-Q4_K	15.12 GiB	28.87 B	BLAS,MTL	12	pp1024+tg1024	34.29 ±1.15
granite-4.1-30b-Q5_K	18.48 GiB	28.87 B	BLAS,MTL	12	pp512	202.34 ±6.59
granite-4.1-30b-Q5_K	18.48 GiB	28.87 B	BLAS,MTL	12	tg128	18.18 ±0.23
granite-4.1-30b-Q5_K	18.48 GiB	28.87 B	BLAS,MTL	12	pp1024+tg1024	31.66 ±0.55
granite-4.1-30b-Q6_K	21.84 GiB	28.87 B	BLAS,MTL	12	pp512	202.98 ±4.57
granite-4.1-30b-Q6_K	21.84 GiB	28.87 B	BLAS,MTL	12	tg128	16.63 ±0.39
granite-4.1-30b-Q6_K	21.84 GiB	28.87 B	BLAS,MTL	12	pp1024+tg1024	28.25 ±0.75
granite-4.1-30b-Q7_K	25.20 GiB	28.87 B	BLAS,MTL	12	pp512	176.97 ±1.04
granite-4.1-30b-Q7_K	25.20 GiB	28.87 B	BLAS,MTL	12	tg128	15.95 ±0.06
granite-4.1-30b-Q7_K	25.20 GiB	28.87 B	BLAS,MTL	12	pp1024+tg1024	27.68 ±0.30
granite-4.1-30b-Q8_0	28.56 GiB	28.87 B	BLAS,MTL	12	pp512	185.67 ±5.73
granite-4.1-30b-Q8_0	28.56 GiB	28.87 B	BLAS,MTL	12	tg128	14.76 ±0.02
granite-4.1-30b-Q8_0	28.56 GiB	28.87 B	BLAS,MTL	12	pp1024+tg1024	26.01 ±0.17

Metrics used

Perplexity: one of the key metrics used in NLP evaluation. It measures the quality of a language model by evaluating how well it predicts the next token given a particular sequence of words. A PPL of 1 indicates an exact match between predicted and actual tokens, whereas values greater than one indicate a degree of "surprise", i.e. how far the model's predictions are from the tokens actually observed.
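As a worked illustration, perplexity is the exponential of the average negative log-probability the model assigns to the observed tokens. The sketch below uses a hypothetical helper and made-up probabilities; it is not how llama-perplexity is implemented internally.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of the observed tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that is always certain and correct (probability 1 for every token).
print(perplexity([0.0, 0.0, 0.0]))  # 1.0

# A model that assigns probability 0.25 to every observed token behaves like
# a uniform guess among four options.
ppl_uniform4 = perplexity([math.log(0.25)] * 4)
print(ppl_uniform4)
```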

Kullback–Leibler (KL) Divergence: a statistical measure of how much one probability distribution differs from another. When quantizing models (or altering the original tensors in any way for that matter), the closer the model's output probability distribution stays to that of the original model, the better; thus, the closer to 0, the better.
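For discrete distributions, the KL divergence of Q from P is the sum over outcomes of P(x)·ln(P(x)/Q(x)). The sketch below assumes P is the baseline model's next-token distribution and Q the quantized model's; the helper name, the `eps` guard, and the toy values are illustrative assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against division by zero in q.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


baseline = [0.70, 0.20, 0.10]    # hypothetical next-token probabilities from the F16 model
close_quant = [0.68, 0.21, 0.11]  # a quantized model that barely changes them
rough_quant = [0.40, 0.35, 0.25]  # a quantized model that shifts them noticeably

kld_same = kl_divergence(baseline, baseline)
kld_close = kl_divergence(baseline, close_quant)
kld_rough = kl_divergence(baseline, rough_quant)
print(kld_same, kld_close, kld_rough)
```

Identical distributions give exactly 0, and the value grows as the quantized distribution drifts away from the baseline, which is the behaviour the μKLD column in the results table summarizes as an average.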

AI2 Reasoning Challenge (ARC): a benchmark to evaluate the ability of AI models to answer complex science questions that require logical reasoning beyond pattern matching.

GPQA-Diamond: the 198-question "diamond" subset of GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.

HellaSwag: the Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations (bit of a mouthful!) is a benchmark designed to test commonsense natural language inference. It requires the model to predict the most likely ending of a sentence.

MMLU: the Massive Multitask Language Understanding benchmark evaluates LLMs’ general knowledge and problem-solving abilities across 57 subjects, including elementary mathematics, US history, computer science, and law.

Truthful QA: evaluates how well LLMs generate truthful responses to questions. It identifies whether AI models can avoid generating false or misleading information, particularly in areas where human knowledge is prone to misconceptions.

WinoGrande: based on the Winograd Schema Challenge, this is a natural language understanding task requiring models to resolve ambiguities in sentences involving pronoun references.

Credits

LLaMa C++ has a large and vibrant community of contributors (~1,600 last time I checked) that actively maintain and extend its functionality, adding new models and architectures almost as fast as they appear. Considering the breakneck speed at which the AI/ML field is advancing, this alone is a remarkable feat!

While I'm grateful to all contributors, I want to recognise three in particular:

Colin Kealty (Bartowski), for the many contributions and for being one of the best sources of high quality quantized models available on Hugging Face
Georgi Gerganov for his amazing work with llama.cpp and the ggml/gguf libraries
Iwan Kawrakow for being one of the key authors behind the many quantization algorithms and the imatrix functionality.

Paper for eaddario/granite-4.1-30b-GGUF

GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arXiv 2311.12022, published Nov 20, 2023)