imatrix Quantization of moonshotai/Kimi-K2.5

The quants in this collection REQUIRE the ik_llama.cpp fork, which supports ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc! Alternatively, get the "full quality" Q4_X from AesSedai/Kimi-K2.5-GGUF, which runs on both ik_llama.cpp and mainline. Thank you AesSedai for your efforts!!!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCPP, which has Windows builds for CUDA 12.9. Also check the Windows builds by Thireus, which have been built for CUDA 12.8.

These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Finally, I really appreciate all the support from aifoundry.org so check out their open source RISC-V solutions, and of course huggingface for hosting all these big quants!

Quant Collection

Perplexity computed against wiki.test.raw (lower is "better").

You can get the "full quality" Q4_X from AesSedai/Kimi-K2.5-GGUF.

Q4_X 543.617 GiB (4.549 BPW)
- Final estimate: PPL over 568 chunks for n_ctx=512 = 1.8235 +/- 0.00698
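
As a sanity check on the size figures in this collection: BPW (bits per weight) is the file size in bits divided by the number of weights, so each listed size and BPW pair implies a parameter count. The sketch below is only illustrative; the helper name `implied_params` is made up here, and the inputs are the Q4_X and IQ3_K figures from this page.

```shell
#!/usr/bin/env bash
# Implied parameter count (in trillions) = size_in_bits / bits_per_weight.
#   $1 = file size in GiB, $2 = BPW
# implied_params is a hypothetical helper name, not part of any tool.
implied_params() {
  awk -v gib="$1" -v bpw="$2" \
    'BEGIN { printf "%.2f\n", gib * 1073741824 * 8 / bpw / 1e12 }'
}

implied_params 543.617 4.549   # Q4_X
implied_params 459.432 3.845   # IQ3_K
```

Both calls print 1.03, i.e. roughly one trillion weights, which is what you would expect if every quant here is derived from the same base model.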

IQ3_K 459.432 GiB (3.845 BPW)

Final estimate: PPL over 568 chunks for n_ctx=512 = 1.8775 +/- 0.00727

Note: For this quant only, the imatrix was applied only to the ffn_(gate|up)_exps tensors, which are iq3_k.

👈 Secret Recipe

#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
## NOTE: imatrix is *only* applied to the iq3_k tensors for this recipe
blk\..*\.ffn_down_exps\.weight=q4_0
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k

token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/imatrix-Kimi-K2.5-Q4_X.dat \
    --include-weights ffn_gate_exps \
    --include-weights ffn_up_exps \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-IQ3_K.gguf \
    IQ3_K \
    128
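
If the `custom=$( ... )` step in the recipe looks cryptic, here is a small sketch of what it does, using a shortened, made-up rule list. It assumes GNU sed, since `-z` is a GNU extension; the variable name `flat` is only for illustration.

```shell
#!/usr/bin/env bash
# Toy rule list in the same layout as the recipe (shortened for illustration).
custom="
## Attention (comment lines start with '#')
blk\..*\.attn_k_b\.weight=q8_0

# Routed experts
blk\..*\.ffn_down_exps\.weight=q4_0
token_embd\.weight=iq6_k
"

# 1. grep -v '^#' drops the comment lines.
# 2. sed -z reads the whole input as one record, so each run of newlines can
#    be replaced by a single comma; the trailing and leading commas are then
#    trimmed.
flat=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$flat"
```

This prints `blk\..*\.attn_k_b\.weight=q8_0,blk\..*\.ffn_down_exps\.weight=q4_0,token_embd\.weight=iq6_k`, a single comma-separated list of `regex=type` rules, which is the form passed to `--custom-q`.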

smol-IQ3_KS 388.258 GiB (3.249 BPW)

Final estimate: PPL over 568 chunks for n_ctx=512 = 1.9562 +/- 0.00772

👈 Secret Recipe

#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/imatrix-Kimi-K2.5-Q4_X.dat \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-smol-IQ3_KS.gguf \
    IQ3_KS \
    128

smol-IQ2_KL 329.195 GiB (2.755 BPW)

Final estimate: PPL over 568 chunks for n_ctx=512 = 2.1813 +/- 0.00899

👈 Secret Recipe

#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/imatrix-Kimi-K2.5-Q4_X.dat \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-smol-IQ2_KL.gguf \
    IQ2_KL \
    128

smol-IQ2_KS 270.133 GiB (2.261 BPW)

Final estimate: PPL over 568 chunks for n_ctx=512 = 2.6209 +/- 0.01158

👈 Secret Recipe

#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/imatrix-Kimi-K2.5-Q4_X.dat \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-smol-IQ2_KS.gguf \
    IQ2_KS \
    128

smol-IQ1_KT 218.936 GiB (1.832 BPW)

Final estimate: PPL over 568 chunks for n_ctx=512 = 3.2450 +/- 0.01540

only for the desperate

Also keep in mind that KT trellis quants generally have slower token generation on CPU, where they are likely compute-bottlenecked, but if it is all you can fit then, well... They are fast on GPU, similar to EXL3.

👈 Secret Recipe

#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq1_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/imatrix-Kimi-K2.5-Q4_X.dat \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2.5-GGUF/Kimi-K2.5-smol-IQ1_KT.gguf \
    IQ1_KT \
    128
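
To compare the quants at a glance, the sketch below puts the size and perplexity figures listed in this collection side by side, relative to the Q4_X baseline. The numbers are copied from the listings above; the helper name `ppl_table` and the output layout are just illustrative choices.

```shell
#!/usr/bin/env bash
# name  size_GiB  PPL  (copied from the listings in this collection)
quants="Q4_X 543.617 1.8235
IQ3_K 459.432 1.8775
smol-IQ3_KS 388.258 1.9562
smol-IQ2_KL 329.195 2.1813
smol-IQ2_KS 270.133 2.6209
smol-IQ1_KT 218.936 3.2450"

# The first row is the baseline; every row is reported relative to it.
# ppl_table is a hypothetical helper name.
ppl_table() {
  echo "$quants" | awk '
    NR == 1 { base_size = $2; base_ppl = $3 }
    { printf "%-12s %5.1f%% of size  %+6.2f%% PPL\n",
             $1, 100 * $2 / base_size, 100 * ($3 / base_ppl - 1) }'
}

ppl_table
```

For example, IQ3_K comes out at about 84.5% of the Q4_X size for roughly +2.96% perplexity. Keep in mind that perplexity on wiki.test.raw is only one proxy for quality.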

Quick Start

# Clone and checkout
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp

# Build for hybrid CPU+CUDA (or set GGML_CUDA=OFF for CPU only)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

# Hybrid CPU+GPU
echo TODO

# Run CPU-only on a single NUMA node, e.g. NPS1
# Set $SOCKET to the NUMA node to bind to and $model to the path of the first GGUF split
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Kimi-K2.5-GGUF \
    --merge-qkv \
    --ctx-size 131072 \
    -ctk q8_0 \
    -mla 3 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja \
    --special \
    --chat-template-file ./models/templates/Kimi-K2-Thinking.jinja

#    --validate-quants

NOTE: I still need to read up on what people are doing for the chat template. To get my pydantic-ai tool calling test working, I had to use the old Kimi-K2-Thinking template together with --jinja --special. Open a comment and share your tool-use configuration. You might also have luck with jukofyork's suggestion.
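
Once the server is up, a quick smoke test might look like the sketch below. It assumes the server exposes the usual OpenAI-compatible `/v1/chat/completions` route on the `--host` and `--port` given above, and that the `--alias` value is accepted as the model name. The helper names `build_payload` and `chat` are made up for this example.

```shell
#!/usr/bin/env bash
# Assumed to match the --host/--port/--alias flags of the llama-server command.
BASE_URL="http://127.0.0.1:8080"
MODEL="ubergarm/Kimi-K2.5-GGUF"

# Build a minimal chat-completions request body for a single user message.
# The prompt is inserted verbatim, so it must not contain double quotes or
# backslashes (this sketch does no JSON escaping).
build_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' \
    "$MODEL" "$1"
}

# Send the request to the server (hypothetical helper).
chat() {
  curl -s "$BASE_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1")"
}

# Usage (requires the server to be running):
#   chat "What is the capital of France?"
```

If the request returns JSON with a `choices` array, the server and chat template are at least wired up; tool calling still needs to be checked separately.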

Q4_X Patch

jukofyork's patch below is applied before running llama-quantize to make the "full quality" Q4_X, which you can download from AesSedai. I didn't upload the Q4_X I made and used for imatrix, but it should be similar. I used the original llm-compressor safetensors and AesSedai's PR linked in references below to create it with mainline llama.cpp. No need for end users to do anything; this patch is only required during the quantization step.

https://github.com/ggml-org/llama.cpp/pull/17064#issuecomment-3521098057

diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index 20a9831b..05feef4f 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -689,7 +689,7 @@ void quantize_row_q4_0_ref(const float * restrict x, block_q4_0 * restrict y, in
             }
         }

-        const float d  = max / -8;
+        const float d  = max / -7;
         const float id = d ? 1.0f/d : 0.0f;

         y[i].d = GGML_FP32_TO_FP16(d);
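
To see what the one-line change does, here is a toy round trip in awk that mimics the q4_0 reference rounding (scale `d`, shift by 8.5, clamp to 15). It is a sketch, not the ggml code: the helper name `q4_0_max_err` and the block values are made up, and it assumes, purely for illustration, that the source weights sit on a symmetric grid of 15 levels (k times a step, with k from -7 to 7).

```shell
#!/usr/bin/env bash
# Round-trip one toy block through q4_0-style quantization.
#   $1    = divisor magnitude (8 = stock, 7 = patched)
#   $2... = block values
# Prints the largest absolute round-trip error in the block.
q4_0_max_err() {
  local div="$1"; shift
  printf '%s\n' "$@" | awk -v div="$div" '
    # Track the signed value with the largest magnitude, like the C loop.
    { x[NR] = $1; a = ($1 < 0) ? -$1 : $1; if (a > amax) { amax = a; max = $1 } }
    END {
      d  = max / -div                 # block scale (the line the patch edits)
      id = (d != 0) ? 1 / d : 0
      for (i = 1; i <= NR; i++) {
        q = int(x[i] * id + 8.5)      # shift into the 0..15 code range
        if (q > 15) q = 15            # clamp, like MIN(15, ...)
        err = x[i] - (q - 8) * d      # dequantize and compare
        if (err < 0) err = -err
        if (err > worst) worst = err
      }
      printf "%.4f\n", worst
    }'
}

# Toy block on a symmetric grid k * 0.5 with k in [-7, 7] (made-up values).
q4_0_max_err 8 -3.5 -1.5 0 1 3.5   # stock divisor
q4_0_max_err 7 -3.5 -1.5 0 1 3.5   # patched divisor
```

With the stock divisor the worst error on this block is 0.4375, because the grid is stretched over 16 codes and the top value gets clamped. With the patched divisor the block round-trips exactly (0.0000). That would be consistent with the "full quality" label if the original weights are on such a grid, but that last part is my assumption rather than something stated on this page.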

References
