How to use from llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf bombman/MiniCPM5-1B-GGUF:
# Run inference directly in the terminal:
llama-cli -hf bombman/MiniCPM5-1B-GGUF:
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf bombman/MiniCPM5-1B-GGUF:
# Run inference directly in the terminal:
llama-cli -hf bombman/MiniCPM5-1B-GGUF:
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf bombman/MiniCPM5-1B-GGUF:
# Run inference directly in the terminal:
./llama-cli -hf bombman/MiniCPM5-1B-GGUF:
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf bombman/MiniCPM5-1B-GGUF:
# Run inference directly in the terminal:
./build/bin/llama-cli -hf bombman/MiniCPM5-1B-GGUF:
Use Docker
docker model run hf.co/bombman/MiniCPM5-1B-GGUF:

MiniCPM5-1B GGUF

Optimized GGUF quantizations of MiniCPM5-1B for local inference, AI agents, coding assistants, and edge deployment.

Overview

This repository provides carefully converted and validated GGUF versions of MiniCPM5-1B using the latest llama.cpp conversion pipeline.

The goal is not to replace large language models.

The goal is to reduce how often you need them.

MiniCPM5-1B is small enough to run almost anywhere while remaining useful for:

  • AI Agents
  • Tool Calling
  • Intent Classification
  • Workflow Planning
  • Local Coding Assistants
  • Edge AI Deployment
  • Automation Systems
  • Local Development Tools

Available Quantizations

File                       Use Case
MiniCPM5-1B-f16.gguf       Maximum Quality
MiniCPM5-1B-Q8_0.gguf      High Quality Inference
MiniCPM5-1B-Q4_K_M.gguf    Recommended Daily Driver
MiniCPM5-1B-Q4_K_S.gguf    Faster Inference
MiniCPM5-1B-IQ4_XS.gguf    Edge Devices & Low Memory Systems

Why This Repository?

Many GGUF repositories simply convert and upload files.

This repository focuses on practical deployment.

Every release is:

  • Converted using the latest llama.cpp
  • Metadata verified
  • Runtime tested
  • Organized for easy deployment
  • Designed for local AI workflows

Recommended Usage

MiniCPM5-1B performs best as a lightweight component in a larger AI system.

Agent Router

Classify user requests and route them to the appropriate model or tool.
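A minimal sketch of this pattern: the small model only emits a label, and ordinary code decides where the request goes, falling back to a default when the output is not a known label. The route names, the prompt wording, and the `classify` callable (any function that sends a prompt to minicpm5-1b and returns its text) are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical route labels; replace with your own tools or models.
ROUTES = ("code", "search", "chat")

def build_router_prompt(user_request: str) -> str:
    """Ask the small model for exactly one label and nothing else."""
    return (
        "Classify the request into one of: " + ", ".join(ROUTES) + ".\n"
        "Answer with the label only.\n\n"
        f"Request: {user_request}"
    )

def parse_route(model_output: str, default: str = "chat") -> str:
    """Normalize the model's answer; fall back to a default on anything unexpected."""
    label = model_output.strip().lower().strip(".\"'")
    return label if label in ROUTES else default

def route(user_request: str, classify: Callable[[str], str],
          handlers: Dict[str, Callable[[str], str]]) -> str:
    """classify: hypothetical function that queries the model and returns its text."""
    label = parse_route(classify(build_router_prompt(user_request)))
    return handlers[label](user_request)

if __name__ == "__main__":
    # Stand-in classifier so the sketch runs without a server.
    fake_classify = lambda prompt: " Code.\n"
    handlers = {name: (lambda req, n=name: f"[{n}] {req}") for name in ROUTES}
    print(route("Fix this Python traceback", fake_classify, handlers))
```

Keeping the label vocabulary small and validating it in code means a 1B model's occasional formatting slip degrades to the default route instead of an error.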

Tool Caller

Generate structured tool calls before handing execution to external systems.
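Before a generated tool call is executed, it is worth validating it. This sketch assumes the model was prompted to answer with a JSON object of the form {"tool": ..., "arguments": {...}}; that schema and the tool registry are hypothetical, not a format defined by MiniCPM5 or llama.cpp.

```python
import json
from typing import Any, Dict, Optional, Set

# Hypothetical tool registry: tool name -> required argument names.
TOOLS: Dict[str, Set[str]] = {
    "get_weather": {"city"},
    "search_files": {"query"},
}

def extract_tool_call(model_output: str) -> Optional[Dict[str, Any]]:
    """Take the text from the first '{' to the last '}' and validate it.

    Returns None when the reply is not a usable tool call, so the caller
    can retry or fall back instead of executing something malformed.
    """
    start, end = model_output.find("{"), model_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        call = json.loads(model_output[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    name, args = call.get("tool"), call.get("arguments")
    if not isinstance(name, str) or name not in TOOLS or not isinstance(args, dict):
        return None
    if not TOOLS[name].issubset(args):
        return None  # a required argument is missing
    return {"tool": name, "arguments": args}

if __name__ == "__main__":
    reply = 'Sure: {"tool": "get_weather", "arguments": {"city": "Berlin"}}'
    print(extract_tool_call(reply))
```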

Local Coding Assistant

Provide fast code suggestions and lightweight programming support.

Intent Classifier

Handle command parsing, task classification, and workflow orchestration.

Edge AI

Deploy on:

  • Mini PCs
  • SBCs
  • Embedded Systems
  • Local Servers
  • Privacy-Focused Environments

Using with llama.cpp

Interactive Chat

llama-cli \
  -m MiniCPM5-1B-Q4_K_M.gguf

OpenAI-Compatible API Server

llama-server \
  -m MiniCPM5-1B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080

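API base URL (OpenAI-compatible routes such as /v1/chat/completions are served under it; --host 0.0.0.0 also makes the server reachable from other machines on your network):

http://localhost:8080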
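A minimal standard-library client for the server started above, assuming it runs on localhost:8080 and serves llama-server's OpenAI-compatible /v1/chat/completions route. The sketch omits the model field, since llama-server answers with the model it loaded; the prompt text is illustrative.

```python
import json
import urllib.request

SERVER = "http://localhost:8080"  # matches the --port used above

def build_chat_request(prompt: str, temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST to llama-server's OpenAI-compatible chat endpoint."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        SERVER + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send one prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Summarize what a GGUF file is in one sentence."))
```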

Using with Ollama

1. Create Model Directory

mkdir MiniCPM5-Ollama
cd MiniCPM5-Ollama

Copy your GGUF file:

cp ../MiniCPM5-1B-Q4_K_M.gguf .

2. Create Modelfile

Create a file named Modelfile, for example:

FROM ./MiniCPM5-1B-Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

SYSTEM """
You are a helpful AI assistant.
"""

3. Build Ollama Model

ollama create minicpm5-1b -f Modelfile

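Verify installation:

ollama list

Expected output (abbreviated; ollama list also shows ID and MODIFIED columns, and appends the default :latest tag to the name):

NAME                  SIZE
minicpm5-1b:latest    0.7 GB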
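To run the same check from a script, Ollama's /api/tags endpoint returns the locally available models as JSON. This sketch assumes Ollama on its default address http://localhost:11434; the helper names are illustrative.

```python
import json
import urllib.request

def model_names(tags: dict) -> set:
    """Collect model names from an Ollama /api/tags response body."""
    return {m["name"] for m in tags.get("models", [])}

def is_installed(tags: dict, name: str) -> bool:
    """Ollama appends ':latest' when no tag was given to 'ollama create'."""
    wanted = name if ":" in name else name + ":latest"
    return wanted in model_names(tags)

if __name__ == "__main__":
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=10) as resp:
        tags = json.load(resp)
    print("minicpm5-1b installed:", is_installed(tags, "minicpm5-1b"))
```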

4. Run Interactive Chat

ollama run minicpm5-1b

Example:

>>> Explain Rust ownership in simple terms.

5. Run with Custom Prompt

ollama run minicpm5-1b "Write a Python function to reverse a string."

6. Generate via API

Start Ollama (skip this if Ollama already runs as a background service):

ollama serve

Generate text (by default the reply streams back as one JSON object per line):

curl http://localhost:11434/api/generate \
-d '{
  "model":"minicpm5-1b",
  "prompt":"Explain what an AI Agent is."
}'
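Because /api/generate streams newline-delimited JSON by default, each line carries a "response" fragment and the final object has "done": true. A standard-library sketch that reassembles the text, assuming Ollama on its default port:

```python
import json
import urllib.request
from typing import Iterable, Union

def collect_stream(lines: Iterable[Union[bytes, str]]) -> str:
    """Join the "response" fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

if __name__ == "__main__":
    body = json.dumps({"model": "minicpm5-1b",
                       "prompt": "Explain what an AI Agent is."}).encode("utf-8")
    req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        # Iterating the HTTP response yields one line (one JSON object) at a time.
        print(collect_stream(resp))
```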

7. Chat API

curl http://localhost:11434/api/chat \
-d '{
  "model":"minicpm5-1b",
  "messages":[
    {
      "role":"user",
      "content":"Write a simple Rust web server."
    }
  ]
}'
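/api/chat is stateless: the server only sees the messages you send, so a multi-turn conversation means resending the history on every call. A sketch of that bookkeeping with "stream": false, where the reply's "message" object is appended to the history; the follow-up question is illustrative.

```python
import json
import urllib.request
from typing import Dict, List

Message = Dict[str, str]

def chat_payload(history: List[Message], user_text: str) -> dict:
    """Append the new user turn and build a non-streaming /api/chat body."""
    history.append({"role": "user", "content": user_text})
    return {"model": "minicpm5-1b", "messages": history, "stream": False}

def record_reply(history: List[Message], reply: dict) -> str:
    """Store the assistant message so the next turn sees the whole conversation."""
    message = reply["message"]
    history.append({"role": message["role"], "content": message["content"]})
    return message["content"]

def send(payload: dict) -> dict:
    """POST the payload to a local Ollama server and decode the JSON reply."""
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)

if __name__ == "__main__":
    history: List[Message] = []
    for question in ["Write a simple Rust web server.", "Now add a /health route."]:
        print(record_reply(history, send(chat_payload(history, question))))
```

With an 8192-token context, long conversations eventually need trimming; dropping the oldest turns from `history` is the simplest option.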

8. Python Example

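import requests

# /api/generate streams newline-delimited JSON by default;
# "stream": False returns a single JSON object that .json() can parse.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "minicpm5-1b",
        "prompt": "Explain tool calling.",
        "stream": False,
    },
    timeout=300,
)
response.raise_for_status()

print(response.json()["response"])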

9. OpenAI SDK Compatibility

Ollama provides an OpenAI-compatible API.

pip install openai

Python example:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

response = client.chat.completions.create(
    model="minicpm5-1b",
    messages=[
        {
            "role": "user",
            "content": "Explain AI agents."
        }
    ]
)

print(response.choices[0].message.content)
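The same SDK can stream tokens when stream=True is passed; each chunk then carries a partial delta.content, which can be None. This sketch keeps the text extraction separate from the client so it can be reused, and assumes the openai package plus the Ollama setup above.

```python
from typing import Any, Iterable, Iterator

def iter_text(chunks: Iterable[Any]) -> Iterator[str]:
    """Yield the text fragments from streamed chat.completions chunks.

    Chunks can arrive with an empty choices list or a None delta.content
    (for example the final chunk), so those are skipped.
    """
    for chunk in chunks:
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece

if __name__ == "__main__":
    from openai import OpenAI  # requires: pip install openai

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    stream = client.chat.completions.create(
        model="minicpm5-1b",
        messages=[{"role": "user", "content": "Explain AI agents."}],
        stream=True,
    )
    for piece in iter_text(stream):
        print(piece, end="", flush=True)
    print()
```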

This allows MiniCPM5-1B to work with:

  • LangGraph
  • CrewAI
  • Flowise
  • OpenHands
  • OpenManus
  • n8n
  • Dify
  • AutoGen
  • Custom Agent Frameworks

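typically by pointing each tool's OpenAI-compatible model settings at http://localhost:11434/v1 with the model name minicpm5-1b, rather than by changing code.
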
10. Open WebUI

Start Open WebUI:

docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main

Open:

http://localhost:3000

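By default, Open WebUI looks for the Ollama instance running on the host at:

http://localhost:11434

Inside the container the host is reached as host.docker.internal, which is what the --add-host flag above maps.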

Example Agent-Oriented Modelfile

FROM ./MiniCPM5-1B-Q4_K_M.gguf

PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

SYSTEM """
You are an AI agent specialized in:

- Task planning
- Tool selection
- Intent classification
- Workflow orchestration

Respond clearly and efficiently.
"""

Recommended Settings

AI Agents

temperature = 0.2
top_p = 0.9
num_ctx = 8192

Coding

temperature = 0.1
top_p = 0.95
num_ctx = 8192

General Chat

temperature = 0.7
top_p = 0.9
num_ctx = 8192
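These profiles can be baked into a Modelfile as PARAMETER lines, or sent per request in the "options" field of Ollama's API, which overrides the Modelfile defaults for that call. A sketch of both, assuming the model name minicpm5-1b; the helper names are illustrative.

```python
import json

# The three profiles listed under "Recommended Settings".
PROFILES = {
    "agent":  {"temperature": 0.2, "top_p": 0.9,  "num_ctx": 8192},
    "coding": {"temperature": 0.1, "top_p": 0.95, "num_ctx": 8192},
    "chat":   {"temperature": 0.7, "top_p": 0.9,  "num_ctx": 8192},
}

def generate_body(prompt: str, profile: str) -> str:
    """JSON body for /api/generate with the profile sent as per-request options."""
    return json.dumps({
        "model": "minicpm5-1b",
        "prompt": prompt,
        "stream": False,
        "options": dict(PROFILES[profile]),
    })

def modelfile(gguf: str, profile: str, system: str) -> str:
    """Render the same profile as Modelfile PARAMETER lines instead."""
    lines = [f"FROM {gguf}", ""]
    lines += [f"PARAMETER {k} {v}" for k, v in PROFILES[profile].items()]
    lines += ["", f'SYSTEM """\n{system}\n"""']
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(generate_body("Write a binary search in Python.", "coding"))
    print(modelfile("./MiniCPM5-1B-Q4_K_M.gguf", "agent", "You are an AI agent."))
```

Per-request options suit a single model serving several roles; separate Modelfiles suit tools that only let you pick a model name.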

Hardware Requirements

Approximate memory requirements:

Quant     Approx. Memory
IQ4_XS    ~0.5 GB
Q4_K_S    ~0.6 GB
Q4_K_M    ~0.7 GB
Q8_0      ~1.1 GB
F16       ~2.5 GB
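A small helper that picks the highest-quality quant fitting a memory budget, using the approximate figures above. The 0.5 GB headroom default is an assumption, not a measured value; larger context sizes need more memory for the KV cache.

```python
from typing import Optional

# Approximate figures from the table, ordered from highest to lowest quality.
QUANT_MEMORY_GB = [
    ("F16", 2.5),
    ("Q8_0", 1.1),
    ("Q4_K_M", 0.7),
    ("Q4_K_S", 0.6),
    ("IQ4_XS", 0.5),
]

def pick_quant(free_gb: float, headroom_gb: float = 0.5) -> Optional[str]:
    """Return the highest-quality quant that fits, keeping some memory spare.

    headroom_gb is an assumed safety margin for the OS, the KV cache and
    other processes; tune it for your device.
    """
    for quant, need in QUANT_MEMORY_GB:
        if need + headroom_gb <= free_gb:
            return quant
    return None  # nothing fits with the requested headroom

if __name__ == "__main__":
    for free in (4.0, 1.5, 1.0, 0.8):
        print(free, "GB free ->", pick_quant(free))
```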

Recommended Quant

For most users:

MiniCPM5-1B-Q4_K_M.gguf

It provides the best balance of:

  • Memory Usage
  • Speed
  • Quality

Acknowledgements

Original model:

OpenBMB - MiniCPM5-1B

Converted to the GGUF format using llama.cpp.

All credit for model architecture and training belongs to the original authors.


License

Please follow the original MiniCPM5 license and usage terms.

Format: GGUF
Model size: 1B params
Architecture: llama