
# Kimi-K2.5-PRISM

An unrestricted, unchained PRISM version of Moonshot AI's Kimi-K2.5, with over-refusal and propaganda mechanisms removed using our PRISM pipeline (Projected Refusal Isolation via Subspace Modification).

## ☕ Model Purchase

The Kimi-K2.5-PRISM tensors are available for purchase here: https://ko-fi.com/s/64a50000a4

## ☕ Support Our Work

If you enjoy our work and find it useful, please sponsor and support us!

| Option | Description |
|---|---|
| PRISM VIP Membership | Day-0 access to all PRISM models |
| One-Time Support | Purchase this model |

## Model Highlights

- **PRISM Ablation:** state-of-the-art technique that removes over-refusal behaviors while preserving model capabilities (a sketch of the underlying idea follows this list)
- **1T MoE Architecture:** 1 trillion total parameters with 32 billion activated per token, routed across 384 experts
- **Native Multimodal:** pre-trained on vision-language tokens for seamless image, video, and text understanding
- **256K Context Window:** extended context for complex agentic tasks and large codebases
- **Dual Modes:** supports both Thinking (deep reasoning) and Instant (fast response) modes
- **Agent Swarm:** self-directed, coordinated multi-agent execution for complex tasks
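
The PRISM pipeline itself is unpublished. Purely for intuition, here is a minimal, hypothetical sketch of the directional-ablation family of techniques it builds on, in which a "refusal direction" is estimated from paired activations and projected out of the weights. All names and shapes below are illustrative, not the actual PRISM implementation:

```python
import torch

def refusal_direction(h_refuse: torch.Tensor, h_comply: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between residual-stream activations
    captured on refused vs. complied prompts, each (num_prompts, hidden_dim).
    Illustrative only."""
    d = h_refuse.mean(dim=0) - h_comply.mean(dim=0)
    return d / d.norm()

def ablate_direction(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component along d from every output of W, where W is a
    (hidden_dim, in_dim) weight that writes into the residual stream.
    Equivalent to left-multiplying W by the projector (I - d d^T)."""
    return W - torch.outer(d, d @ W)
```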

## Model Architecture

| Specification | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers | 61 |
| Attention Hidden Dimension | 7168 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M) |
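
As a toy illustration of the routing numbers above (8 of 384 experts selected per token), here is a minimal top-k softmax router sketch; the actual K2.5 router, its shared expert, and load-balancing losses are not reproduced:

```python
import torch

hidden_dim, num_experts, top_k = 7168, 384, 8

# Toy gate: scores every expert for every token.
gate = torch.nn.Linear(hidden_dim, num_experts, bias=False)

def route(tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """tokens: (seq_len, hidden_dim) -> (expert_ids, weights), each (seq_len, 8)."""
    logits = gate(tokens)                             # (seq_len, 384)
    weights, expert_ids = logits.topk(top_k, dim=-1)  # keep the 8 best experts
    weights = torch.softmax(weights, dim=-1)          # renormalize over those 8
    return expert_ids, weights

expert_ids, weights = route(torch.randn(4, hidden_dim))
print(expert_ids.shape, weights.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```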

## Benchmarks

| Benchmark | Kimi K2.5 (Thinking) | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
|---|---|---|---|---|
| AIME 2025 | 96.1 | 100 | 92.8 | 95.0 |
| GPQA-Diamond | 87.6 | 92.4 | 87.0 | 91.9 |
| HLE-Full | 30.1 | 34.5 | 30.8 | 37.5 |
| HLE-Full (w/ tools) | 50.2 | 45.5 | 43.2 | 45.8 |
| SWE-Bench Verified | 76.8 | 80.0 | 80.9 | 76.2 |
| Terminal Bench 2.0 | 50.8 | 54.0 | 59.3 | 54.2 |
| BrowseComp | 60.6 | 65.8 | 37.0 | 37.8 |
| MMMU-Pro | 78.5 | 79.5 | 74.0 | 81.0 |
| VideoMMMU | 86.6 | 85.9 | 84.4 | 87.6 |

## Usage

### Transformers

Install dependencies:

```bash
pip install git+https://github.com/huggingface/transformers.git
```

Basic chat completion:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Ex0bit/Kimi-K2.5-PRISM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are Kimi, an AI assistant."},
    {"role": "user", "content": "Hello!"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
output_text = tokenizer.decode(
    generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(output_text)
```

### Chat with Image

```python
import base64
import requests

# Fetch an image and inline it as a base64 data URL.
url = "https://example.com/image.png"
image_base64 = base64.b64encode(requests.get(url).content).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            },
        ],
    }
]

# Use the same generation code as above.
```
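
Note that the tokenizer alone cannot load pixels; for local multimodal inference the messages are usually run through the model's processor instead. A minimal sketch, assuming the repository ships an `AutoProcessor` whose chat template accepts OpenAI-style `image_url` entries (recent transformers releases normalize this format; if yours does not, rewrite each entry as `{"type": "image", "url": ...}`):

```python
from transformers import AutoProcessor

# Assumption: the repo provides a multimodal processor with a chat template.
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=4096)
print(processor.decode(
    generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```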

### vLLM

Install the vLLM nightly build:

```bash
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
```

Serve the model:

```bash
vllm serve Ex0bit/Kimi-K2.5-PRISM \
     --tensor-parallel-size 8 \
     --trust-remote-code \
     --served-model-name kimi-k2.5-prism
```

### SGLang

```bash
python3 -m sglang.launch_server \
  --model-path Ex0bit/Kimi-K2.5-PRISM \
  --tp-size 8 \
  --trust-remote-code \
  --served-model-name kimi-k2.5-prism \
  --host 0.0.0.0 \
  --port 8000
```
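
Both servers expose an OpenAI-compatible endpoint. A minimal client call (assuming the server is reachable at `localhost:8000`, which matches the SGLang invocation above and vLLM's default port):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="kimi-k2.5-prism",
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```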

## Recommended Parameters

| Mode | Temperature | Top-P | Max New Tokens |
|---|---|---|---|
| Thinking | 1.0 | 0.95 | 96000 |
| Instant | 0.6 | 0.95 | 4096 |

## Switching Modes

For Instant mode (faster, no reasoning), pass:

```python
# Official API
extra_body={"thinking": {"type": "disabled"}}

# vLLM/SGLang
extra_body={"chat_template_kwargs": {"thinking": False}}
```

## Hardware Requirements

Due to its 1T-parameter size, this model requires significant hardware:

- **Minimum:** 8x A100 80GB or equivalent
- **Recommended:** 8x H100 80GB for optimal performance
- **INT4 Quantization:** available for a reduced memory footprint (see the estimate below)
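
For a back-of-envelope sense of these requirements, here is the weights-only memory arithmetic at different precisions (KV cache and activations come on top):

```python
# Rough, illustrative weight-memory arithmetic for a 1T-parameter model.
total_params = 1e12

print(f"BF16 weights: ~{total_params * 2 / 1e12:.1f} TB")    # ~2.0 TB (2 bytes/param)
print(f"INT4 weights: ~{total_params * 0.5 / 1e12:.2f} TB")  # ~0.50 TB (0.5 bytes/param)
print(f"8x 80 GB GPU pool: {8 * 80} GB")                     # 640 GB

# Full-precision weights (~2 TB) exceed a single 8x80GB node (640 GB),
# so single-node deployments generally rely on quantization
# (e.g. the INT4 build) or weight offloading.
```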

## License

This model is released under the PRISM Research License.

## Acknowledgments

Based on Kimi-K2.5 by Moonshot AI. See the technical blog for more details on the base model.
