Heads-up: triple `language_model` prefix in safetensors keys (BF16 only — GGUF unaffected)

by sakamakismile - opened Apr 26

Discussion

sakamakismile

Apr 26

Hi kai-os, big fan of the Carnice/Hermes SFT — testing it for an NVFP4 variant on Blackwell.

While loading the BF16 weights (kai-os/Carnice-V2-27b) via HF transformers, every linear layer comes back as MISSING in the load report and an equal number of keys appear as UNEXPECTED. The model then runs with random weights and outputs gibberish. The GGUF variant is unaffected because convert_hf_to_gguf.py normalizes prefixes during conversion.

Root cause

The safetensors keys carry a triple language_model prefix:

# kai-os/Carnice-V2-27b shipped:
model.language_model.language_model.language_model.embed_tokens.weight
model.language_model.language_model.language_model.layers.0.input_layernorm.weight
model.language_model.visual.blocks.0.attn.proj.weight   # also one extra here

# What Qwen3_5ForConditionalGeneration expects:
model.language_model.embed_tokens.weight
model.language_model.layers.0.input_layernorm.weight
model.visual.blocks.0.attn.proj.weight

Likely an artifact of an Unsloth wrapper getting serialized into the key path more than once during the merge.
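
For anyone following along, here is a minimal sketch of why the load report shows equal numbers of MISSING and UNEXPECTED entries. It assumes the loader matches tensors strictly by name, and it uses only the three example keys listed above (the real checkpoint has far more). The `diff_keys` helper is hypothetical, not part of transformers.

```python
# Keys copied from the examples above; a real checkpoint has many more.
shipped = {
    "model.language_model.language_model.language_model.embed_tokens.weight",
    "model.language_model.language_model.language_model.layers.0.input_layernorm.weight",
    "model.language_model.visual.blocks.0.attn.proj.weight",
}
expected = {
    "model.language_model.embed_tokens.weight",
    "model.language_model.layers.0.input_layernorm.weight",
    "model.visual.blocks.0.attn.proj.weight",
}


def diff_keys(expected, shipped):
    """Return (missing, unexpected) as a strict name match would see them."""
    missing = sorted(expected - shipped)      # wanted by the model, absent from the file
    unexpected = sorted(shipped - expected)   # present in the file, unknown to the model
    return missing, unexpected


missing, unexpected = diff_keys(expected, shipped)
print(len(missing), "missing;", len(unexpected), "unexpected")
```

Because no shipped name matches an expected name, each tensor is counted once on each side, which is consistent with the symmetric report described above.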

Reproduction

from transformers import Qwen3_5ForConditionalGeneration
import torch
m = Qwen3_5ForConditionalGeneration.from_pretrained(
    "kai-os/Carnice-V2-27b", dtype=torch.bfloat16, trust_remote_code=True)
# Any prompt → gibberish, because every linear is random-init.

model.safetensors.index.json:

kai-os/Carnice-V2-27b: 1184 keys, all with the extra prefixes
Qwen/Qwen3.6-27B (your declared base): 1199 keys with standard prefixes (the 15 extra are MTP, dropped during your merge — that's a separate matter)
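
A quick way to check an index file for this kind of prefix problem is to group the tensor names by their leading components. The sketch below assumes the usual sharded-checkpoint layout, where `model.safetensors.index.json` holds a `weight_map` from tensor name to shard file. The helper names and the shard file name in the sample are hypothetical.

```python
import json
from collections import Counter


def load_weight_map(index_path):
    """Read the tensor-name -> shard-file mapping from an index JSON file."""
    with open(index_path, encoding="utf-8") as f:
        return json.load(f)["weight_map"]


def prefix_counts(weight_map, depth=3):
    """Count tensor names by their first `depth` dot-separated components."""
    return Counter(".".join(name.split(".")[:depth]) for name in weight_map)


# Tiny stand-in for a real index; the shard file name is made up.
sample = {
    "model.language_model.language_model.language_model.embed_tokens.weight": "shard-a.safetensors",
    "model.language_model.language_model.language_model.layers.0.input_layernorm.weight": "shard-a.safetensors",
    "model.language_model.visual.blocks.0.attn.proj.weight": "shard-a.safetensors",
}
counts = prefix_counts(sample)
for prefix, n in sorted(counts.items()):
    print(prefix, n)
```

On a downloaded checkpoint, `prefix_counts(load_weight_map("model.safetensors.index.json"))` would give the same breakdown. A repeated `language_model` component in the top groups is the symptom described in this thread.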

Fix

A single-pass safetensors rewrite recovers the model:

def fix_key(k: str) -> str:
    if k.startswith("model.language_model.language_model.language_model."):
        return "model.language_model." + k[len("model.language_model.language_model.language_model."):]
    if k.startswith("model.language_model.visual."):
        return "model.visual." + k[len("model.language_model.visual."):]
    return k

After this the model loads cleanly, IFEval results match your benchmark numbers, and Hermes-style tool calling works.
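
For reference, a minimal sketch of how the rename could be applied across a sharded checkpoint. This is not necessarily the exact script used here: it assumes the `safetensors` and `torch` packages are installed, that each shard fits in CPU memory, and that the index file is named `model.safetensors.index.json`. The names `fix_index` and `rewrite_checkpoint` are hypothetical.

```python
import json
from pathlib import Path

TRIPLE = "model.language_model.language_model.language_model."
VISUAL = "model.language_model.visual."


def fix_key(k: str) -> str:
    # Same mapping as the fix_key function in the post.
    if k.startswith(TRIPLE):
        return "model.language_model." + k[len(TRIPLE):]
    if k.startswith(VISUAL):
        return "model.visual." + k[len(VISUAL):]
    return k


def fix_index(index: dict) -> dict:
    """Return a copy of an index dict with every weight_map key renamed."""
    fixed = dict(index)
    fixed["weight_map"] = {fix_key(k): shard for k, shard in index["weight_map"].items()}
    if len(fixed["weight_map"]) != len(index["weight_map"]):
        raise ValueError("renaming produced duplicate keys")
    return fixed


def rewrite_checkpoint(src_dir: str, dst_dir: str) -> None:
    """Rename tensors in every shard of src_dir and write the result to dst_dir."""
    # Imported lazily so the pure-Python helpers above run without these packages.
    from safetensors.torch import load_file, save_file

    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for shard in sorted(src.glob("*.safetensors")):
        tensors = load_file(str(shard))  # whole shard is loaded onto the CPU
        renamed = {fix_key(k): v for k, v in tensors.items()}
        save_file(renamed, str(dst / shard.name), metadata={"format": "pt"})
    index_path = src / "model.safetensors.index.json"
    index = json.loads(index_path.read_text(encoding="utf-8"))
    (dst / index_path.name).write_text(json.dumps(fix_index(index), indent=2), encoding="utf-8")


# Dry run on an in-memory index; the shard file name is made up.
demo = {
    "metadata": {},
    "weight_map": {
        TRIPLE + "embed_tokens.weight": "shard-a.safetensors",
        VISUAL + "blocks.0.attn.proj.weight": "shard-a.safetensors",
    },
}
fixed = fix_index(demo)
print(sorted(fixed["weight_map"]))
```

The config, tokenizer, and other non-tensor files are not touched by this sketch and would need to be copied into the output directory separately.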

Why mention now

I've built an NVFP4 + MTP-grafted variant on top of the fixed weights for the RTX PRO 6000 / DGX Spark (GB10) crowd who want the Hermes agent at ~20 GB VRAM. Wanted to flag this here so anyone else loading the BF16 directly knows what's happening, and to credit you properly in the README of the downstream variant.

Thanks for the SFT — the assistant-token-only loss + the GLM-5.1 trace blend in the data mix really show through.

— Tonoken3 / Lna-Lab

kai-os

Owner Apr 26

Thanks for catching this. The core BF16 Transformers load bug should be resolved now.

kai-os changed discussion status to closed Apr 26
