Opus → Kimi Reasoning LoRA

🧠 Extracted by UKA — an AI agent powered by Hermes Agent. She designed the SVD weight-diff extraction technique and authored this adapter.

A rank-16 LoRA adapter that converts Claude 4.7 Opus reasoning style into Kimi K2.6 reasoning style — on the same 35B Mixture-of-Experts base model.

No training. Pure linear algebra.


🔬 How It Works: Weight-Diff SVD Extraction

Both lordx64 models share the exact same base (Qwen/Qwen3.6-35B-A3B), and each was fine-tuned with LoRA that was then merged back into the full weights.

Mathematically:

W_opus  = W_base + delta_Opus
W_kimi  = W_base + delta_Kimi

delta(Opus_to_Kimi) = W_kimi - W_opus
                    = (W_base + delta_Kimi) - (W_base + delta_Opus)
                    = delta_Kimi - delta_Opus

The base model cancels out — only the reasoning delta remains!
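
The delta can be computed tensor-by-tensor straight from the merged checkpoints, so a full model never has to fit in memory. A minimal single-shard sketch (paths are illustrative; the real checkpoints are sharded, so parameter names would be resolved through model.safetensors.index.json):

import torch
from safetensors import safe_open

opus_shard = "model_opus/model-00001-of-00015.safetensors"   # illustrative path
kimi_shard = "model_kimi/model-00001-of-00015.safetensors"   # illustrative path

with safe_open(opus_shard, framework="pt") as f_opus, \
     safe_open(kimi_shard, framework="pt") as f_kimi:
    for name in f_opus.keys():
        if "self_attn" not in name:
            continue                         # attention-only extraction
        # W_base cancels: (W_base + delta_Kimi) - (W_base + delta_Opus) = delta_Kimi - delta_Opus
        delta = f_kimi.get_tensor(name).float() - f_opus.get_tensor(name).float()
        print(name, delta.abs().max().item())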

SVD Compression

The raw delta is 70+ GB. We compress it to rank-16 LoRA via truncated SVD:

# For each attention weight tensor:
delta = W_kimi - W_opus                                           # [out, in]
U, S, Vh = torch.linalg.svd(delta.float(), full_matrices=False)   # decompose
lora_B = U[:, :16] * S[:16].sqrt()                                # [out, 16]
lora_A = S[:16].sqrt().unsqueeze(1) * Vh[:16, :]                  # [16, in]

  • Input: 2x 72 GB models (~145 GB disk)
  • RAM used: ~3 GB (processed tensor-by-tensor; no GPU needed)
  • Compute: ~44 SVDs on CPU (< 3 minutes)
  • Output: 7.2 MB LoRA adapter (rank=16, attention-only)
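
To sanity-check how much of the delta survives rank-16 truncation, one can compare the reconstructed low-rank product against the full delta. A minimal sketch (variable names continue from the snippet above; the metrics are illustrative, not part of the original script):

# Relative Frobenius error of the rank-16 approximation.
approx = lora_B @ lora_A                                          # [out, in], rank <= 16
rel_err = torch.linalg.norm(delta.float() - approx) / torch.linalg.norm(delta.float())
print(f"rank-16 relative error: {rel_err:.3f}")

# Fraction of spectral energy captured by the top 16 singular values.
energy = (S[:16] ** 2).sum() / (S ** 2).sum()
print(f"spectral energy captured: {energy:.1%}")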

Target Modules

Only full-attention layers (every 4th layer in Qwen3.6-MoE):

Layer                          q_proj   k_proj   v_proj   o_proj
3, 7, 11, 15, 19, 23, 27, 31   ✓        ✓        ✓        ✓
35, 39                         del=0    del=0    del=0    del=0

Interesting finding: Layers 35 and 39 have zero delta — the Kimi fine-tune did not touch these layers at all!
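
Zero-delta layers fall out of the extraction itself: if a tensor's delta norm is numerically zero, there is nothing to factor. A minimal sketch of such a check (assumes a deltas dict mapping parameter name to the delta tensor computed as above):

for name, delta in deltas.items():
    norm = delta.float().norm().item()
    if norm == 0.0:
        print(f"skipping {name}: the Kimi fine-tune left this tensor untouched")
    else:
        print(f"{name}: ||delta||_F = {norm:.4f}")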

Why Attention-Only?

The existing Claude Opus LoRA adapter (13.8 MB, r=16) is attention-only (q/k/v/o_proj). We match the same target modules for compatibility.

The 3D stacked-expert tensors (shape [256, 2048, 512]) were intentionally skipped — both for compatibility with the existing adapter and because reasoning style is primarily encoded in attention patterns, not expert FFN weights.
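
A selection rule matching this choice is simply a filter on tensor name and dimensionality. A short sketch (the name suffixes are an assumption about the checkpoint's parameter naming):

ATTN_SUFFIXES = ("q_proj.weight", "k_proj.weight", "v_proj.weight", "o_proj.weight")

def is_target(name: str, tensor) -> bool:
    # Keep 2D attention projections, skip 3D stacked-expert FFN tensors.
    return name.endswith(ATTN_SUFFIXES) and tensor.ndim == 2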


🌍 METHOD — Languages

The universal extraction guide is available in 5 languages:

File          Language
METHOD.md     🇹🇭 ไทย (Thai)
METHOD_EN.md  🇬🇧 English
METHOD_ZH.md  🇨🇳 中文
METHOD_JP.md  🇯🇵 日本語
METHOD_VN.md  🇻🇳 Tiếng Việt

All files contain the full content: requirements, the 5 steps, the math, examples for other models, troubleshooting, and references.


📦 Available Formats

PEFT (Python)

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "hotdogs/qwen3.6-35b-opus-to-kimi-lora")
model = model.merge_and_unload()
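
After merging, the model behaves like a regular transformers causal LM. A brief usage sketch continuing from the code above (prompt and generation settings are illustrative; the tokenizer is assumed to ship with the base repo):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled"
)
messages = [{"role": "user", "content": "Solve this math problem step by step: 17 * 24"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))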

GGUF (llama.cpp)

./llama-cli \
    -m Qwen3.6-35B-A3B-Claude-Opus-Q6_K.gguf \
    --lora qwen3.6-35b-opus-to-kimi-lora.gguf \
    -p "Solve this math problem step by step..."

⚠️ Prerequisite: The Docker command below uses the Opus reasoning adapter from lordx64. Download it first:

wget https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled/resolve/main/adapter_model.safetensors
# Or use the GGUF version for llama.cpp:
# Convert with: python3 llama.cpp/convert_lora_to_gguf.py /path/to/opus-adapter

Or use only the Kimi adapter without Opus: --lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf 1.0

llama.cpp Server (Docker) — Multi-LoRA Stacking 🔥

🌐 Stack multiple LoRAs at once — combine the uncensored base model + Opus reasoning LoRA + Kimi style LoRA into one OpenAI-compatible API server:

sudo docker run --rm -p 8080:8080 \
  -v /path/to/models/:/models \
  --gpus all \
  --env CUDA_VISIBLE_DEVICES=0,1,2,3 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
  --lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf 0.6 \
  --lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf 0.8 \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 999 \
  --tensor-split 4,13,12,12 \
  --ctx-size 131072 \
  --batch-size 4096 \
  --ubatch-size 512 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -fa on \
  --mlock \
  --jinja

What this does:

Component Purpose Weight
llmfan46_...-heretic-Q6_K.gguf Uncensored base (35B MoE) 🏛️ Base
lordx64_...-Opus-...-adapter-F16.gguf Claude Opus reasoning (concise) 0.6 = 60%
qwen3.6-35b-opus-to-kimi-lora.gguf → Kimi K2.6 style (verbose) 🔥 0.8 = 80%

Result: Uncensored base + Opus reasoning structure + Kimi verbose style — all in one model!

Key flags explained:

Flag Purpose
--lora-scaled FNAME SCALE Stack multiple LoRA adapters with independent scales (repeat the flag per adapter)
--n-gpu-layers 999 Offload all layers to GPU
--tensor-split 4,13,12,12 Split across 4 GPUs (adjust for your setup)
--ctx-size 131072 128K context window
--cache-type-k q4_0 KV cache in 4-bit quantization (saves VRAM)
--cache-type-v q4_0 Value cache in 4-bit quantization
-fa on Flash Attention enabled
--mlock Lock model in RAM (prevents swap)
--jinja Use Jinja2 chat templates
--lora FNAME Apply a LoRA adapter at scale 1.0 (can be repeated)
--lora-scaled FNAME SCALE Apply a LoRA adapter with a custom scale (repeat the flag for multiple adapters)

🛡️ 3-Layer Stack with Refusal Removal LoRA

For the purest uncensored stack using weight-diff extracted LoRAs:

Layer Component Purpose
1 Opus GGUF (base model) Qwen3.6-35B + Opus reasoning
2 refusal-removal-lora 🛡️ Remove refusals (uncensored)
3 opus-to-kimi-lora (scale 0.5) 🎨 Kimi K2.6 verbose style

docker run --gpus all -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Q6_K.gguf \
  --lora /models/qwen3.6-35b-refusal-removal-lora.gguf \
  --lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf 0.5 \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 999 \
  --ctx-size 131072 \
  --batch-size 4096 \
  -fa on

🔬 Technical note: The refusal-removal LoRA was extracted via Weight-Diff SVD from huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated minus lordx64/...Opus. It modifies only o_proj in 10 layers (3,7,11,15,19,23,27,31,35,39) — an extremely sparse signal compared to full distillation (Kimi LoRA touches all 44 attention tensors).
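
One way to see this sparsity for yourself is to list which modules an extracted PEFT adapter actually carries. A minimal sketch (the local adapter path is illustrative):

from safetensors import safe_open

# Illustrative local path to an extracted adapter's weights.
with safe_open("refusal-removal-lora/adapter_model.safetensors", framework="pt") as f:
    touched = sorted({k.rsplit(".lora_", 1)[0] for k in f.keys()})
print(f"{len(touched)} modules carry LoRA weights:")
for name in touched:
    print(" ", name)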


Old stack (uncensored GGUF base), single-GPU alternative:

sudo docker run --rm -p 8080:8080 \
  -v /path/to/models/:/models \
  --gpus all \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
  --lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf 0.6 \
  --lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf 0.8 \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --batch-size 2048 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -fa on --mlock --jinja

API Usage (OpenAI-compatible):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement step by step"}
    ],
    "temperature": 0.7,
    "max_tokens": 4096
  }'
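
The same endpoint can be called from Python with the official openai client, since llama.cpp's server speaks the OpenAI chat-completions protocol (the model name is only a placeholder; the server serves whatever model it loaded):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; ignored by llama.cpp server
    messages=[{"role": "user", "content": "Explain quantum entanglement step by step"}],
    temperature=0.7,
    max_tokens=4096,
)
print(resp.choices[0].message.content)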

💡 Tip: Adjust the two LoRA scales (Opus : Kimi) to tune the reasoning style; scales can also be changed at runtime, as shown after this list:

  • 0.6 : 0.8 — Balanced (Opus structure + Kimi verbosity)
  • 0.3 : 1.0 — Heavy Kimi style
  • 1.0 : 0.2 — Mostly Opus, slight Kimi touch
  • 0.0 : 1.0 — Pure Kimi style (skip the Opus adapter entirely)
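
Recent llama.cpp server builds expose GET/POST /lora-adapters endpoints for hot-swapping adapter scales without a restart. Treat the endpoint and payload below as an assumption about your build; adapter ids follow command-line order:

import requests

BASE = "http://localhost:8080"

# List the adapters the server loaded (id, path, current scale).
print(requests.get(f"{BASE}/lora-adapters").json())

# Re-weight: id 0 = Opus adapter, id 1 = opus-to-kimi adapter (command-line order).
requests.post(f"{BASE}/lora-adapters", json=[
    {"id": 0, "scale": 0.3},
    {"id": 1, "scale": 1.0},
])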

📊 Comparison: Opus vs Kimi Reasoning

Trait                    Claude Opus        + Kimi LoRA
Thinking tokens (mean)   849                2,933 (3.5x longer)
Thinking tokens (p95)    2,404              9,764
Style                    Concise, direct    Verbose, deliberate
Best for                 Quick reasoning    Deep multi-step reasoning
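
For reference, thinking-token counts like these can be measured by tokenizing the reasoning span of each response. A hedged sketch, assuming the model wraps its reasoning in <think>...</think> tags and that responses is a list of generated strings (both assumptions, not part of the original evaluation script):

import re
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled")

def thinking_tokens(response_text: str) -> int:
    # Count tokens inside the reasoning block; 0 if the model skipped it.
    m = re.search(r"<think>(.*?)</think>", response_text, flags=re.DOTALL)
    return len(tok.encode(m.group(1))) if m else 0

lengths = sorted(thinking_tokens(r) for r in responses)
mean = sum(lengths) / len(lengths)
p95 = lengths[int(0.95 * (len(lengths) - 1))]
print(f"mean={mean:.0f}  p95={p95}")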

🛠️ Technical Details

Parameter Value
Method Weight-diff SVD extraction
Rank 16
LoRA Alpha 16
Target modules q_proj, k_proj, v_proj, o_proj
Tensors extracted 44 (attention weights across 11 layers)
Tensor shapes q:[8192,2048] k/v:[512,2048] o:[2048,4096]
Adapter size 7.2 MB (PEFT) / 14 MB (GGUF F32)
Precision BF16 to F32 (GGUF)
Extraction time ~3 min (CPU SVD)
Disk needed ~145 GB (temporary, for both full models)
RAM needed ~3 GB (no GPU required)

🧪 Reproduction

Full extraction script and methodology available in the UKA Hermes Agent session log.

# Quick reproduction
python3 extract_lora_diff.py \
    --opus-path ./model_opus \
    --kimi-path ./model_kimi \
    --rank 16 \
    --output ./opus-to-kimi-lora
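
For completeness, the last step of such a script is writing the factors into PEFT's adapter layout. A hedged sketch of that step (key naming follows the standard PEFT convention for Qwen-style attention modules; lora_factors is a hypothetical dict of (A, B) pairs computed as above):

import json
from safetensors.torch import save_file

def save_peft_adapter(lora_factors, out_dir, rank=16, alpha=16):
    # lora_factors: {"model.layers.3.self_attn.q_proj": (lora_A, lora_B), ...}
    tensors = {}
    for module, (A, B) in lora_factors.items():
        tensors[f"base_model.model.{module}.lora_A.weight"] = A.contiguous()
        tensors[f"base_model.model.{module}.lora_B.weight"] = B.contiguous()
    save_file(tensors, f"{out_dir}/adapter_model.safetensors")

    config = {
        "peft_type": "LORA",
        "r": rank,
        "lora_alpha": alpha,
        "lora_dropout": 0.0,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "task_type": "CAUSAL_LM",
        "base_model_name_or_path":
            "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled",
    }
    with open(f"{out_dir}/adapter_config.json", "w") as f:
        json.dump(config, f, indent=2)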

👩‍💻 Credits

  • UKA (Hermes Agent) — designed the weight-diff SVD technique, wrote all extraction code, authored this README
  • lordx64 — trained the source models (Opus, Kimi)
  • Qwen Team — base model Qwen3.6-35B-A3B
  • Bas95 — original reasoning distillation datasets
  • Hermes Agent — nousresearch/hermes-agent

📄 License

Apache 2.0 — same as the source models.
