Gemma4 26B MoE — Kimi K2 Reasoning LoRA 🧠

LoRA adapter fine-tuned from google/gemma-4-26B-A4B-it on the Kimi K2 reasoning distill dataset (7,836 high-quality reasoning examples), trained entirely by UKA (Hermes Agent) 🤖

📋 Summary

Detail          Value
Base Model      google/gemma-4-26B-A4B-it (26B MoE, 128 experts, ~4B active/token)
Dataset         lordx64/reasoning-distill-kimi-k2-6-max-sft (7,836 examples)
Method          Custom NF4 per-expert quantization + LoRA
Pipeline        AndriejusNak/gemma4-26b-moe-finetune
GPU             NVIDIA RTX 5090 32GB (Vast.ai Cloud)
Training Time   128 minutes (~2h 8m)
Best Loss       1.0651
NaN Explosions  0

🖥️ Hardware

Component  Specification
GPU        NVIDIA GeForce RTX 5090 32GB GDDR7
CPU        Intel Core i7-14700K (20 cores, 28 threads)
RAM        94 GB DDR5
Disk       200 GB NVMe SSD
Cloud      Vast.ai
CUDA       13.0
PyTorch    2.12.0.dev (nightly, cu128)

Why RTX 5090: Gemma 4 26B MoE requires custom per-expert NF4 quantization; standard bitsandbytes cannot quantize nn.Parameter (expert weights). The pipeline quantizes the experts itself, which puts the VRAM peak at ~24 GB: a fit for the RTX 5090's 32 GB, but over the RTX 3090's 24 GB (when using seq=1024 + MLP LoRA).
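
For illustration, a minimal sketch of per-expert quantization using bitsandbytes' functional API, which operates on raw tensors and so sidesteps the nn.Parameter limitation. This is an assumption about the approach, not the pipeline's actual code; see AndriejusNak/gemma4-26b-moe-finetune for the real implementation.

# Sketch: quantize raw expert nn.Parameter tensors to NF4 by hand.
# Illustrative only — the pipeline's real implementation may differ.
import torch
import bitsandbytes.functional as bnbF

def quantize_expert_weights(model, name_filter="experts"):
    """Quantize matching expert weights to NF4, keeping quant states around."""
    quant_states = {}
    for name, param in model.named_parameters():
        if name_filter in name and param.dim() >= 2:
            q, state = bnbF.quantize_nf4(param.data.cuda())  # 4-bit storage
            quant_states[name] = (q, state)
            param.data = torch.empty(0, device=param.device)  # free BF16 copy
    return quant_states

# At forward time, a hook dequantizes the expert weight just-in-time:
#   w = bnbF.dequantize_nf4(q, state)   # back to the original dtype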

🔧 Training Configuration

# v6_26b_pipeline.py — Final Config
MODEL_NAME = "google/gemma-4-26B-A4B-it"
MAX_SEQ_LENGTH = 1024
LORA_R = 32
LORA_ALPHA = 32
INCLUDE_MLP_LORA = True      # Attention + MLP layers
SFT_EPOCHS = 2
SFT_BATCH_SIZE = 3            # Per GPU
SFT_GRAD_ACCUM = 8            # Effective batch = 24
SFT_LR = 2e-5                 # Cosine schedule, warmup 245 steps
SFT_FILES = ["data/kimi_k2_sft.jsonl"]
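
For reference, a sketch of how the optimizer and cosine schedule could be wired up with the standard transformers helper. The optimizer choice (AdamW) and this exact wiring are assumptions; only the numbers come from the config above.

# Sketch (assumed wiring): AdamW + cosine LR schedule with warmup.
import torch
from transformers import get_cosine_schedule_with_warmup

trainable = [p for p in model.parameters() if p.requires_grad]  # LoRA params only
optimizer = torch.optim.AdamW(trainable, lr=2e-5)               # SFT_LR
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=245,    # warmup steps from the config comment
    num_training_steps=613,  # optimizer steps (see Training Stats below)
)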

LoRA Details

  • Rank (r): 32, Alpha: 32
  • Target modules: q_proj, k_proj, v_proj, o_proj (attention) + gate_proj, up_proj, down_proj (MLP)
  • Trainable params: 59,275,776 / 3,027,224,428 (1.96%)
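
The equivalent peft LoraConfig would look roughly like this (a reconstruction from the bullets above; lora_dropout and other unlisted fields are assumptions):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    lora_dropout=0.0,        # assumed; not stated in this card
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # should report ~1.96% trainable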

Training Stats

  • Examples: 7,836 → 7,358 after filtering out 478 all-masked examples (see the sketch after this list)
  • Forward passes: 4,906
  • Optimizer steps: 613
  • VRAM peak: 23.9 GB
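
The all-masked filter referenced in the first bullet drops examples whose loss labels are entirely -100, i.e. no assistant tokens survive tokenization and truncation. A minimal sketch, with illustrative names rather than the pipeline's own:

IGNORE_INDEX = -100  # tokens with this label contribute no loss

def has_trainable_tokens(labels):
    return any(tok != IGNORE_INDEX for tok in labels)

# 7,836 → 7,358: 478 examples had every label masked and were dropped
examples = [ex for ex in examples if has_trainable_tokens(ex["labels"])]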

Loss Progression

Step  50: Loss 3.0597  (epoch 1)
Step 100: Loss 1.3277
Step 150: Loss 1.1658
Step 200: Loss 1.0906
Step 250: Loss 1.1220
Step 300: Loss 1.0723
  → Epoch 1 avg: 1.4648
Step 350: Loss 1.0660  (epoch 2)
Step 400: Loss 1.0616
Step 450: Loss 1.0722
Step 500: Loss 1.0586
Step 550: Loss 1.0370
Step 600: Loss 1.0983
  → Epoch 2 avg: 1.0651 🎯 Best!

🚀 Usage

Install Dependencies

pip install transformers peft torch accelerate   # accelerate is needed for device_map="auto"

Load Base Model + LoRA

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model (BF16, needs ~52 GB VRAM)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26B-A4B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load this LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "hotdogs/gemma4-26b-kimi-k2-reasoning-lora"
)

# Optional: merge for faster inference
model = model.merge_and_unload()

Chat / Inference

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-26B-A4B-it")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Solve: 3x + 7 = 22"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

🧪 How This Was Trained

This adapter was trained autonomously by UKA, an AI Agent running Hermes Agent, following this workflow:

1. Dataset Conversion

The Kimi K2 reasoning distill dataset comes as Parquet with a single text column in Kimi chat format (<|im_start|>role\n...<|im_end|>).

# convert_kimi.py — Parquet → JSONL messages format
import io, json, os, re

import pyarrow.parquet as pq
import requests

url = "https://huggingface.co/datasets/lordx64/reasoning-distill-kimi-k2-6-max-sft/resolve/main/data/train-00000-of-00001.parquet"
r = requests.get(url)
r.raise_for_status()
table = pq.read_table(io.BytesIO(r.content))
texts = table.column("text").to_pylist()

# Split each record's <|im_start|>role\n...<|im_end|> turns into messages
pattern = r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>"
os.makedirs("data", exist_ok=True)
with open("data/kimi_k2_sft.jsonl", "w", encoding="utf-8") as f:
    for text in texts:
        matches = re.findall(pattern, text, re.DOTALL)
        messages = [{"role": role.strip(), "content": content.strip()}
                    for role, content in matches]
        f.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")

2. Pipeline Setup

git clone https://github.com/AndriejusNak/gemma4-26b-moe-finetune.git
cd gemma4-26b-moe-finetune
pip install transformers peft bitsandbytes accelerate safetensors pyarrow requests

# Edit v6_26b_pipeline.py:
#   SFT_FILES = ["data/kimi_k2_sft.jsonl"]
#   MAX_SEQ_LENGTH = 1024
#   LORA_R = 32, LORA_ALPHA = 32
#   INCLUDE_MLP_LORA = True
#   SFT_EPOCHS = 2, SFT_BATCH_SIZE = 3

3. Download Base Model + Train

python3 v6_26b_pipeline.py --phase 0                          # Download model (~7 min)
python3 -u v6_26b_pipeline.py --phase 1 | tee /tmp/sft.log    # Train (~2 hrs)

Hardware Notes

  • Why RTX 5090 needed: Gemma 4 26B MoE requires custom NF4 quantization. Standard bitsandbytes can't quantize nn.Parameter (expert weights). The pipeline quantizes experts manually, peaking at ~24 GB VRAM — fits on RTX 5090 32GB but NOT on RTX 3090 24GB (would need seq=512, no MLP LoRA).
  • Why PyTorch nightly: RTX 5090 = Blackwell sm_120. PyTorch stable only supports up to sm_90. Nightly cu128 is required.
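
For reference, the nightly build typically installs with a command like the following (the usual pattern, not copied from the training logs; check pytorch.org for the current index URL):

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128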

📦 Files in This Repo

adapter_model.safetensors   — LoRA weights (227 MB)
adapter_config.json         — LoRA config: r=32, alpha=32, attention+MLP
tokenizer.json              — Gemma 4 tokenizer (31 MB)
tokenizer_config.json       — Tokenizer config
chat_template.jinja         — Chat template

⚠️ Limitations

  • 32% of training examples were truncated at seq=1024 (mean length: 941 tokens); see the sketch after this list
  • LoRA adapter only — not a full fine-tune
  • Trained on Kimi K2 reasoning style — may differ from Gemma's native output style
  • BF16 base model requires ~52 GB VRAM
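
The truncation statistic in the first bullet can be reproduced roughly as follows (illustrative; assumes examples loaded from data/kimi_k2_sft.jsonl and the tokenizer from the Usage section):

lengths = [
    len(tokenizer.apply_chat_template(ex["messages"], tokenize=True))
    for ex in examples
]
mean_len = sum(lengths) / len(lengths)                      # ≈ 941 tokens
truncated = sum(l > 1024 for l in lengths) / len(lengths)   # ≈ 32%
print(f"mean={mean_len:.0f} tokens, truncated={truncated:.0%}")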

🙏 Credits
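
  • Base model: google/gemma-4-26B-A4B-it (Google)
  • Dataset: lordx64/reasoning-distill-kimi-k2-6-max-sft
  • Training pipeline: AndriejusNak/gemma4-26b-moe-finetune
  • Training run: UKA (Hermes Agent) on Vast.ai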
