A 197K Code Review Dataset with Chain-of-Thought Reasoning Traces
We just released liodon-ai/gemma4-code-review-instruct — 197K instruction-tuning examples for training code review assistants, with 58K of them containing explicit <think> reasoning traces before the final review.
The Problem With Existing Code Review Datasets
There are a handful of code review datasets on HuggingFace. Most of them look like this:
User: [diff]
Assistant: "Consider adding error handling here."
That's it. No reasoning. No explanation of why this matters. The model learns to produce review-shaped text, but not to actually analyze the code.
The result: models fine-tuned on these datasets produce fluffy, generic comments. They flag style issues but miss the thread-safety bug on line 47.
What We Did Differently
About 58,600 examples include explicit chain-of-thought reasoning traces, wrapped in <think> tags, before the review comment. Here's what a training example looks like:
{
  "messages": [
    {
      "role": "system",
      "content": "You are an expert code review assistant. First think through the code changes step by step, then provide a clear, actionable review. Use <think> tags for your reasoning before giving the final review."
    },
    {
      "role": "user",
      "content": "Language: Python\n\nDiff:\n```diff\n@@ -14,6 +14,8 @@\n- return db.query(User).filter(User.id == user_id).first()\n+ user = db.query(User).filter(User.id == user_id).first()\n+ cache[user_id] = user\n+ return user\n```"
    },
    {
      "role": "assistant",
      "content": "<think>\nThe diff adds a cache write after a DB query. Need to check: 1) Is `cache` defined in scope? 2) No TTL — stale data risk. 3) No thread safety — concurrent writes could corrupt the dict. 4) Cache is written even if user is None.\n</think>\n\nThe caching logic has several issues:\n- **Stale data**: no TTL means updates to the user won't be reflected until restart\n- **Thread safety**: `cache[user_id] = user` is not atomic; use `threading.Lock` or switch to a thread-safe cache\n- **None caching**: if the user doesn't exist, `None` gets cached and future lookups will return `None` without hitting the DB"
    }
  ]
}
The reasoning trace forces the model to slow down and systematically check things: scope, TTL, thread safety, edge cases. The final review becomes a consequence of actual analysis rather than pattern matching.
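Because the trace is delimited by <think> tags, it can be separated from the final review when inspecting examples or scoring model output. A minimal sketch, assuming each assistant message carries at most one <think>...</think> block at its start; the helper name `split_reasoning` is ours for illustration and is not part of the dataset tooling:

```python
import re
from typing import Optional, Tuple

# Matches one leading <think>...</think> block; DOTALL lets "." span newlines.
THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(content: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, review); reasoning is None when there is no trace."""
    match = THINK_RE.match(content)
    if match is None:
        return None, content.strip()
    reasoning = match.group(1).strip()
    review = content[match.end():].strip()
    return reasoning, review


example = "<think>\nNo TTL, so stale data is possible.\n</think>\n\nAdd a TTL to the cache."
reasoning, review = split_reasoning(example)
print("reasoning:", reasoning)
print("review:", review)
```

Messages from the standard diff → review subset come back with `None` as the reasoning, so the same helper can be applied to both kinds of example.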
Dataset Stats
| Split | Examples |
|---|---|
| Train | 187,226 |
| Validation | 9,855 |
| Total | 197,081 |
| Subset | Examples |
|---|---|
| With <think> reasoning traces | ~58,600 |
| Standard diff → review | ~138,500 |
Languages covered: Python · TypeScript · JavaScript · Java · C++ · Go · Rust
Where the Data Comes From
We merged two complementary sources:
ronantakizawa/github-codereview — 334K real GitHub PR review comments with full diff context, language tags, and quality scores. We filtered to quality ≥ 0.3 and removed flagged-negative examples, keeping ~138K.
AmanPriyanshu/reasoning-sft-github-codereview — 76.7K code review examples with explicit reasoning traces, categorized by review type (bug, suggestion, style, etc.). We kept ~58K after filtering.
Processing Pipeline
- Normalize: all sources converted to a unified messages format
- Quality filter: remove low-quality comments (score < 0.3) and flagged-negative examples
- Reasoning augmentation: <think> tags preserved from reasoning-source examples
- Deduplication: MD5-based dedup on user message content across all sources
- Length filter: examples exceeding 8K estimated tokens removed
- Split: 95/5 train/validation
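The filtering, deduplication, and split steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pipeline code: the field names `quality` and `flagged_negative`, the four-characters-per-token estimate, the seeded shuffle, and the function names are all hypothetical.

```python
import hashlib
import random

MIN_QUALITY = 0.3        # threshold stated in the pipeline
MAX_EST_TOKENS = 8_000   # "8K estimated tokens"
CHARS_PER_TOKEN = 4      # assumption: rough characters-per-token heuristic


def user_text(example):
    """Concatenate the content of all user turns in a messages-format example."""
    return "\n".join(m["content"] for m in example["messages"] if m["role"] == "user")


def estimated_tokens(example):
    """Crude token estimate over every message (assumed heuristic)."""
    total_chars = sum(len(m["content"]) for m in example["messages"])
    return total_chars // CHARS_PER_TOKEN


def run_pipeline(examples, seed=0):
    seen = set()
    kept = []
    for ex in examples:
        # Quality filter. Both field names are hypothetical; rows that carry
        # no score are assumed to pass through unchanged.
        if ex.get("quality", 1.0) < MIN_QUALITY or ex.get("flagged_negative", False):
            continue
        # MD5-based dedup keyed on the user message content.
        digest = hashlib.md5(user_text(ex).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Length filter on the estimated token count.
        if estimated_tokens(ex) > MAX_EST_TOKENS:
            continue
        kept.append(ex)

    # 95/5 train/validation split after a seeded shuffle (integer arithmetic
    # keeps the cut point deterministic).
    random.Random(seed).shuffle(kept)
    cut = len(kept) * 95 // 100
    return kept[:cut], kept[cut:]
```

Keying the hash on the user message alone means two different review comments on the same diff collapse to a single example, which matches the dedup rule described in the list.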
How to Use It
Load with the datasets library:
from datasets import load_dataset
ds = load_dataset("liodon-ai/gemma4-code-review-instruct", split="train")
print(ds[0]["messages"])
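To work with only the reasoning subset, one option is a predicate on the assistant turn. A small sketch, assuming a trace is always marked by a literal <think> tag in the assistant content; the helper name `has_reasoning_trace` is hypothetical:

```python
def has_reasoning_trace(example):
    """True if any assistant turn in the example contains a <think> block."""
    return any(
        m["role"] == "assistant" and "<think>" in m["content"]
        for m in example["messages"]
    )


# Two in-memory records standing in for dataset rows.
sample = [
    {"messages": [
        {"role": "user", "content": "Diff: ..."},
        {"role": "assistant", "content": "<think>check scope</think>\n\nLooks fine."},
    ]},
    {"messages": [
        {"role": "user", "content": "Diff: ..."},
        {"role": "assistant", "content": "Consider adding error handling here."},
    ]},
]
print(sum(has_reasoning_trace(ex) for ex in sample))
```

With the dataset loaded as above, `ds.filter(has_reasoning_trace)` should select the same kind of rows; we have not checked that the resulting count matches the ~58,600 figure exactly.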
Fine-tune with TRL's SFTTrainer:
from trl import SFTConfig, SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-4B")
trainer = SFTTrainer(
    model=model,
    # Recent TRL releases name this max_length; older ones use max_seq_length.
    args=SFTConfig(output_dir="./reviewer", max_length=2048),
    train_dataset=ds,  # the train split loaded above
    processing_class=tokenizer,
)
trainer.train()
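The messages column is already in chat format. SFTTrainer applies the tokenizer's chat template to conversational datasets automatically; if you tokenize the data yourself, pass each messages list through your tokenizer's apply_chat_template.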
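If you prepare the training text yourself rather than relying on the trainer, rendering one example might look like the sketch below. `apply_chat_template` is the Transformers tokenizer method; the wrapper name `render_example` and the `text` output key are our own choices for illustration.

```python
def render_example(example, tokenizer):
    """Render one messages-format example into a single training string."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,               # return a string rather than token ids
        add_generation_prompt=False,  # the assistant turn is already present
    )
    return {"text": text}
```

Applied over the whole split, this would be something like `ds.map(lambda ex: render_example(ex, tokenizer))`. Some chat templates treat <think> blocks specially, so it is worth inspecting one rendered string before training.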
Why Reasoning Traces Matter for Code Review
A model without reasoning traces produces a review by recognizing surface patterns: "cache write → mention TTL." A model trained with reasoning traces builds a checklist: scope → staleness → concurrency → edge cases → then write the comment.
The difference shows up on non-obvious bugs. Pattern matching fails when the bug doesn't look like a known pattern. Working through a checklist gives the model a better chance of catching it anyway.
This mirrors what the best human reviewers actually do — they don't just react to code, they interrogate it.
What's Next
We're fine-tuning Qwen3.5-4B on this dataset to produce a local code review assistant that runs entirely on-device — no API key, no cloud, no sending your proprietary code anywhere. The model and GGUF quantizations will be released at liodon-ai/qwen3.5-4B-reviewer-GGUF.
Dataset: liodon-ai/gemma4-code-review-instruct
License: Apache 2.0