A 197K Code Review Dataset with Chain-of-Thought Reasoning Traces

Community Article
Published June 21, 2026

Most code review datasets teach a model to look at a diff and produce a comment. Ours teaches it to think first.

We just released liodon-ai/gemma4-code-review-instruct — 197K instruction-tuning examples for training code review assistants, with 58K of them containing explicit <think> reasoning traces before the final review.


The Problem With Existing Code Review Datasets

There are a handful of code review datasets on HuggingFace. Most of them look like this:

User: [diff]
Assistant: "Consider adding error handling here."

That's it. No reasoning. No explanation of why this matters. The model learns to produce review-shaped text, but not to actually analyze the code.

The result: models fine-tuned on these datasets produce fluffy, generic comments. They flag style issues but miss the thread-safety bug on line 47.


What We Did Differently

Roughly 58,600 examples (about 30% of the dataset) include explicit chain-of-thought reasoning traces using <think> tags before the review comment. Here's what a training example looks like:

{
  "messages": [
    {
      "role": "system",
      "content": "You are an expert code review assistant. First think through the code changes step by step, then provide a clear, actionable review. Use <think> tags for your reasoning before giving the final review."
    },
    {
      "role": "user",
      "content": "Language: Python\n\nDiff:\n```diff\n@@ -14,6 +14,8 @@\n-    return db.query(User).filter(User.id == user_id).first()\n+    user = db.query(User).filter(User.id == user_id).first()\n+    cache[user_id] = user\n+    return user\n```"
    },
    {
      "role": "assistant",
      "content": "<think>\nThe diff adds a cache write after a DB query. Need to check: 1) Is `cache` defined in scope? 2) No TTL — stale data risk. 3) No thread safety — concurrent writes could corrupt the dict. 4) Cache is written even if user is None.\n</think>\n\nThe caching logic has several issues:\n- **Stale data**: no TTL means updates to the user won't be reflected until restart\n- **Thread safety**: `cache[user_id] = user` is not atomic; use `threading.Lock` or switch to a thread-safe cache\n- **None caching**: if the user doesn't exist, `None` gets cached and future lookups will return `None` without hitting the DB"
    }
  ]
}

The reasoning trace forces the model to slow down and systematically check things: scope, TTL, thread safety, edge cases. The final review becomes a consequence of actual analysis rather than pattern matching.
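If you want to inspect or evaluate the reasoning and the final review separately, the assistant message can be split on the <think> block. The following is a minimal sketch, assuming the trace appears once at the start of the assistant content as in the example above; `split_reasoning` is a hypothetical helper name, not part of the dataset or any library.

```python
import re
from typing import Optional, Tuple

# Matches one leading <think>...</think> block; DOTALL lets "." span newlines.
THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(content: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, review); reasoning is None when no trace is present."""
    match = THINK_RE.match(content)
    if match is None:
        return None, content.strip()
    return match.group(1).strip(), content[match.end():].strip()


example = (
    "<think>\nCheck scope, TTL, thread safety, None handling.\n</think>\n\n"
    "The caching logic has several issues."
)
reasoning, review = split_reasoning(example)
print(reasoning)  # Check scope, TTL, thread safety, None handling.
print(review)     # The caching logic has several issues.
```

Examples from the standard diff → review subset have no trace, so the helper returns `None` for the reasoning and the full text as the review.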


Dataset Stats

| Split | Examples |
|---|---|
| Train | 187,226 |
| Validation | 9,855 |
| Total | 197,081 |

| Subset | Examples |
|---|---|
| With <think> reasoning traces | ~58,600 |
| Standard diff → review | ~138,500 |

Languages covered: Python · TypeScript · JavaScript · Java · C++ · Go · Rust
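To see how the examples are distributed across these languages, one option is to read the language tag out of the user message. This sketch assumes every user message carries a "Language: ..." line like the training example shown earlier, which may not hold for every row; `language_of` and `language_counts` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumption: the tag sits on its own line, as in the sample user message.
LANGUAGE_RE = re.compile(r"^Language:[ \t]*(.+)$", re.MULTILINE)


def language_of(example: dict) -> str:
    """Read the language tag from the first user message, or return 'unknown'."""
    for message in example["messages"]:
        if message["role"] == "user":
            match = LANGUAGE_RE.search(message["content"])
            return match.group(1).strip() if match else "unknown"
    return "unknown"


def language_counts(examples) -> Counter:
    """Count examples per language tag."""
    return Counter(language_of(example) for example in examples)


sample = [
    {"messages": [{"role": "user", "content": "Language: Python\n\nDiff:\n+x = 1"}]},
    {"messages": [{"role": "user", "content": "Language: Go\n\nDiff:\n+x := 1"}]},
    {"messages": [{"role": "user", "content": "Language: Python\n\nDiff:\n+y = 2"}]},
]
print(language_counts(sample).most_common())  # [('Python', 2), ('Go', 1)]
```

The same function works on the loaded dataset, since iterating it yields dictionaries with a `messages` list.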


Where the Data Comes From

We merged two complementary sources:

ronantakizawa/github-codereview — 334K real GitHub PR review comments with full diff context, language tags, and quality scores. We filtered to quality ≥ 0.3 and removed flagged-negative examples, keeping ~138K.

AmanPriyanshu/reasoning-sft-github-codereview — 76.7K code review examples with explicit reasoning traces, categorized by review type (bug, suggestion, style, etc.). We kept ~58K after filtering.
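The filtering and conversion of a raw review record might look roughly like the sketch below. It is not the actual pipeline code: the field names (`quality_score`, `is_negative`, `language`, `diff`, `comment`) are hypothetical stand-ins for whatever the source datasets call them, and the user message layout simply mirrors the training example shown earlier.

```python
QUALITY_THRESHOLD = 0.3  # threshold stated in the text
FENCE = "`" * 3          # built indirectly so this snippet's own fence stays intact


def keep_review(record: dict, threshold: float = QUALITY_THRESHOLD) -> bool:
    """Keep records at or above the threshold that are not flagged negative."""
    # "quality_score" and "is_negative" are hypothetical field names.
    return record["quality_score"] >= threshold and not record["is_negative"]


def to_messages(record: dict, system_prompt: str) -> dict:
    """Convert a raw review record into the unified messages format."""
    user = (
        f"Language: {record['language']}\n\n"
        f"Diff:\n{FENCE}diff\n{record['diff']}\n{FENCE}"
    )
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user},
            {"role": "assistant", "content": record["comment"]},
        ]
    }


records = [
    {"language": "Go", "diff": "+x := 1", "comment": "Guard against nil here.",
     "quality_score": 0.72, "is_negative": False},
    {"language": "Go", "diff": "+y := 2", "comment": "lgtm",
     "quality_score": 0.10, "is_negative": False},
    {"language": "Go", "diff": "+z := 3", "comment": "This is wrong.",
     "quality_score": 0.55, "is_negative": True},
]
kept = [to_messages(r, "You are an expert code review assistant.")
        for r in records if keep_review(r)]
print(len(kept))  # 1
```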

Processing Pipeline

  1. Normalize: all sources converted to a unified messages format
  2. Quality filter: remove low-quality comments (score < 0.3) and flagged-negative examples
  3. Reasoning augmentation: <think> tags preserved from reasoning-source examples
  4. Deduplication: MD5-based dedup on user message content across all sources
  5. Length filter: examples exceeding 8K estimated tokens removed
  6. Split: 95/5 train/validation
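Steps 4 to 6 can be sketched as follows. This is an illustration under stated assumptions rather than the code that built the dataset: the article does not say how tokens were estimated, so the four-characters-per-token heuristic is a guess, and the shuffle seed and function names are hypothetical.

```python
import hashlib
import random

MAX_TOKENS = 8_000   # limit stated in step 5
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not the documented estimator


def user_text(example: dict) -> str:
    """Concatenate the user message content, which is the dedup key."""
    return "\n".join(m["content"] for m in example["messages"] if m["role"] == "user")


def dedup(examples: list) -> list:
    """Keep the first example for each MD5 digest of the user content."""
    seen, unique = set(), []
    for example in examples:
        digest = hashlib.md5(user_text(example).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(example)
    return unique


def within_length(example: dict, max_tokens: int = MAX_TOKENS) -> bool:
    """Estimate tokens from character count and compare against the limit."""
    total_chars = sum(len(m["content"]) for m in example["messages"])
    return total_chars // CHARS_PER_TOKEN <= max_tokens


def train_val_split(examples: list, val_fraction: float = 0.05, seed: int = 0):
    """Shuffle deterministically, then hold out val_fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def make_example(i: int) -> dict:
    return {"messages": [
        {"role": "user", "content": f"Language: Python\n\nDiff:\n+x = {i}"},
        {"role": "assistant", "content": "Looks fine."},
    ]}


raw = [make_example(i) for i in range(100)] + [make_example(0)]  # one duplicate
clean = [ex for ex in dedup(raw) if within_length(ex)]
train, val = train_val_split(clean)
print(len(train), len(val))  # 95 5
```

Hashing only the user message means two rows with the same diff but different review comments collapse into one, which matches the description in step 4.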

How to Use It

Load with the datasets library:

from datasets import load_dataset

ds = load_dataset("liodon-ai/gemma4-code-review-instruct", split="train")
print(ds[0]["messages"])

Fine-tune with TRL's SFTTrainer:

from trl import SFTConfig, SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-4B")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./reviewer", max_length=2048),  # older TRL releases name this max_seq_length
    train_dataset=ds,
    processing_class=tokenizer,
)
trainer.train()

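The messages column is already in chat format, so SFTTrainer applies the tokenizer's chat template for you. If you tokenize outside the trainer, pass each conversation through your tokenizer's apply_chat_template.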
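If you only want the reasoning subset, or want to control the mix of reasoning and standard examples, a simple predicate on the assistant message is enough. This is a sketch that assumes the presence of a <think> tag in the assistant content is what distinguishes the two subsets; `has_reasoning_trace` is a hypothetical helper name.

```python
def has_reasoning_trace(example: dict) -> bool:
    """True when any assistant message contains a <think> block."""
    return any(
        message["role"] == "assistant" and "<think>" in message["content"]
        for message in example["messages"]
    )


sample = [
    {"messages": [
        {"role": "user", "content": "Language: Python\n\nDiff:\n+cache[k] = v"},
        {"role": "assistant", "content": "<think>\nNo TTL.\n</think>\n\nAdd an expiry."},
    ]},
    {"messages": [
        {"role": "user", "content": "Language: Go\n\nDiff:\n+x := 1"},
        {"role": "assistant", "content": "Consider a more descriptive name."},
    ]},
]
reasoning_only = [ex for ex in sample if has_reasoning_trace(ex)]
print(len(reasoning_only))  # 1
```

On the loaded dataset the same predicate can be passed to `ds.filter(has_reasoning_trace)`. The check is restricted to assistant messages because the system prompt of reasoning examples also mentions <think> tags.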


Why Reasoning Traces Matter for Code Review

A model without reasoning traces produces a review by recognizing surface patterns: "cache write → mention TTL." A model trained with reasoning traces builds a checklist: scope → staleness → concurrency → edge cases → then write the comment.

The difference shows up on non-obvious bugs. Pattern matching fails when the bug doesn't look like a known pattern. A systematic checklist still has a chance of catching it, because it asks the same questions regardless of what the code looks like.

This mirrors what the best human reviewers actually do — they don't just react to code, they interrogate it.


What's Next

We're fine-tuning Qwen3.5-4B on this dataset to produce a local code review assistant that runs entirely on-device — no API key, no cloud, no sending your proprietary code anywhere. The model and GGUF quantizations will be released at liodon-ai/qwen3.5-4B-reviewer-GGUF.


Dataset: liodon-ai/gemma4-code-review-instruct
License: Apache 2.0
