Abstract
Simple self-distillation improves code generation in large language models by fine-tuning on model-generated samples, effectively addressing precision-exploration trade-offs in decoding.
Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
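For intuition on the decoding side, here is a toy sketch of how temperature scaling and top-p (nucleus) truncation reshape a next-token distribution, suppressing the distractor tail while concentrating mass on strong candidates. The logits and settings below are invented for illustration and are not the paper's actual sampling configuration:

```python
import math

def temperature_scale(logits, t):
    # Divide logits by temperature, then softmax.
    # t < 1 sharpens the distribution; t > 1 flattens it.
    scaled = [l / t for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_truncate(probs, p):
    # Nucleus truncation: keep the smallest set of tokens whose
    # cumulative probability reaches p, zero out the rest,
    # then renormalize the kept mass to 1.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    z = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / z
    return out

# Toy next-token logits: one strong candidate plus a distractor tail.
logits = [4.0, 2.0, 1.0, 0.5, 0.2]
probs = temperature_scale(logits, t=0.8)
truncated = top_p_truncate(probs, p=0.9)
print(truncated)  # tail tokens are zeroed; top token gains mass
```

With these toy settings only the top two tokens survive truncation; the tail that would occasionally produce imprecise completions is cut off entirely, which is the "precision" side of the trade-off the abstract describes.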
Community
This is awesome!
I wonder if some filtering could improve training metrics even beyond what you've found, such as using a cheap proxy model to separate outputs into "helpful" and "not so helpful".
Correctness is not the same thing as usefulness; that much is clear. So a smaller-parameter model from the same family would be especially relevant (not a classifier, but a quick training test-run that gives useful signals, e.g., loss and evals).
Quick follow-up after actually trying this direction:
I tested a variant where a small proxy model (sub-1B) is used to score generations by "usefulness" (rather than correctness), and then used that to filter SSD data.
On my setup (~3B student, ~0.5B proxy), proxy-filtered SSD consistently beat both raw SSD and correctness filtering on HumanEval+ (by ~13-21 points depending on baseline). So at least in this regime, it does seem like there's real signal in 'usefulness ≠ correctness', and a cheap model can pick it up!
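A minimal sketch of the filtering rule described above. The function name and the length-based stand-in score are hypothetical; in the actual setup the score would come from the sub-1B proxy model (e.g., negative loss on the sample), not from correctness checks:

```python
def filter_by_proxy(samples, proxy_score, keep_frac=0.5):
    # Rank model-generated samples by a proxy "usefulness" score
    # and keep the top fraction for SSD fine-tuning.
    # proxy_score is a placeholder for the small proxy model's signal.
    ranked = sorted(samples, key=proxy_score, reverse=True)
    k = max(1, int(len(ranked) * keep_frac))
    return ranked[:k]

# Toy demo with a stand-in score (sample length); a real run would
# replace this with the proxy model's usefulness signal.
samples = [
    "def f(): pass",
    "def f(x): return x * 2",
    "syntax error((",
    "def g(x): return x + 1",
]
kept = filter_by_proxy(samples, proxy_score=len, keep_frac=0.5)
print(kept)
```

Note that nothing in the rule requires the kept samples to be correct or even executable, which is consistent with the observation below that many high-utility samples were not.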
One interesting detail: a lot of the high-utility samples selected by the proxy were actually incorrect or non-executable, but still seemed to improve downstream training. So, just like your paper found, it's not about finding hidden correct solutions but about structural utility.
Super cool, in my opinion.
Caveat: none of the fine-tuned variants beat the base model yet, so this feels more like a better SSD selection rule than a full win over the base setup... Also still working through LiveCodeBench, which should be a better test of whether this holds up on harder problems.
But I suspect an actual lab looking into this could turn it into a much more general improvement. I'm just a single guy.
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- On-Policy Self-Distillation for Reasoning Compression (2026)
- HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation (2026)
- Entropy-Aware On-Policy Distillation of Language Models (2026)
- Shorter Thoughts, Same Answers: Difficulty-Scaled Segment-Wise RL for CoT Compression (2026)
- Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation (2026)
- ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning (2026)
- Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? (2026)
Interesting premise. Sounds similar to some of the refinement techniques that were coming out last year; maybe the mechanism is different? I've only had time to skim so far, but it looks promising based on the current results.
Get this paper in your agent: hf papers read 2604.01193
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash