arxiv:2605.26108

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Published on May 25

· Submitted by

Yushi Huang on May 26

Tencent Hunyuan


Authors:

Yushi Huang ,

Abstract

RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to better align few-step image generators with human preferences.

AI-generated summary

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
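
The decomposition stated in the abstract follows from a short calculation. The sketch below assumes the reward-tilted teacher takes the common exponential-tilting form; the symbols p_θ (generator), p_T (teacher), r (reward), β > 0 (temperature), and Z (normalizer) are notation chosen here for illustration and may differ from the paper's.

```latex
\begin{aligned}
p^{*}(x) &= \frac{1}{Z}\, p_{T}(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \mathbb{E}_{x \sim p_{T}}\!\big[\exp\!\big(r(x)/\beta\big)\big], \\
D_{\mathrm{KL}}\!\big(p_{\theta} \,\|\, p^{*}\big)
&= \mathbb{E}_{x \sim p_{\theta}}\!\big[\log p_{\theta}(x) - \log p_{T}(x) - r(x)/\beta\big] + \log Z \\
&= \underbrace{D_{\mathrm{KL}}\!\big(p_{\theta} \,\|\, p_{T}\big)}_{\text{distribution matching}}
\;-\; \frac{1}{\beta}\, \underbrace{\mathbb{E}_{x \sim p_{\theta}}\!\big[r(x)\big]}_{\text{reward maximization}}
\;+\; \log Z .
\end{aligned}
```

Because log Z does not depend on θ, minimizing the left-hand side is equivalent, under these assumptions, to minimizing the distribution matching term while maximizing the expected reward scaled by 1/β.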


Community

Harahan

Paper author Paper submitter 3 days ago

We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a
two-stage framework that unifies distribution-matching distillation with
reward-guided RL for few-step flow generators. Minimizing the KL divergence to
a reward-tilted teacher distribution decomposes naturally into a
distribution-matching term and a reward-maximization term. These are instantiated
as Ambient-Consistent DMD (AC-DMD) for the cold start and a hybrid policy
gradient (SubGRPO + final-step reward back-propagation) for the RL stage.
With 4 NFE, RTDMD reaches new SOTA on SD3-M / SD3.5-M / FLUX.2 4B; the
distilled FLUX.2 4B even beats the full FLUX.2 9B teacher (50 NFE) on most
rewards.
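
To make the GRPO-style part of the hybrid gradient more concrete, here is a minimal NumPy sketch, not the authors' implementation. It shows the standard group-relative advantage (rewards standardized within a group of samples for one prompt) applied to a randomly chosen subset of the stochastic steps. The function names, the uniform subset rule, and the choice of 3 stochastic transitions for a 4-step generator are all assumptions made for illustration; the final-step reward back-propagation is not shown.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples (same prompt)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def sample_step_subset(num_stochastic_steps, subset_size, rng):
    """Hypothetical rule: pick distinct stochastic steps uniformly at random."""
    picked = rng.choice(num_stochastic_steps, size=subset_size, replace=False)
    return np.sort(picked)

def subset_policy_loss(log_probs, rewards, step_idx):
    """REINFORCE-style surrogate restricted to the chosen steps.

    log_probs has shape (group_size, num_stochastic_steps); the loss is
    the negative mean of advantage * log-prob over samples and chosen steps.
    """
    adv = group_relative_advantages(rewards)
    chosen = np.asarray(log_probs, dtype=np.float64)[:, step_idx]
    return float(-(adv[:, None] * chosen).mean())

rng = np.random.default_rng(0)
rewards = [0.2, 0.5, 0.9, 0.4]            # one reward per sample in the group
log_probs = rng.normal(size=(4, 3))       # placeholder per-step log-probabilities
steps = sample_step_subset(3, 2, rng)     # use 2 of the 3 stochastic steps
loss = subset_policy_loss(log_probs, rewards, steps)
```

In a real training loop the log-probabilities would come from the generator's stochastic transitions and the loss would be differentiated with an autodiff framework; the placeholder arrays here only demonstrate shapes and arithmetic.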

avahal

2 days ago

the core idea of tilting the teacher with a reward and then splitting the KL into a dist-matching term and a reward-maximization term is clean and practically appealing. stage i's ambient-consistent distribution matching and stage ii's hybrid gradient with step-subset grpo look like they stabilize training in a tight 4-step regime. the arxivlens breakdown helped me parse the method details, especially how the consistency regularizer keeps the fake score aligned as the generator shifts (https://arxivlens.com/PaperView/Details/reinforcing-few-step-generators-via-reward-tilted-distribution-matching-8719-f7e85876). my one question is about ablations on the ac-dmd subintervals: how sensitive is performance to the number of subintervals, and did you try adaptive or learned partitioning rather than fixed blocks? this seems like a solid blueprint for fast, preference-aligned generation, with a practical angle for real-world deployment.

Harahan

Paper author 2 days ago


edited 2 days ago

Hi, we directly adopt 4 subintervals for the 4-step generator, which is the natural choice. We leave other settings to future work. Thanks for your interest.
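
As a small illustration of the "one subinterval per step" choice described in the reply above, the sketch below splits the time axis into equal-width pieces. The equal-width boundaries and the convention that time runs from 1.0 (noise) to 0.0 (data) are assumptions made here; the thread does not say how the boundaries are placed, and `uniform_subintervals` is a hypothetical helper.

```python
def uniform_subintervals(num_steps, t_start=1.0, t_end=0.0):
    """Split [t_start, t_end] into num_steps equal-width (t_from, t_to) pairs."""
    width = (t_end - t_start) / num_steps
    bounds = [t_start + i * width for i in range(num_steps + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

intervals = uniform_subintervals(4)
# [(1.0, 0.75), (0.75, 0.5), (0.5, 0.25), (0.25, 0.0)]
```

Each pair would then correspond to one generator step and one region over which distribution matching is performed.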



