arxiv:2605.21981

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Published on May 21 · Submitted by le.zhang on May 21
Authors: Le Zhang, Ning Mang, Aishwarya Agrawal

Abstract

Flow matching in representation spaces with improved statistical properties enables efficient diffusion model training with reduced parameters and fast sampling.

AI-generated summary

Flow matching with x-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space [li2025back]. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both d ≈ 33) yet DINOv2 exhibits 7.3× higher effective rank, 35× better covariance conditioning, 11.5× lower excess kurtosis, and 1.7× lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by x-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet 256×256, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT^DH-XL with 19% fewer parameters (676M vs. 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, 5 Heun steps already reach FID 2.0 and 10 steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
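
For readers new to x-prediction, the objective can be sketched as follows. This assumes one common flow-matching convention (a linear path from noise at t = 0 to data at t = 1); the symbols z_t, ε, and x̂_θ and the time orientation are illustrative assumptions, and the paper's exact convention and loss weighting may differ.

```latex
\begin{aligned}
z_t &= t\,x + (1-t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),\quad t \in [0,1], \\
v   &= \frac{\mathrm{d}z_t}{\mathrm{d}t} = x - \epsilon, \\
\mathcal{L}_{x} &= \mathbb{E}_{x,\epsilon,t}\,\bigl\| \hat{x}_\theta(z_t, t) - x \bigr\|^2, \\
\hat{v}_\theta(z_t, t) &= \frac{\hat{x}_\theta(z_t, t) - z_t}{1 - t}.
\end{aligned}
```

Under this convention x would be the frozen DINOv2 feature of an image, the network regresses x directly, and the velocity needed by the ODE solver is recovered from the prediction through the last line, which follows from x - z_t = (1 - t)(x - ε).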

Community

Paper submitter

RiT-XL: Vanilla Diffusion Transformers Are Enough in Representation Space

This repository hosts the released RiT-XL checkpoint, saved at epoch 740 of an
800-epoch training run on ImageNet 256×256 with frozen DINOv2-Small features.

GitHub: https://github.com/lezhang7/RiT
Paper: https://arxiv.org/abs/2605.21981

Results on ImageNet 256×256

Method           Encoder         Params   FID ↓ (CFG=1)   FID ↓ (CFG≈3.7)
DiT-XL           SD-VAE          675M     9.62            2.27
SiT-XL           SD-VAE          675M     8.61            2.06
REPA-XL          SD-VAE          675M     5.78            1.29
DDT-XL           SD-VAE          675M     6.27            1.26
REG-XL           SD-VAE          675M     1.80            1.36
RAE-XL           DINOv2-S        676M     1.87            1.41
RAE-XL^DH        DINOv2-B        839M     1.51            1.16
FAE-XL           FAE-DINOv2-G    675M     1.48            1.29
RiT-XL (ours)    DINOv2-S        676M     1.45            1.14

All FIDs use 25 Heun steps with the time-shift schedule.

Few-step generation (no distillation, no consistency training):

Heun steps       5      10     25     50
FID (CFG=1.0)    2.44   1.59   1.47   1.46
FID (CFG=3.7)    1.99   1.27   1.15   1.15
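
To make explicit what a "Heun step" involves, here is a minimal sketch of Heun's second-order method on a toy scalar ODE with a known solution. The function `heun_integrate` and the toy velocity are hypothetical and only illustrate the integrator; a real sampler would call the denoiser for the velocity and use the time-shifted step schedule, neither of which is reproduced here.

```python
import math

def heun_integrate(velocity, x0, num_steps, t0=0.0, t1=1.0):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with Heun's method.

    Returns the final state and the number of velocity evaluations.
    """
    x, evals = x0, 0
    dt = (t1 - t0) / num_steps
    for i in range(num_steps):
        t = t0 + i * dt
        v_start = velocity(x, t)              # slope at the start of the step
        x_euler = x + dt * v_start            # Euler predictor
        v_end = velocity(x_euler, t + dt)     # slope at the predicted endpoint
        x = x + 0.5 * dt * (v_start + v_end)  # trapezoidal corrector
        evals += 2
    return x, evals

# Toy problem with a known answer: dx/dt = x, x(0) = 1, so x(1) = e.
toy_velocity = lambda x, t: x
for steps in (5, 10, 25, 50):
    x1, evals = heun_integrate(toy_velocity, 1.0, steps)
    print(steps, evals, abs(x1 - math.e))
```

In this sketch every step costs two velocity evaluations, and with classifier-free guidance each evaluation would typically need both a conditional and an unconditional forward pass. Whether the step counts in the table follow exactly this accounting (for example, whether the final step falls back to a single Euler update) is not stated on this page.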

Quick start

The full training/inference code lives at
lezhang7/RiT. The eval script auto-pulls
this checkpoint plus the matching RAE decoder on first run:

git clone https://github.com/lezhang7/RiT.git
cd RiT
pip install -r requirements.txt
bash scripts/eval.sh        # CFG=3.7, FID ~1.14 on ImageNet 256x256

To download just the weights manually:

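import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub (cached locally after the first call).
ckpt = hf_hub_download(repo_id="le723z/RiT", filename="checkpoint-last.pth")
# weights_only=False is needed because the file also stores the optimizer state
# and a pickled argparse namespace; only load checkpoints you trust.
state = torch.load(ckpt, map_location="cpu", weights_only=False)
# state['model'] / state['model_ema1'] / state['model_ema2'] are the
# trainable + two EMA-decay parameter dictionaries.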
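
Because the checkpoint holds three parameter dictionaries, it can help to make the choice explicit. The helper below is a hypothetical sketch (`select_weights` is not part of the RiT repository); it works on any dict with the keys named above, prefers the EMA weights, and falls back to the raw parameters.

```python
def select_weights(state, preferred=("model_ema1", "model_ema2", "model")):
    """Return (key, params) for the first preferred key present in `state`."""
    for key in preferred:
        if key in state and state[key] is not None:
            return key, state[key]
    raise KeyError(f"none of {preferred} found; available keys: {sorted(state)}")

# Stand-in for the dict returned by torch.load; real values are tensor dicts.
fake_state = {"model": {"w": 1.0}, "model_ema1": {"w": 0.9}, "epoch": 740}
key, params = select_weights(fake_state)
print(key)  # model_ema1
```

Assuming the dictionaries follow the usual state_dict layout, the selected one would then be passed to the model's `load_state_dict` once the model has been constructed.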

Checkpoint contents

checkpoint-last.pth is a PyTorch checkpoint produced after 740 training
epochs (the released model used for the paper's headline numbers). Top-level
keys:

  • model — main parameters of the Denoiser (RiT-XL backbone).
  • model_ema1 — EMA decay 0.9999 (used for sampling by default).
  • model_ema2 — EMA decay 0.9996 (tracked but unused at inference).
  • optimizer — AdamW state for resuming training.
  • epoch: 740 (the epoch at which this checkpoint was saved).
  • args — argparse namespace from the original training run (legacy
    JiT-RAE-XL/16 model name; the architecture matches the released
    RiT-XL/16).

Loading uses only model / model_ema*, so the legacy args field does not
matter — eval.sh constructs the model from the CLI flags.
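
The two EMA tracks differ only in their decay. As a rough illustration, the sketch below assumes the standard update ema ← decay · ema + (1 − decay) · param (the training code is not shown on this page, so this form is an assumption) and computes the approximate averaging horizon 1 / (1 − decay) for each decay. The function names are hypothetical.

```python
def ema_update(ema_value, param_value, decay):
    """One step of the standard exponential moving average update."""
    return decay * ema_value + (1.0 - decay) * param_value

def ema_horizon(decay):
    """Approximate number of recent steps that dominate the average."""
    return 1.0 / (1.0 - decay)

for name, decay in (("model_ema1", 0.9999), ("model_ema2", 0.9996)):
    print(name, round(ema_horizon(decay)))
```

Under this assumption, decay 0.9999 averages over roughly 10,000 optimizer steps and decay 0.9996 over roughly 2,500, so model_ema1 is the smoother of the two tracks.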

Model details

  • Architecture: vanilla Diffusion Transformer — 28 layers, hidden 1152,
    16 heads, SwiGLU FFN, RMSNorm, QK-norm, 2D VisionRoPE, 32 in-context class
    tokens, joint [CLS]-patch modeling.
  • Encoder (frozen): facebook/dinov2-with-registers-small (d=384).
  • Decoder (frozen): ViT-MAE-style decoder from
    nyu-visionx/RAE-collections,
    variant decoders/dinov2/wReg_small/ViTXL_n08/model.pt.
  • Parameters (denoiser only): 676M.
  • Training: 8×H200, batch 1536 effective, AdamW lr=5e-5, 800 epochs (this
    ckpt: epoch 740), x-prediction loss, dimension-aware time shift
    (s ≈ 4.9), CLS auxiliary loss weight λ=0.2.
  • Sampling defaults: Heun, 25 steps, time-shift schedule, CFG=3.7 in
    interval [0.1, 0.98], coupled-noise initialization for [CLS].
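
Two of the settings above can be sketched in a few lines, under assumptions that this page does not confirm. First, the time shift is assumed to take the common form u' = s·u / (1 + (s − 1)·u) on the noise level u, with s = sqrt(D / D_base), D_base = 4096, and D = 256 patch tokens × 384 channels; this choice happens to reproduce s ≈ 4.9, but the exact formula used by RiT may differ. Second, guidance is assumed to be the usual v_uncond + w·(v_cond − v_uncond), applied only inside the stated interval, with the plain conditional prediction used outside it. All function names are hypothetical.

```python
import math

def dimension_aware_shift(num_tokens=256, channels=384, base_dim=4096):
    """Shift factor s = sqrt(D / base_dim); base_dim=4096 is an assumption."""
    return math.sqrt(num_tokens * channels / base_dim)

def shift_noise_level(u, s):
    """Map a noise level u in [0, 1] to a shifted one (larger s -> more noise)."""
    return s * u / (1.0 + (s - 1.0) * u)

def guided_velocity(v_cond, v_uncond, t, scale=3.7, interval=(0.1, 0.98)):
    """Classifier-free guidance applied only inside the time interval."""
    lo, hi = interval
    if lo <= t <= hi:
        return v_uncond + scale * (v_cond - v_uncond)
    return v_cond  # outside the interval: plain conditional prediction

s = dimension_aware_shift()
print(round(s, 2))  # 4.9
print(shift_noise_level(0.5, s))
print(guided_velocity(1.0, 0.5, t=0.5), guided_velocity(1.0, 0.5, t=0.99))
```

Which end of the trajectory the interval [0.1, 0.98] refers to (data side or noise side) depends on the time convention of the sampler and is not specified here.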

Citation

@article{zhang2025rit,
  title  = {RiT: Vanilla Diffusion Transformers Are Enough in Representation Space},
  author = {Zhang, Le and Mang, Ning and Agrawal, Aishwarya},
  year   = {2025}
}

Acknowledgments

This release reuses the frozen DINOv2 encoder + ViT decoder pairing from
RAE and adopts the modernized DiT
block design + in-context class tokens from JiT.


