Paper: Scalable Diffusion Models with Transformers (arXiv:2212.09748)
A Diffusion Transformer (DiT) trained on CelebA-HQ for unconditional face image generation at 128x128 resolution. The model uses a Vision Transformer backbone in place of the conventional U-Net for the denoising network, operating in the latent space of a VQ-VAE.
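This is a two-stage pipeline: a VQ-VAE first compresses images into a discrete latent space, and the DiT is then trained as the denoising network in that latent space.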
The DiT (Peebles & Xie, 2023) replaces the U-Net backbone with a standard Vision Transformer (ViT) encoder. Each image latent is divided into non-overlapping patches, linearly embedded, and processed by a stack of transformer blocks with time-step conditioning via adaptive layer norm.
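
As a rough illustration of the patchify step and the adaptive layer norm conditioning described above, the following NumPy sketch splits a latent into non-overlapping patches, embeds them linearly, and applies a shift/scale modulation derived from a time-step embedding. The function names (`patchify`, `layer_norm`, `adaln_modulate`), the random weights, and the (4, 16, 16) latent shape are assumptions made for illustration; this is not the repository's code.

```python
import numpy as np

def patchify(latent, patch_size):
    """Split a (C, H, W) latent into (num_patches, patch_size*patch_size*C) rows."""
    c, h, w = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (C, gh, p, gw, p) -> (gh, gw, p, p, C) -> (gh*gw, p*p*C)
    x = latent.reshape(c, gh, patch_size, gw, patch_size)
    x = x.transpose(1, 3, 2, 4, 0)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(x, shift, scale):
    """Adaptive layer norm: normalize, then apply conditioning-dependent scale and shift."""
    return layer_norm(x) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 16, 16))            # assumed latent shape (C, H, W)
tokens = patchify(latent, patch_size=2)              # (64, 16)
w_embed = rng.standard_normal((16, 768)) * 0.02      # hypothetical linear patch embedding
hidden = tokens @ w_embed                            # (64, 768)

t_emb = rng.standard_normal(768)                     # stand-in for a time-step embedding
w_mod = rng.standard_normal((768, 2 * 768)) * 0.02   # hypothetical projection to (shift, scale)
shift, scale = np.split(t_emb @ w_mod, 2)
out = adaln_modulate(hidden, shift, scale)           # (64, 768)
```

In a full DiT block the modulation parameters are produced by a learned layer and applied around both the attention and MLP sub-layers; the sketch shows only a single modulation.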

DiT configuration:

| Parameter | Value |
|---|---|
| Patch size | 2 |
| Transformer layers | 12 |
| Hidden dimension | 768 |
| Attention heads | 12 |
| Head dimension | 64 |
| Time embedding dim | 768 |
| Input resolution | 128x128 (latent: ~16x16x4) |
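
The derived quantities implied by this table can be checked with a few lines of arithmetic. The sketch below takes the approximate 16x16x4 latent size at face value; the variable names are illustrative only.

```python
# Values taken from the table above; 16x16x4 is the approximate latent size it reports.
latent_h, latent_w, latent_c = 16, 16, 4
patch_size = 2
hidden_dim = 768
num_heads = 12

# Each non-overlapping patch becomes one token.
num_tokens = (latent_h // patch_size) * (latent_w // patch_size)
# Length of a flattened patch before the linear embedding.
patch_dim = patch_size * patch_size * latent_c
# The hidden dimension is split evenly across attention heads.
head_dim = hidden_dim // num_heads

print(num_tokens, patch_dim, head_dim)  # 64 16 64
```

The resulting head dimension of 64 matches the value listed in the table.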

VQ-VAE configuration:

| Parameter | Value |
|---|---|
| Latent channels (z) | 4 |
| Codebook size | 8192 |
| Down channels | [128, 256, 384] |
| Downsampling stages | 2 |
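
To make the role of the codebook concrete, here is a minimal sketch of the vector-quantization lookup: each latent vector is replaced by its nearest codebook entry. The function name `quantize`, the random codebook, and the 16x16 grid of latent positions are assumptions for illustration. A trained VQ-VAE learns its codebook and typically uses a straight-through gradient estimator, neither of which is shown here.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z (N, D) to its nearest codebook row (K, D) by squared L2 distance."""
    # ||z - e||^2 = ||z||^2 - 2 z.e + ||e||^2, computed for all pairs at once
    d = (z ** 2).sum(axis=1, keepdims=True) - 2.0 * z @ codebook.T + (codebook ** 2).sum(axis=1)
    indices = d.argmin(axis=1)
    return codebook[indices], indices

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8192, 4))   # codebook size x latent channels (random stand-in)
z = rng.standard_normal((16 * 16, 4))       # one vector per latent position (assumed 16x16 grid)
z_q, idx = quantize(z, codebook)            # quantized vectors and their codebook indices
```

Each entry of `idx` is an integer in the range 0 to 8191, so an image is represented as a grid of codebook indices after quantization.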

Diffusion schedule:

| Parameter | Value |
|---|---|
| Timesteps (T) | 1000 |
| Beta schedule | Linear, start=0.0001, end=0.02 |
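
The schedule in this table can be sketched in plain Python as follows. This assumes the betas themselves are spaced linearly between the start and end values, as in the original DDPM formulation; some implementations instead space the square roots of the betas linearly, and the table does not say which variant is used. The helper name `noise_sample` is hypothetical.

```python
import math

T = 1000
beta_start, beta_end = 1e-4, 0.02

# Linearly spaced betas from start to end, inclusive (assumed variant).
betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

# Cumulative product of (1 - beta_t): fraction of signal variance kept after t steps.
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def noise_sample(x0, eps, t):
    """Forward process q(x_t | x_0): sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
```

Under this assumption the cumulative product is close to zero at the last timestep, so a fully noised latent is nearly indistinguishable from pure Gaussian noise.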

Training configuration:

| Stage | Epochs | LR | Batch size |
|---|---|---|---|
| VQ-VAE | 10 | 1e-5 | 4 |
| DiT | 500 | 1e-5 | 32 |

Dataset loader: ParquetImageDataset

The repository includes generated face samples in celebhq/samples/ (x0_*.jpg), produced by running the trained DiT through reverse diffusion starting from Gaussian noise.
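
For readers unfamiliar with reverse diffusion, the loop below is a minimal sketch of DDPM ancestral sampling with a noise-predicting model. It is not the repository's sampler: `predict_noise` is a placeholder that returns zeros where the trained DiT would be called, the linearly spaced betas and the (4, 16, 16) latent shape are assumptions, and the final VQ-VAE decoding step from latent to image is omitted.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear spacing of beta
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t):
    """Placeholder for the trained DiT; a real model would return its noise estimate."""
    return np.zeros_like(x_t)

def sample(shape, rng):
    """DDPM ancestral sampling: start from Gaussian noise and denoise for t = T-1, ..., 0."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        # Posterior mean of x_{t-1} given x_t and the predicted noise
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Variance of q(x_{t-1} | x_t, x_0); no noise is added at the final step
            var = betas[t] * (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t])
            x = mean + np.sqrt(var) * rng.standard_normal(shape)
        else:
            x = mean
    return x

latent = sample((4, 16, 16), np.random.default_rng(0))  # would then be decoded by the VQ-VAE
```

With the placeholder model the output is not a meaningful image latent; the sketch only shows the control flow and the per-step update.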

Repository contents:

| Path | Description |
|---|---|
| celeba.py | Parquet-based CelebA-HQ dataloader |
| celeba/config.yaml | Full training configuration |
| celebhq/dit_ckpt.pth | Trained DiT checkpoint |
| celebhq/samples/ | Generated sample images |

License: MIT