Diffusion Transformer (DiT) — CelebA-HQ Face Generation

A Diffusion Transformer (DiT) trained on CelebA-HQ for unconditional face image generation at 128x128 resolution. The model uses a Vision Transformer backbone in place of the conventional U-Net for the denoising network, operating in the latent space of a VQ-VAE.

Model Description

This is a two-stage pipeline:

Stage 1 — VQ-VAE: Compresses 128x128 RGB images into a 4-channel discrete latent space with codebook size 8192.
Stage 2 — DiT: A transformer-based denoising model that operates on flattened image patches in the VQ-VAE latent space.

DiT Architecture

The DiT (Peebles & Xie, 2023) replaces the U-Net backbone with a standard Vision Transformer (ViT) encoder. Each image latent is divided into non-overlapping patches, and each patch is linearly embedded as a token. The tokens are processed by a stack of transformer blocks, with timestep conditioning injected via adaptive layer norm (adaLN).
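
As a rough illustration of adaptive layer norm conditioning, the sketch below normalizes each token without learned affine parameters and then applies a scale and shift regressed from the timestep embedding. It uses NumPy with placeholder weights; the names `ada_layer_norm`, `w_mod`, and `b_mod` are hypothetical, and the details of the repository's actual block (for example, whether it also uses the gating of adaLN-Zero from Peebles & Xie) are not documented here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over the feature axis, with no learned scale/shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ada_layer_norm(x, t_emb, w_mod, b_mod):
    # x: (tokens, hidden), t_emb: (time_dim,)
    # w_mod: (time_dim, 2 * hidden), b_mod: (2 * hidden,)  (hypothetical names)
    shift, scale = np.split(t_emb @ w_mod + b_mod, 2)
    return layer_norm(x) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
hidden, time_dim, tokens = 768, 768, 64  # token count is an assumed example value
x = rng.normal(size=(tokens, hidden))
t_emb = rng.normal(size=(time_dim,))
w_mod = np.zeros((time_dim, 2 * hidden))  # zero init: modulation starts as identity
b_mod = np.zeros(2 * hidden)
out = ada_layer_norm(x, t_emb, w_mod, b_mod)
```

With zero-initialized modulation weights the output equals the plain layer norm; during training the projection would learn timestep-dependent scales and shifts.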

Parameter	Value
Patch size	2
Transformer layers	12
Hidden dimension	768
Attention heads	12
Head dimension	64
Time embedding dim	768
Input resolution	128x128 (latent: ~16x16x4)
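
The table values can be related with a little arithmetic. The NumPy sketch below splits a latent into non-overlapping patches and checks the head dimension. It assumes the latent is exactly 16x16x4 (the table gives the size only approximately) and a channel-last layout; `patchify` is a hypothetical helper, not a function from the repository.

```python
import numpy as np

def patchify(latent, patch_size):
    # latent: (H, W, C) -> (num_patches, patch_size * patch_size * C)
    h, w, c = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = latent.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return x.reshape(gh * gw, patch_size * patch_size * c)

# Assumed latent size of 16x16x4, filled with a ramp so patches are easy to inspect.
latent = np.arange(16 * 16 * 4, dtype=np.float32).reshape(16, 16, 4)
tokens = patchify(latent, patch_size=2)
head_dim = 768 // 12  # hidden dimension / attention heads
```

Under these assumptions each latent becomes 64 tokens of 16 values each, which a linear layer would then project to the 768-dimensional hidden size. The computed head dimension matches the 64 listed in the table.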

VQ-VAE Architecture

Parameter	Value
Latent channels (z)	4
Codebook size	8192
Down channels	[128, 256, 384]
Downsampling stages	2
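
For readers unfamiliar with vector quantization, the following NumPy sketch shows the nearest-code lookup at the heart of a VQ-VAE (van den Oord et al., 2017), using the codebook size and latent channel count from the table. The codebook here is random and `quantize` is a hypothetical helper; the repository's own quantizer, its losses, and its exact latent size may differ.

```python
import numpy as np

def quantize(z, codebook):
    # z: (N, C) encoder outputs, codebook: (K, C) code vectors.
    # Squared L2 distance between every latent vector and every code: (N, K).
    d = (z ** 2).sum(1, keepdims=True) - 2.0 * z @ codebook.T + (codebook ** 2).sum(1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 4))  # codebook size 8192, 4 latent channels
z = rng.normal(size=(16 * 16, 4))      # assumed 16x16 latent grid, flattened
z_q, idx = quantize(z, codebook)
```

Each latent vector is replaced by its closest codebook entry, so `idx` holds one integer code per spatial position.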

Diffusion Process

Parameter	Value
Timesteps (T)	1000
Beta schedule	Linear, start=0.0001, end=0.02
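
The schedule above can be sketched in plain Python as follows, together with the DDPM forward (noising) step from Ho et al. (2020). This assumes beta itself is interpolated linearly between the start and end values (some codebases interpolate the square root instead); the function names are illustrative and not taken from the repository.

```python
import math

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Evenly spaced betas from beta_start to beta_end, inclusive.
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    # alpha_bar_t = product over s <= t of (1 - beta_s)
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def q_sample(x0, noise, t, alpha_bars):
    # Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]

betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
```

With these settings the cumulative product falls very close to zero by the final timestep, so the fully noised sample is nearly pure Gaussian noise.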

Training Details

Stage	Epochs	LR	Batch size
VQ-VAE	10	1e-5	4
DiT	500	1e-5	32

Dataset: CelebA-HQ, center-cropped and resized to 128x128, with pixel values normalized to [-1, 1].
Images are loaded from Parquet files via a custom ParquetImageDataset.
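
A minimal sketch of the preprocessing arithmetic described above, in plain Python. It covers only the center-crop box and the [-1, 1] pixel mapping; the helper names are hypothetical, and the actual ParquetImageDataset in celeba.py may implement these steps differently (for example, with an image library's transforms).

```python
def center_crop_box(width, height):
    # Largest centered square, as (left, top, right, bottom).
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

def normalize_pixel(value):
    # Map an 8-bit value in [0, 255] to [-1, 1].
    return value / 127.5 - 1.0

def denormalize_pixel(value):
    # Inverse map, rounded and clamped to the valid 8-bit range.
    return max(0, min(255, round((value + 1.0) * 127.5)))
```

The inverse mapping is what a sampling script would apply before saving generated images.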

Generated Samples

The repository includes generated face samples in celebhq/samples/ (x0_*.jpg), produced by running the trained DiT in reverse diffusion from Gaussian noise.
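
Reverse diffusion from Gaussian noise can be sketched as the standard DDPM ancestral sampler of Ho et al. (2020). The NumPy code below is an assumption-laden illustration: `dummy_model` stands in for the trained DiT, the 16x16x4 latent shape is taken from the approximate value in the architecture table, and the repository's own sampling script may use a different variance choice or sampler.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)  # start from pure Gaussian noise x_T
    for t in range(len(betas) - 1, -1, -1):
        eps = predict_noise(x, t)
        # Posterior mean of x_{t-1} given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Variance choice sigma_t^2 = beta_t.
            x = mean + np.sqrt(betas[t]) * rng.normal(size=shape)
        else:
            x = mean  # no noise is added on the final step
    return x

def dummy_model(x, t):
    # Hypothetical stand-in for the trained DiT; always predicts zero noise.
    return np.zeros_like(x)

betas = np.linspace(1e-4, 0.02, 1000)
sample = ddpm_sample(dummy_model, (16, 16, 4), betas, np.random.default_rng(0))
```

In the two-stage pipeline, the result of this loop is a latent rather than an image, so it would still need to pass through the VQ-VAE decoder to produce a 128x128 face.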

Repository Contents

Path	Description
celeba.py	Parquet-based CelebA-HQ dataloader
celeba/config.yaml	Full training configuration
celebhq/dit_ckpt.pth	Trained DiT checkpoint
celebhq/samples/	Generated sample images

References

Peebles & Xie (2023). Scalable Diffusion Models with Transformers
Ho et al. (2020). Denoising Diffusion Probabilistic Models
van den Oord et al. (2017). Neural Discrete Representation Learning (VQ-VAE)

License

MIT
