Diffusion Models#
Forward Process#
Gradually add Gaussian noise over T steps:
q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)
Reparameterized: xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε, ε~N(0,I)
where ᾱₜ = ∏ᵢ₌₁ᵗ (1-βᵢ)
For large T, ᾱ_T → 0, so x_T ≈ N(0,I)
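The reparameterized forward process can be sketched in a few lines of NumPy; the linear β schedule below is one common choice (the one used in DDPM), not the only option:

```python
import numpy as np

def forward_sample(x0, t, betas, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in one shot via the reparameterization."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])          # ᾱ_t = ∏ᵢ (1 - βᵢ)
    eps = rng.standard_normal(x0.shape)                # ε ~ N(0, I)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)                     # linear schedule
x0 = np.ones(4)
xT, _ = forward_sample(x0, T - 1, betas)               # ᾱ_T ≈ 0, so xT ≈ pure noise
```

Because ᾱ_T is tiny at t = T, the signal term √ᾱ_T·x₀ vanishes and x_T is essentially a standard Gaussian, which is what lets sampling start from pure noise.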
Reverse Process#
Learn to denoise: p_θ(xₜ₋₁|xₜ) = N(xₜ₋₁; μ_θ(xₜ,t), Σ_θ(xₜ,t))
Training Objective (DDPM)#
Simplified: predict the noise ε that was added:
L_simple = E_{t,x₀,ε} [‖ε - ε_θ(xₜ, t)‖²]
Equivalent to denoising score matching.
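A minimal sketch of one training step under this objective; `eps_model` is a placeholder for the real noise-prediction network:

```python
import numpy as np

def l_simple(eps, eps_pred):
    """Mean squared error between true and predicted noise."""
    return np.mean((eps - eps_pred) ** 2)

def train_step(eps_model, x0, betas, rng):
    """One step of the simplified DDPM objective."""
    t = rng.integers(len(betas))                       # uniform random timestep (0-indexed)
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    eps = rng.standard_normal(x0.shape)                # sample noise
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return l_simple(eps, eps_model(xt, t))             # predict ε, compare
```

In practice the loss would be backpropagated through the network; here the model is just a callable, to show the data flow.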
DDPM Sampling#
x_T ~ N(0,I)
for t = T,…,1:
  z ~ N(0,I) if t > 1, else z = 0
  x_{t-1} = (1/√αₜ)·(xₜ - (βₜ/√(1-ᾱₜ))·ε_θ(xₜ,t)) + σₜ·z
T=1000 steps for high quality.
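The ancestral sampling loop above, sketched in NumPy with σₜ = √βₜ (one common choice for the reverse variance); `eps_model` stands in for the trained noise predictor:

```python
import numpy as np

def ddpm_sample(eps_model, betas, shape, rng=np.random.default_rng(0)):
    """DDPM ancestral sampling: start from noise, denoise step by step."""
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                     # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        eps = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z               # σ_t = √β_t
    return x
```

Note the loop calls the network once per step, which is why step count dominates sampling cost.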
DDIM (Faster Sampling)#
Deterministic sampler; 50–100 steps often suffice instead of 1000:
x_{t-1} = √ᾱ_{t-1}·x̂₀(xₜ) + √(1-ᾱ_{t-1})·ε_θ(xₜ,t)
where x̂₀(xₜ) = (xₜ - √(1-ᾱₜ)·ε_θ(xₜ,t)) / √ᾱₜ is the predicted clean image.
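A single deterministic DDIM (η = 0) update is compact enough to write directly:

```python
import numpy as np

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One DDIM update: estimate x̂₀ from the noise prediction, then re-noise to t-1."""
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps
```

Because the update is deterministic given ε_θ, timesteps can be skipped: the step from ᾱₜ to any earlier ᾱ is valid, which is what allows 50–100 steps instead of 1000.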
Latent Diffusion Models (LDM)#
Run diffusion in compressed latent space z = E(x) instead of pixel space:
- Encoder E: image → latent z (8× spatial downsampling in Stable Diffusion)
- Diffusion: train the denoiser in z-space, substantially cheaper per step than pixel space since the latent has far fewer elements
- Decoder D: latent → image
Stable Diffusion = LDM conditioned on text via CLIP embeddings.
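The pipeline shape can be sketched with toy stand-ins; a real LDM uses a learned VAE for E/D and a U-Net or DiT denoiser, so everything below is a placeholder chosen only to show where each piece runs:

```python
import numpy as np

def E(x):
    """Toy encoder: 8x downsampling via average pooling (a real LDM uses a VAE)."""
    return x.reshape(x.shape[0] // 8, 8).mean(axis=1)

def D(z):
    """Toy decoder: nearest-neighbor upsampling back to image resolution."""
    return np.repeat(z, 8)

def generate(denoise, steps, latent_dim, rng=np.random.default_rng(0)):
    """Generation runs entirely in latent space; decode happens once at the end."""
    z = rng.standard_normal(latent_dim)        # start from latent noise
    for t in range(steps, 0, -1):
        z = denoise(z, t)                      # every expensive step is in z-space
    return D(z)
```

The key structural point: the denoiser never touches pixels, so all the per-step cost scales with the latent size, not the image size.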
Text Conditioning#
Cross-attention in U-Net/DiT: query from image features, key/value from text embeddings.
Classifier-free guidance: train with and without conditioning; at inference, blend: ε̂ = ε_uncond + w(ε_cond - ε_uncond)
w=7.5 typical. Higher w → more prompt adherence, less diversity.
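The guidance blend is a one-liner; note it extrapolates past the conditional prediction when w > 1:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: ε̂ = ε_uncond + w·(ε_cond - ε_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At w = 1 this recovers the plain conditional prediction and at w = 0 the unconditional one; typical values like 7.5 push well beyond the conditional direction, which is the source of both the stronger prompt adherence and the reduced diversity.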
Architecture Choices#
| Architecture | Use |
|---|---|
| U-Net | DDPM, Stable Diffusion 1/2 |
| DiT (Diffusion Transformer) | Stable Diffusion 3, FLUX, Sora |