# Diffusion Models

## Forward Process

Gradually add Gaussian noise over T steps:

q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)

Reparameterized: xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε, ε~N(0,I)

where ᾱₜ = ∏ᵢ₌₁ᵗ (1-βᵢ)

As T → ∞, x_T approaches N(0, I), so sampling can start from pure noise.
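The closed-form jump to any step t can be sketched as follows (a minimal NumPy illustration; the linear β schedule and the name `q_sample` are conventional choices, not from a specific library):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear β_t schedule (DDPM-style)
alphas = 1.0 - betas                    # α_t = 1 − β_t
alpha_bars = np.cumprod(alphas)         # ᾱ_t = ∏_{i≤t} α_i

def q_sample(x0, t, eps):
    """Sample x_t directly from x_0 via the reparameterization."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

x0 = np.random.randn(4)                 # toy "image" of 4 values
eps = np.random.randn(4)
x_last = q_sample(x0, T - 1, eps)       # ᾱ_{T−1} ≈ 0, so x_{T−1} ≈ eps
```

Note that ᾱ_T shrinks toward 0, which is exactly why the chain ends near N(0, I).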

## Reverse Process

Learn to denoise: p_θ(xₜ₋₁|xₜ) = N(xₜ₋₁; μ_θ(xₜ,t), Σ_θ(xₜ,t)). In DDPM, Σ_θ is fixed to σₜ²I rather than learned.

## Training Objective (DDPM)

Simplified: predict the noise ε that was added:

L_simple = E_{t,x₀,ε} [‖ε - ε_θ(xₜ, t)‖²]

Equivalent, up to a time-dependent weighting, to denoising score matching.
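One training step under L_simple can be sketched like this (NumPy-only; `eps_model` is a stand-in for the trained network ε_θ, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    # Placeholder for the learned noise predictor ε_θ(xₜ, t).
    return np.zeros_like(x_t)

def ddpm_loss(x0):
    """One Monte Carlo sample of L_simple: random t, random ε, MSE on ε."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

loss = ddpm_loss(np.zeros(8))
```

In a real implementation, `eps_model` is a U-Net or DiT and the loss is backpropagated per minibatch.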

## DDPM Sampling

1. x_T ~ N(0, I)
2. For t = T, …, 1:
   • z ~ N(0, I) if t > 1, else z = 0
   • x_{t−1} = (1/√αₜ)(xₜ − (βₜ/√(1−ᾱₜ))·ε_θ(xₜ, t)) + σₜ·z

Typically T = 1000 steps are used for high-quality samples.
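The ancestral sampling loop can be sketched as follows (NumPy-only; `eps_model` is a placeholder for the trained network, and σₜ = √βₜ is one of the standard variance choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    # Placeholder for the trained noise predictor ε_θ.
    return np.zeros_like(x_t)

def ddpm_sample(shape):
    x = rng.standard_normal(shape)                 # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):                 # t = T, …, 1 (0-indexed)
        z = rng.standard_normal(shape) if t > 0 else 0.0
        eps = eps_model(x, t)
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = x + np.sqrt(betas[t]) * z              # σₜ = √βₜ variance choice
    return x

sample = ddpm_sample((4,))
```

This is the expensive part: one network evaluation per step, T times.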

## DDIM (Faster Sampling)

A deterministic sampler that can use 50–100 steps instead of 1000:

x_{t-1} = √ᾱ_{t-1}·x̂₀(xₜ) + √(1-ᾱ_{t-1})·ε_θ(xₜ,t)

where x̂₀(xₜ) = (xₜ − √(1−ᾱₜ)·ε_θ(xₜ,t)) / √ᾱₜ is the predicted clean image.

## Latent Diffusion Models (LDM)

Run diffusion in compressed latent space z = E(x) instead of pixel space:

  • Encoder E: image → latent (8× spatial downsampling in Stable Diffusion)
  • Diffusion: trained in latent space (far fewer spatial positions, so much cheaper per step)
  • Decoder D: latent → image

Stable Diffusion = LDM conditioned on text via CLIP embeddings.
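The shape bookkeeping above can be sketched with stand-in encoder/decoder functions (shapes follow the Stable Diffusion convention of an 8× downsampling VAE with 4 latent channels; `encode`/`decode` are shape-only placeholders, not real networks):

```python
import numpy as np

def encode(x):                      # E: image → latent
    h, w, _ = x.shape
    return np.zeros((h // 8, w // 8, 4))

def decode(z):                      # D: latent → image
    h, w, _ = z.shape
    return np.zeros((h * 8, w * 8, 3))

img = np.zeros((512, 512, 3))       # 512×512 RGB image
z = encode(img)                     # 64×64×4 latent: diffusion runs here
out = decode(z)                     # back to 512×512×3
```

Running the denoiser on a 64×64 latent instead of a 512×512 image is the entire efficiency argument for LDMs.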

## Text Conditioning

Cross-attention in U-Net/DiT: query from image features, key/value from text embeddings.

Classifier-free guidance: train with and without conditioning; at inference, blend: ε̂ = ε_uncond + w(ε_cond - ε_uncond)

w = 7.5 is a typical default. Higher w → stronger prompt adherence but less diversity.
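The guidance blend is a one-liner; a sketch (the function name `cfg` is illustrative):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: ε̂ = ε_uncond + w·(ε_cond − ε_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 1 recovers the plain conditional prediction;
# w > 1 extrapolates past it, away from the unconditional one.
blended = cfg(np.zeros(3), np.ones(3), w=7.5)
```

In practice this means two network evaluations per sampling step: one with the text embedding and one with a null (empty-prompt) embedding.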

## Architecture Choices

| Architecture | Used in |
|---|---|
| U-Net | DDPM, Stable Diffusion 1/2 |
| DiT (Diffusion Transformer) | Stable Diffusion 3, FLUX, Sora |