# Diffusion Models

## Forward Process

Gradually add Gaussian noise over T steps:

q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)

Reparameterized: xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε, ε~N(0,I)

where ᾱₜ = ∏ᵢ₌₁ᵗ (1-βᵢ)

As T → ∞, x_T approaches N(0, I), so sampling can start from pure noise.
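The closed-form jump to any step t can be sketched as follows (a minimal NumPy illustration; the linear β schedule and the name `q_sample` are conventional choices, not from a specific library):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear β_t schedule (DDPM-style)
alphas = 1.0 - betas                    # α_t = 1 − β_t
alpha_bars = np.cumprod(alphas)         # ᾱ_t = ∏_{i≤t} α_i

def q_sample(x0, t, eps):
    """Sample x_t directly from x_0 via the reparameterization."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

x0 = np.random.randn(4)                 # toy "image" of 4 values
eps = np.random.randn(4)
x_last = q_sample(x0, T - 1, eps)       # ᾱ_{T−1} ≈ 0, so x_{T−1} ≈ eps
```

Note that ᾱ_T shrinks toward 0, which is exactly why the chain ends near N(0, I).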

## Reverse Process

Learn to denoise: p_θ(xₜ₋₁|xₜ) = N(xₜ₋₁; μ_θ(xₜ,t), Σ_θ(xₜ,t)). In DDPM, Σ_θ is fixed to σₜ²I rather than learned.

## Training Objective (DDPM)

Simplified: predict the noise ε that was added:

L_simple = E_{t,x₀,ε} [‖ε - ε_θ(xₜ, t)‖²]

Equivalent, up to a time-dependent weighting, to denoising score matching.
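One training step under L_simple can be sketched like this (NumPy-only; `eps_model` is a stand-in for the trained network ε_θ, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    # Placeholder for the learned noise predictor ε_θ(xₜ, t).
    return np.zeros_like(x_t)

def ddpm_loss(x0):
    """One Monte Carlo sample of L_simple: random t, random ε, MSE on ε."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

loss = ddpm_loss(np.zeros(8))
```

In a real implementation, `eps_model` is a U-Net or DiT and the loss is backpropagated per minibatch.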

## DDPM Sampling

1. x_T ~ N(0, I)
2. For t = T, …, 1:
   • z ~ N(0, I) if t > 1, else z = 0
   • x_{t−1} = (1/√αₜ)(xₜ − (βₜ/√(1−ᾱₜ))·ε_θ(xₜ, t)) + σₜ·z

Typically T = 1000 steps are used for high-quality samples.
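The ancestral sampling loop can be sketched as follows (NumPy-only; `eps_model` is a placeholder for the trained network, and σₜ = √βₜ is one of the standard variance choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    # Placeholder for the trained noise predictor ε_θ.
    return np.zeros_like(x_t)

def ddpm_sample(shape):
    x = rng.standard_normal(shape)                 # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):                 # t = T, …, 1 (0-indexed)
        z = rng.standard_normal(shape) if t > 0 else 0.0
        eps = eps_model(x, t)
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = x + np.sqrt(betas[t]) * z              # σₜ = √βₜ variance choice
    return x

sample = ddpm_sample((4,))
```

This is the expensive part: one network evaluation per step, T times.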

## DDIM (Faster Sampling)

A deterministic sampler that can use 50–100 steps instead of 1000:

x_{t-1} = √ᾱ_{t-1}·x̂₀(xₜ) + √(1-ᾱ_{t-1})·ε_θ(xₜ,t)

where x̂₀(xₜ) = (xₜ − √(1−ᾱₜ)·ε_θ(xₜ,t)) / √ᾱₜ is the predicted clean image.

## Latent Diffusion Models (LDM)

Run diffusion in compressed latent space z = E(x) instead of pixel space:

  • Encoder E: image → latent (8× spatial downsampling in Stable Diffusion)
  • Diffusion: trained in latent space (far fewer spatial positions, so much cheaper per step)
  • Decoder D: latent → image

Stable Diffusion = LDM conditioned on text via CLIP embeddings.
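The shape bookkeeping above can be sketched with stand-in encoder/decoder functions (shapes follow the Stable Diffusion convention of an 8× downsampling VAE with 4 latent channels; `encode`/`decode` are shape-only placeholders, not real networks):

```python
import numpy as np

def encode(x):                      # E: image → latent
    h, w, _ = x.shape
    return np.zeros((h // 8, w // 8, 4))

def decode(z):                      # D: latent → image
    h, w, _ = z.shape
    return np.zeros((h * 8, w * 8, 3))

img = np.zeros((512, 512, 3))       # 512×512 RGB image
z = encode(img)                     # 64×64×4 latent: diffusion runs here
out = decode(z)                     # back to 512×512×3
```

Running the denoiser on a 64×64 latent instead of a 512×512 image is the entire efficiency argument for LDMs.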

## Text Conditioning

Cross-attention in U-Net/DiT: query from image features, key/value from text embeddings.

Classifier-free guidance: train with and without conditioning; at inference, blend: ε̂ = ε_uncond + w(ε_cond - ε_uncond)

w = 7.5 is a typical default. Higher w → stronger prompt adherence but less diversity.
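The guidance blend is a one-liner; a sketch (the function name `cfg` is illustrative):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: ε̂ = ε_uncond + w·(ε_cond − ε_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 1 recovers the plain conditional prediction;
# w > 1 extrapolates past it, away from the unconditional one.
blended = cfg(np.zeros(3), np.ones(3), w=7.5)
```

In practice this means two network evaluations per sampling step: one with the text embedding and one with a null (empty-prompt) embedding.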

## Architecture Choices

| Architecture | Used in |
|---|---|
| U-Net | DDPM, Stable Diffusion 1/2 |
| DiT (Diffusion Transformer) | Stable Diffusion 3, FLUX, Sora |