DDPM from Scratch

Pick a digit to generate

View: predicted x₀ (what the model thinks it's heading towards) noisy xₜ (what the sampler is actually holding)

Guidance (CFG) 1.0

DDIM steps 25

Sweet spot for reliable digit generation: CFG ≈ 1–2, 10–25 steps. Push either higher and the sample frequently converges to the right digit and then drifts into a different, often mixed shape.

Stack: Python, PyTorch. Browser port uses onnxruntime-web.

About this project

This is a from-scratch implementation of DDPM (Denoising Diffusion Probabilistic Models), the class of generative models behind Stable Diffusion and most modern image generators. The model you’re playing with above is a 15.7M-parameter U-Net trained on MNIST for 250 epochs with classifier-free guidance, and it’s running live in your browser.

Diffusion models learn to generate images by reversing a noise process. During training, we progressively corrupt real images with Gaussian noise over many small steps until they’re indistinguishable from pure noise. Then a neural network learns to predict, at every noise level, what noise was added. At inference time we flip this around: start from pure noise, ask the network “what noise do you see?”, subtract a bit of it, repeat.

The visualization above lets you watch that loop directly. The “predicted x₀” view shows the model’s best guess at the final image at every step: it starts as noise and gradually sharpens into the requested digit. The “noisy xₜ” view shows what the sampler is literally holding in memory at each step, which stays noisy until the very end. Both are valid windows into the same process, and flipping between them is a great way to build intuition for what diffusion is actually doing.

DDPM

The training objective is remarkably simple. We sample a random image $x_0$ from the dataset, a random integer timestep $t \in \{1, \dots, T\}$, and a fresh noise vector $\epsilon \sim \mathcal{N}(0, I)$. Using a closed-form expression, we jump directly to the noisy image at step $t$:

\[x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon\]

where $\bar\alpha_t$ is determined by a fixed noise schedule (this project uses the cosine schedule from Nichol & Dhariwal, which adds noise more gradually than a linear schedule, so more of the timesteps fall at low and medium noise levels). The network is trained to predict the noise $\epsilon$ that was added:

\[L = \mathbb{E}_{t, x_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t, y) \right\|^2\right]\]
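The closed-form jump and the loss above can be sketched in a few lines. This is a minimal NumPy sketch, not the project's PyTorch code: the function names, the offset `s = 0.008` (the value suggested by Nichol & Dhariwal), the batch shape, and the all-zeros stand-in for $\epsilon_\theta$ are assumptions made for illustration.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level alpha_bar_t for t = 0..T under a cosine schedule."""
    t = np.arange(T + 1)
    f = np.cos(((t / T) + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # normalise so that alpha_bar_0 = 1

def q_sample(x0, t, eps, alpha_bar):
    """Jump straight from x_0 to x_t with the closed-form forward process."""
    ab = alpha_bar[t].reshape(-1, 1, 1, 1)  # one value per sample, broadcast over (C, H, W)
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return np.mean((eps - eps_pred) ** 2)

rng = np.random.default_rng(0)
T = 200                                           # assumed schedule length
alpha_bar = cosine_alpha_bar(T)

x0 = rng.uniform(-1, 1, size=(4, 1, 28, 28))      # fake batch of MNIST-sized images
t = rng.integers(1, T + 1, size=4)                # one random timestep per image
eps = rng.standard_normal(x0.shape)               # fresh Gaussian noise
x_t = q_sample(x0, t, eps, alpha_bar)

eps_pred = np.zeros_like(eps)                     # stand-in for the network output (assumption)
loss = noise_prediction_loss(eps, eps_pred)
```

The mean squared error differs from the squared norm in $L$ only by a constant factor. Nichol & Dhariwal additionally clip the per-step $\beta_t$ to at most 0.999, which this sketch leaves out.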

Given a trained noise predictor, we can invert the forward process one step at a time. This project uses DDIM (Song et al., 2021), a non-Markovian reverse formulation that lets us take big steps and still get good samples, so a model trained on a 200-step schedule can be sampled in just 50 steps.
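A single reverse step can be sketched as follows. This assumes the deterministic variant of DDIM ($\eta = 0$) and an evenly spaced subsequence of timesteps. Both are illustrative choices rather than confirmed details of this project, and the function names are hypothetical. The intermediate `x0_hat` is the kind of quantity a "predicted x₀" view would display.

```python
import numpy as np

def predict_x0(x_t, eps_hat, ab_t):
    """Invert the closed-form forward process to estimate the clean image."""
    return (x_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)

def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """Deterministic DDIM update (eta = 0) from noise level ab_t to ab_prev."""
    x0_hat = predict_x0(x_t, eps_hat, ab_t)
    # Re-noise the x_0 estimate to the earlier level, reusing the predicted noise.
    x_prev = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return x_prev, x0_hat

def ddim_timesteps(T, num_steps):
    """Evenly spaced training timesteps, from T down to 1 (assumed spacing)."""
    return np.linspace(T, 1, num_steps).round().astype(int)

# Sanity check: with the true noise as the "prediction", a single step
# to alpha_bar = 1 (no noise) recovers the clean image.
rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(1, 28, 28))
eps = rng.standard_normal(x0.shape)
ab_t = 0.3
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps
x_prev, x0_hat = ddim_step(x_t, eps, ab_t, ab_prev=1.0)

steps = ddim_timesteps(T=200, num_steps=50)
```

In a full sampler, `eps_hat` would come from the network at each timestep in `steps`, and `ab_t`, `ab_prev` would be read from the noise schedule at consecutive entries of that list.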

Class conditioning uses classifier-free guidance (Ho & Salimans, 2022): during training the class label is randomly dropped with 10% probability, teaching the model both conditional $\epsilon_\theta(x_t, t, y)$ and unconditional $\epsilon_\theta(x_t, t, \varnothing)$ behaviors in a single network. At inference, we extrapolate toward the conditional prediction:

\[\hat\epsilon = \epsilon_\theta(x_t, t, \varnothing) + w \cdot \left(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\right)\]

The guidance slider above is exactly that $w$. At $w = 1$ you get the conditional model. Push it to 5 or 6 and samples snap to a canonical template of the digit. Drop it to 0 and you get an unconditional generation, which mostly ignores your digit choice.
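Both halves of classifier-free guidance fit in a short sketch: label dropout at training time and the extrapolation at inference. The names below, and the use of index 10 as the "no label" token, are hypothetical; how this project encodes the empty label $\varnothing$ is not stated here.

```python
import numpy as np

NULL_CLASS = 10  # hypothetical index for the "no label" token (digits use 0-9)

def drop_labels(y, rng, p_drop=0.1):
    """Training time: replace each label with NULL_CLASS with probability p_drop."""
    mask = rng.random(y.shape) < p_drop
    return np.where(mask, NULL_CLASS, y)

def cfg_combine(eps_uncond, eps_cond, w):
    """Inference time: start at the unconditional prediction and move w times
    the (conditional - unconditional) difference toward the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=8)
y_train = drop_labels(y, rng)            # roughly 10% of labels become NULL_CLASS

eps_u = rng.standard_normal((1, 28, 28))  # stand-in for eps_theta(x_t, t, null)
eps_c = rng.standard_normal((1, 28, 28))  # stand-in for eps_theta(x_t, t, y)
eps_hat = cfg_combine(eps_u, eps_c, w=2.0)
```

Setting `w=0` returns `eps_u` unchanged and `w=1` returns `eps_c`, matching the two slider endpoints described above. Each guided step needs two network evaluations, one with the label and one without.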

Resources

Explainers

3Blue1Brown & Welch Labs. But how do AI images/videos actually work?. A fantastic visual explainer of diffusion-based image generation.
Lilian Weng. What are Diffusion Models?. The go-to technical walkthrough.
Sander Dieleman. Diffusion models are autoencoders. Great intuition-building.

Papers

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
Nichol, A. & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021.
Song, J., Meng, C., & Ermon, S. (2021). Denoising Diffusion Implicit Models. ICLR 2021.
Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop.
