Sweet spot for reliable digit generation: guidance scale (CFG) ≈ 1–2 with 10–25 steps. Push either higher and the sample will often converge to the right digit, then drift into a different, frequently mixed shape.

Stack: Python, PyTorch. Browser port uses onnxruntime-web.

About this project

This is a from-scratch implementation of DDPM (Denoising Diffusion Probabilistic Models), the class of generative models behind Stable Diffusion and most modern image generators. The model you’re playing with above is a 15.7 M-parameter U-Net trained on MNIST for 250 epochs with classifier-free guidance, and it’s running live in your browser.

Diffusion models learn to generate images by reversing a noise process. During training, we progressively corrupt real images with Gaussian noise over many small steps until they’re indistinguishable from pure noise. Then a neural network learns to predict, at every noise level, what noise was added. At inference time we flip this around: start from pure noise, ask the network “what noise do you see?”, subtract a bit of it, repeat.

The visualization above lets you watch that loop directly. The “predicted x₀” view shows the model’s best guess at the final image at every step: it starts as noise and gradually sharpens into the requested digit. The “noisy xₜ” view shows what the sampler is literally holding in memory at each step, which stays noisy until the very end. Both are valid windows into the same process, and flipping between them is a great way to build intuition for what diffusion is actually doing.

DDPM

The training objective is remarkably simple. We sample a random image $x_0$ from the dataset, a random timestep $t \in \{1, \dots, T\}$, and a fresh noise vector $\epsilon \sim \mathcal{N}(0, I)$. Using a closed-form expression, we jump directly to the noisy image at step $t$:

\[x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon\]

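where $\bar\alpha_t$ is the cumulative fraction of signal remaining at step $t$, determined by a fixed noise schedule (this project uses the cosine schedule from Nichol & Dhariwal, which adds noise more gradually than a linear schedule, so fewer steps are spent on images that are already almost pure noise). The network is trained to predict the noise $\epsilon$ that was added: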

\[L = \mathbb{E}_{t, x_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t, y) \right\|^2\right]\]
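To make the pieces above concrete, here is a minimal sketch of the schedule, the closed-form forward jump, and a one-sample estimate of the loss. It uses plain Python lists so it runs without dependencies; in PyTorch the same arithmetic would act on batched tensors. The function names, the offset `s = 0.008` (the value suggested by Nichol & Dhariwal), uniform timestep sampling, and the `eps_model` callable standing in for the U-Net are all assumptions for illustration, not details taken from this project's code.

```python
import math
import random


def cosine_alpha_bar(T, s=0.008):
    """Return [alpha_bar_0, ..., alpha_bar_T] for the cosine schedule.

    s = 0.008 follows Nichol & Dhariwal; whether this project uses the
    same offset is an assumption.
    """
    def f(t):
        return math.cos(((t / T + s) / (1.0 + s)) * math.pi / 2.0) ** 2

    f0 = f(0)
    return [f(t) / f0 for t in range(T + 1)]


def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward jump: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * p + noise * e for p, e in zip(x0, eps)]


def training_loss(eps_model, x0, y, alpha_bar, rng=random):
    """One-sample Monte Carlo estimate of the noise-prediction loss."""
    T = len(alpha_bar) - 1
    t = rng.randint(1, T)                      # assumed uniform over {1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]    # fresh Gaussian noise
    x_t = q_sample(x0, alpha_bar[t], eps)
    eps_hat = eps_model(x_t, t, y)             # hypothetical stand-in for the U-Net
    # Squared L2 distance between true and predicted noise.
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat))
```

Averaging `training_loss` over many images and optimizing the network's weights against it is the whole training loop. Frameworks usually take the mean over pixels rather than the sum, which only rescales the loss by a constant.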

Given a trained noise predictor, we can invert the forward process one step at a time. This project uses DDIM (Song et al., 2020), a non-Markovian reverse formulation that lets us take big steps and still get good samples, so the 200-step training-time schedule runs at inference in just 50 steps.
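As a rough illustration of how such a sampler can skip steps, the sketch below implements the deterministic DDIM update ($\eta = 0$) over an evenly strided subset of timesteps. Each step first inverts the forward formula to estimate the clean image, $\hat{x}_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\hat\epsilon)/\sqrt{\bar\alpha_t}$, then re-noises that estimate to the earlier timestep. The choice of $\eta = 0$, the even stride, the function names, and the `eps_model` callable are assumptions; the project's actual sampler may differ. It is written in plain Python so it runs without dependencies.

```python
import math


def ddim_timesteps(T, num_steps):
    """Evenly strided, descending subset of {1, ..., T}; assumes num_steps <= T."""
    return [T * (i + 1) // num_steps for i in range(num_steps)][::-1]


def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0). Requires alpha_bar_t > 0.

    Returns (x_prev, x0_hat), where x0_hat is the current clean-image estimate.
    """
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_noise_t = math.sqrt(1.0 - alpha_bar_t)
    # Invert the forward formula to estimate the clean image.
    x0_hat = [(x - sqrt_noise_t * e) / sqrt_ab_t for x, e in zip(x_t, eps_hat)]
    # Re-noise that estimate to the earlier, less noisy timestep with the same eps_hat.
    sqrt_ab_prev = math.sqrt(alpha_bar_prev)
    sqrt_noise_prev = math.sqrt(1.0 - alpha_bar_prev)
    x_prev = [sqrt_ab_prev * p + sqrt_noise_prev * e for p, e in zip(x0_hat, eps_hat)]
    return x_prev, x0_hat


def ddim_sample(eps_model, x_T, y, alpha_bar, num_steps):
    """Run the reverse loop from x_T down to t = 0; alpha_bar[0] is assumed to be 1."""
    T = len(alpha_bar) - 1
    steps = ddim_timesteps(T, num_steps)       # e.g. [200, 196, ..., 4] for T=200, 50 steps
    x = list(x_T)
    for i, t in enumerate(steps):
        t_prev = steps[i + 1] if i + 1 < len(steps) else 0   # final jump lands on t = 0
        eps_hat = eps_model(x, t, y)           # hypothetical stand-in for the trained U-Net
        x, _x0_hat = ddim_step(x, eps_hat, alpha_bar[t], alpha_bar[t_prev])
    return x
```

The intermediate `x0_hat` is the kind of estimate the “predicted x₀” view describes, while `x` corresponds to the “noisy xₜ” state. Because the update depends only on $\bar\alpha_t$ and $\bar\alpha_{t_\text{prev}}$, nothing forces consecutive timesteps to be adjacent, which is what makes the larger strides possible.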

Class conditioning uses classifier-free guidance (Ho & Salimans, 2022): during training the class label is randomly dropped with 10% probability, teaching the model both conditional $\epsilon_\theta(x_t, t, y)$ and unconditional $\epsilon_\theta(x_t, t, \varnothing)$ behaviors in a single network. At inference, we extrapolate toward the conditional prediction:

\[\hat\epsilon = \epsilon_\theta(x_t, t, \varnothing) + w \cdot \left(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\right)\]

The guidance slider above is exactly that $w$. At $w = 1$ the unconditional terms cancel and you get the plain conditional model. Push it to 5 or 6 and samples snap to a canonical template of the digit. Drop it to 0 and the conditional term vanishes, so you get an unconditional generation that ignores your digit choice.
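Both halves of classifier-free guidance fit in a few lines. The sketch below shows training-time label dropout and the inference-time combination of the two predictions, in plain Python so it runs without dependencies. `NULL_LABEL`, the function names, and the `eps_model` callable are hypothetical; how this project actually encodes the "no class" input is not stated here.

```python
import random

NULL_LABEL = None  # hypothetical stand-in for the "no class" input


def maybe_drop_label(y, p_uncond=0.1, rng=random):
    """Training-time dropout: replace y with the null label with probability p_uncond."""
    return NULL_LABEL if rng.random() < p_uncond else y


def guided_eps(eps_model, x_t, t, y, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_cond = eps_model(x_t, t, y)
    eps_uncond = eps_model(x_t, t, NULL_LABEL)
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    # Toy model: a fixed "unconditional" output and a different "conditional" one.
    def toy(x_t, t, y):
        return [1.0, 1.0] if y is NULL_LABEL else [3.0, 5.0]

    for w in (0.0, 1.0, 2.0):
        print(w, guided_eps(toy, [0.0, 0.0], 10, 7, w))
```

With the toy model, `w = 0` returns the unconditional output, `w = 1` returns the conditional one, and `w = 2` extrapolates past it. Note that guidance costs two network evaluations per sampling step, one with the label and one without.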

Resources

Explainers

Papers