In the notebook, we instantiate DeepFloyd's stage_1 and stage_2 objects used for generation, as well as several text prompts for sample generation.
For a small num_inference_steps (for example, 5), the output fails to provide sufficient detail and contains a lot of noise; the descriptors from the prompt are poorly reflected in the output. As we increase num_inference_steps, more details are showcased and the output aligns better with the content of the prompt.
The random seed used here is 1998.
A key part of diffusion is the forward process, which takes a clean image and adds noise to it. The forward process is defined by:
$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)\mathbf{I})$$
which is equivalent to computing
$$x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \mathbf{I})$$
That is, given a clean image \( x_0 \), we get a noisy image \( x_t \) at timestep \( t \) by sampling from a Gaussian with mean \( \sqrt{\bar{\alpha}_t} x_0 \) and variance \( (1-\bar{\alpha}_t)\mathbf{I} \). Note that the forward process is not just adding noise: we also scale the image.
In this part, we use the alphas_cumprod
variable, which contains the \( \bar{\alpha}_t \)
for all \( t \in [0, 999] \). \( t=0 \) corresponds to a clean image and larger \(t\) corresponds to more noise.
Thus, \( \bar{\alpha}_t \) is close to 1 for small \( t \), and close to 0 for larger \( t \).
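The forward process above can be sketched in a few lines of PyTorch. Here `alphas_cumprod` is passed in explicitly, and the function name `forward` is an assumption for illustration, not the notebook's exact signature:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image by sqrt(abar_t)
    and add Gaussian noise scaled by sqrt(1 - abar_t).

    im: clean image tensor x_0
    t: integer timestep in [0, 999]
    alphas_cumprod: 1-D tensor of cumulative alpha products (bar-alpha_t)
    """
    abar = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, I)
    return abar.sqrt() * im + (1 - abar).sqrt() * eps
```

Note that at \( t=0 \), where \( \bar{\alpha}_t \approx 1 \), the noise term vanishes and the image passes through essentially unchanged.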
In this part, we take noisy images at timesteps [250, 500, 750] and apply Gaussian blur filtering to try to remove the noise.
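As a classical baseline, a Gaussian blur can be implemented directly as a depthwise convolution. This sketch (names assumed) builds a normalized 2-D Gaussian kernel and applies it to each channel:

```python
import torch
import torch.nn.functional as F

def gaussian_blur(im, ksize=5, sigma=1.0):
    """Blur an (N, C, H, W) image batch with a normalized Gaussian kernel,
    applied per channel via a grouped (depthwise) convolution."""
    ax = torch.arange(ksize) - ksize // 2
    g1 = torch.exp(-ax.float() ** 2 / (2 * sigma ** 2))
    k = torch.outer(g1, g1)                      # separable -> 2-D kernel
    k = (k / k.sum()).view(1, 1, ksize, ksize)   # normalize to sum to 1
    c = im.shape[1]
    k = k.expand(c, 1, ksize, ksize)             # one kernel per channel
    return F.conv2d(im, k, padding=ksize // 2, groups=c)
```

As the results show, blurring only trades noise for lost detail: it cannot recover the structure that the forward process destroyed.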
Now, we use a pretrained diffusion model to denoise. The actual denoiser can be found at stage_1.unet.
This is a UNet that has already been trained on a very very large dataset of (\(x_0, x_t\)) pairs of images.
We can use it to estimate the Gaussian noise in the image. Then, we can remove this noise to recover (something close to)
the original image.
The text prompt embedding used here is "a high quality photo".
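Rearranging the forward-process equation gives the one-step estimate of the clean image from the UNet's noise prediction. The function name `estimate_x0` and the explicit `alphas_cumprod` argument are assumptions for illustration:

```python
import torch

def estimate_x0(x_t, eps_hat, t, alphas_cumprod):
    """Invert x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps to recover
    an estimate of x_0, given the noise prediction eps_hat at timestep t."""
    abar = alphas_cumprod[t]
    return (x_t - (1 - abar).sqrt() * eps_hat) / abar.sqrt()
```

If `eps_hat` were the exact noise that was added, this would recover \( x_0 \) perfectly; in practice the UNet's estimate is imperfect, so the result is only close to the original image.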
Diffusion models are designed to denoise iteratively.
Another thing we can do with the iterative_denoise function is to generate images from scratch. We can do this by setting i_start = 0 and passing in random noise. This effectively denoises pure noise. The prompt used here is "a high quality photo".
In order to greatly improve image quality (at the expense of image diversity), we can use a technique called Classifier-Free Guidance.
In CFG, we compute both a conditional and an unconditional noise estimate. We denote these \(\epsilon_c\) and \(\epsilon_u\). Then we let our new noise estimate be:
$$\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$$
where \(\gamma\) controls the strength of CFG. Notice that for \(\gamma=0\), we get an unconditional noise estimate, and for \(\gamma=1\) we get the conditional noise estimate. The magic happens when \(\gamma>1\). In this case, we get much higher quality images.
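The CFG combination is a one-liner; the function name here is illustrative:

```python
def cfg_noise_estimate(eps_u, eps_c, gamma):
    """Classifier-free guidance: extrapolate from the unconditional noise
    estimate eps_u toward the conditional one eps_c.
    gamma = 0 -> unconditional, gamma = 1 -> conditional,
    gamma > 1 -> amplified conditioning (higher quality, less diversity)."""
    return eps_u + gamma * (eps_c - eps_u)
```

For \(\gamma > 1\) this pushes the estimate past \(\epsilon_c\), exaggerating the direction the prompt conditioning pulls in.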
Here, we run the forward process to get a noisy test image, then run the iterative_denoise_cfg function using starting indices of [1, 3, 5, 7, 10, 20].
We can use the same procedure to implement inpainting. That is, given an image \(x_{orig}\) and a binary mask \(m\), we can create a new image that has the same content where \(m\) is 0, but new content where \(m\) is 1.
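One way to sketch the inpainting trick: after every denoising step, re-noise the original image to the current timestep and paste it back wherever the mask is 0, so only the masked region is generated. Names and signature here are assumptions:

```python
import torch

def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Force the known region after a denoising step.

    Where mask == 0, overwrite x_t with a correctly-noised copy of the
    original image; where mask == 1, keep the generated content."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x_orig)
    x_orig_t = abar.sqrt() * x_orig + (1 - abar).sqrt() * eps  # forward process
    return mask * x_t + (1 - mask) * x_orig_t
```

Because the pasted-in region is noised to exactly the current timestep, the next denoising step sees a consistent image and blends the boundary naturally.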
The first part is to build a simple one-step denoiser. Given a noisy image \( z \), we aim to train a denoiser \( D_\theta \) that maps \( z \) to a clean image \( x \). To do so, we can optimize over an L2 loss:
$$L = \mathbb{E}_{z,x} \| D_\theta(z) - x \|^2$$
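The L2 objective translates directly into code (`l2_loss` is an assumed name):

```python
def l2_loss(denoiser, z, x):
    """L2 training objective: mean squared error between D_theta(z)
    and the clean target x."""
    return ((denoiser(z) - x) ** 2).mean()
```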
In this part, we implement the denoiser as a UNet. It consists of a few downsampling and upsampling blocks with skip connections, which uses a number of standard tensor operations.
To train our denoiser, we need to generate training data pairs of (\( z \), \( x \)), where each \( x \) is a clean MNIST digit. For each training batch, we can generate \( z \) from \( x \) using the following noising process:
$$z = x + \sigma \epsilon, \quad \text{where } \epsilon \sim N(0, I).$$
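This noising process is a one-liner in PyTorch (function name assumed):

```python
import torch

def add_noise(x, sigma):
    """Generate a noisy training input z = x + sigma * eps, eps ~ N(0, I).
    Unlike the diffusion forward process, x is not scaled here."""
    return x + sigma * torch.randn_like(x)
```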
Visualize the different noising processes over \( \sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] \), assuming normalized \( x \in [0, 1] \).
Our denoiser was trained on MNIST digits noised with \( \sigma=0.5 \). Visualize the denoiser results on test set digits with varying levels of noise
$$\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$$
We need a way to inject scalar \( t \) into our UNet model to condition it.
Basically, we pick a random image from the training set and a random \( t \), noise the image to \( x_t \), and train the denoiser to predict the noise that was added. We repeat this for different images and different \( t \) values until the model converges and we are happy.
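The training procedure can be sketched as a single step, assuming a hypothetical `model(x_t, t)` signature and 4-D image batches:

```python
import torch

def train_step(model, x0, alphas_cumprod, opt):
    """One step of training a time-conditioned denoiser: noise a clean
    batch x0 at random timesteps, then regress the model's output onto
    the injected noise with an L2 loss."""
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))             # one t per image
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)          # broadcast over (N, C, H, W)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps    # forward process
    loss = ((model(x_t, t) - eps) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Sampling many random \( t \) values per batch is what lets a single UNet learn to denoise at every noise level.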
To make the results better and give us more control for image generation, we can also optionally condition our UNet on the class of the digit 0-9.
The sampling process is the same as in part A, where we saw that conditional results aren't good unless we use classifier-free guidance. Use classifier-free guidance with \( \gamma=5.0 \) for this part.