In the notebook, we instantiate DeepFloyd's stage_1 and stage_2 objects used for generation, as well as several text prompts for sample generation.
For a small num_inference_steps (for example, 5), the output fails to provide sufficient detail and contains a lot of noise; the descriptors from the prompt are poorly reflected in the output. As we increase num_inference_steps, more details are showcased and the output aligns better with the content of the prompt.
The random seed used here is 1998.
A key part of diffusion is the forward process, which takes a clean image and adds noise to it. The forward process is defined by:
$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)\mathbf{I})$$
which is equivalent to computing
$$x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \mathbf{I})$$
That is, given a clean image \( x_0 \), we get a noisy image \( x_t \) at timestep \( t \) by sampling from a Gaussian with mean \( \sqrt{\bar{\alpha}_t} x_0 \) and variance \( (1-\bar{\alpha}_t)\mathbf{I} \). Note that the forward process is not just adding noise: we also scale the image.
In this part, we use the alphas_cumprod
variable, which contains the \( \bar{\alpha}_t \)
for all \( t \in [0, 999] \). \( t=0 \) corresponds to a clean image and larger \(t\) corresponds to more noise.
Thus, \( \bar{\alpha}_t \) is close to 1 for small \( t \), and close to 0 for larger \( t \).
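The forward process above can be sketched in a few lines of PyTorch. Here `alphas_cumprod` is passed in explicitly, and the function name `forward` is an assumption for illustration, not the notebook's exact signature:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image by sqrt(abar_t)
    and add Gaussian noise scaled by sqrt(1 - abar_t).

    im: clean image tensor x_0
    t: integer timestep in [0, 999]
    alphas_cumprod: 1-D tensor of cumulative alpha products (bar-alpha_t)
    """
    abar = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, I)
    return abar.sqrt() * im + (1 - abar).sqrt() * eps
```

Note that at \( t=0 \), where \( \bar{\alpha}_t \approx 1 \), the noise term vanishes and the image passes through essentially unchanged.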
In this part, we take noisy images at timesteps [250, 500, 750] and apply Gaussian blur filtering to try to remove the noise.
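As a classical baseline, a Gaussian blur can be implemented directly as a depthwise convolution. This sketch (names assumed) builds a normalized 2-D Gaussian kernel and applies it to each channel:

```python
import torch
import torch.nn.functional as F

def gaussian_blur(im, ksize=5, sigma=1.0):
    """Blur an (N, C, H, W) image batch with a normalized Gaussian kernel,
    applied per channel via a grouped (depthwise) convolution."""
    ax = torch.arange(ksize) - ksize // 2
    g1 = torch.exp(-ax.float() ** 2 / (2 * sigma ** 2))
    k = torch.outer(g1, g1)                      # separable -> 2-D kernel
    k = (k / k.sum()).view(1, 1, ksize, ksize)   # normalize to sum to 1
    c = im.shape[1]
    k = k.expand(c, 1, ksize, ksize)             # one kernel per channel
    return F.conv2d(im, k, padding=ksize // 2, groups=c)
```

As the results show, blurring only trades noise for lost detail: it cannot recover the structure that the forward process destroyed.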
Now, we use a pretrained diffusion model to denoise. The actual denoiser can be found at stage_1.unet.
This is a UNet that has already been trained on a very very large dataset of (\(x_0, x_t\)) pairs of images.
We can use it to estimate the Gaussian noise in the image. Then, we can remove this noise to recover (something close to)
the original image.
The text prompt embedding used here is "a high quality photo".
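Rearranging the forward-process equation gives the one-step estimate of the clean image from the UNet's noise prediction. The function name `estimate_x0` and the explicit `alphas_cumprod` argument are assumptions for illustration:

```python
import torch

def estimate_x0(x_t, eps_hat, t, alphas_cumprod):
    """Invert x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps to recover
    an estimate of x_0, given the noise prediction eps_hat at timestep t."""
    abar = alphas_cumprod[t]
    return (x_t - (1 - abar).sqrt() * eps_hat) / abar.sqrt()
```

If `eps_hat` were the exact noise that was added, this would recover \( x_0 \) perfectly; in practice the UNet's estimate is imperfect, so the result is only close to the original image.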
Diffusion models are designed to denoise iteratively.
Another thing we can do with the iterative_denoise function is to generate images from scratch. We can do this by setting i_start = 0 and passing in random noise. This effectively denoises pure noise. The prompt used here is "a high quality photo".
In order to greatly improve image quality (at the expense of image diversity), we can use a technique called Classifier-Free Guidance.
In CFG, we compute both a conditional and an unconditional noise estimate. We denote these \(\epsilon_c\) and \(\epsilon_u\). Then we let our new noise estimate be:
$$\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$$
where \(\gamma\) controls the strength of CFG. Notice that for \(\gamma=0\), we get an unconditional noise estimate, and for \(\gamma=1\) we get the conditional noise estimate. The magic happens when \(\gamma>1\). In this case, we get much higher quality images.
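The CFG combination is a one-liner; the function name here is illustrative:

```python
def cfg_noise_estimate(eps_u, eps_c, gamma):
    """Classifier-free guidance: extrapolate from the unconditional noise
    estimate eps_u toward the conditional one eps_c.
    gamma = 0 -> unconditional, gamma = 1 -> conditional,
    gamma > 1 -> amplified conditioning (higher quality, less diversity)."""
    return eps_u + gamma * (eps_c - eps_u)
```

For \(\gamma > 1\) this pushes the estimate past \(\epsilon_c\), exaggerating the direction the prompt conditioning pulls in.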
Here, we run the forward process to get a noisy test image, then run the iterative_denoise_cfg function using starting indices of [1, 3, 5, 7, 10, 20].
We can use the same procedure to implement inpainting. That is, given an image \(x_{orig}\) and a binary mask \(m\), we can create a new image that has the same content where \(m\) is 0, but new content where \(m\) is 1.
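One way to sketch the inpainting trick: after every denoising step, re-noise the original image to the current timestep and paste it back wherever the mask is 0, so only the masked region is generated. Names and signature here are assumptions:

```python
import torch

def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Force the known region after a denoising step.

    Where mask == 0, overwrite x_t with a correctly-noised copy of the
    original image; where mask == 1, keep the generated content."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x_orig)
    x_orig_t = abar.sqrt() * x_orig + (1 - abar).sqrt() * eps  # forward process
    return mask * x_t + (1 - mask) * x_orig_t
```

Because the pasted-in region is noised to exactly the current timestep, the next denoising step sees a consistent image and blends the boundary naturally.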
The first part is to build a simple one-step denoiser. Given a noisy image \( z \), we aim to train a denoiser \( D_\theta \) that maps \( z \) to a clean image \( x \). To do so, we can optimize over an L2 loss:
$$L = \mathbb{E}_{z,x} \| D_\theta(z) - x \|^2$$
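The L2 objective translates directly into code (`l2_loss` is an assumed name):

```python
def l2_loss(denoiser, z, x):
    """L2 training objective: mean squared error between D_theta(z)
    and the clean target x."""
    return ((denoiser(z) - x) ** 2).mean()
```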
In this part, we implement the denoiser as a UNet. It consists of a few downsampling and upsampling blocks with skip connections, which uses a number of standard tensor operations.
To train our denoiser, we need to generate training data pairs of (\( z \), \( x \)), where each \( x \) is a clean MNIST digit. For each training batch, we can generate \( z \) from \( x \) using the following noising process:
$$z = x + \sigma \epsilon, \quad \text{where } \epsilon \sim N(0, I).$$
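This noising process is a one-liner in PyTorch (function name assumed):

```python
import torch

def add_noise(x, sigma):
    """Generate a noisy training input z = x + sigma * eps, eps ~ N(0, I).
    Unlike the diffusion forward process, x is not scaled here."""
    return x + sigma * torch.randn_like(x)
```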
Visualize the different noising processes over \( \sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] \), assuming normalized \( x \in [0, 1] \).
Our denoiser was trained on MNIST digits noised with \( \sigma=0.5 \). Visualize the denoiser results on test set digits with varying levels of noise
$$\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$$
We need a way to inject scalar \( t \) into our UNet model to condition it.
Basically, we pick a random image from the training set and a random \( t \), noise the image to \( x_t \), and train the denoiser to predict the noise that was added. We repeat this for different images and different \( t \) values until the model converges and we are happy.
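The training procedure can be sketched as a single step, assuming a hypothetical `model(x_t, t)` signature and 4-D image batches:

```python
import torch

def train_step(model, x0, alphas_cumprod, opt):
    """One step of training a time-conditioned denoiser: noise a clean
    batch x0 at random timesteps, then regress the model's output onto
    the injected noise with an L2 loss."""
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))             # one t per image
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)          # broadcast over (N, C, H, W)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps    # forward process
    loss = ((model(x_t, t) - eps) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Sampling many random \( t \) values per batch is what lets a single UNet learn to denoise at every noise level.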
To make the results better and give us more control for image generation, we can also optionally condition our UNet on the class of the digit 0-9.
The sampling process is the same as in part A, where we saw that conditional results aren't good unless we use classifier-free guidance. Use classifier-free guidance with \( \gamma=5.0 \) for this part.