CS180 Project 5

Fun with Diffusion Models

Kevin Yee

Overview

In this project, we work with diffusion models: first sampling from a pretrained model in several ways, then training our own denoiser from scratch.

Part 0: Setup

For this part I made a HuggingFace account, got my API token, and ran the first few cells to set up the environment and download the model, followed by generating a few images.
The first set of images uses num_inference_steps set to 20, and the second set uses num_inference_steps set to 30.

First three generated images 20 steps First three generated images 30 steps (snowy mountain) First three generated images 30 steps (man wearing a hat) First three generated images 30 steps (rocket ship)

To me, the quality of the "man wearing a hat" image increased substantially, going from an AI-esque generated image to what I would call a passable, if somewhat artistically filtered, image of a man wearing a hat. The snowy mountain still looks quite stylized, with the strange AI colorization present in both generations, while the rocket ship seems to have marginally improved from clip-art to a flash-game-worthy sprite (with extra-special booster effects).
The random seed I'm using for everything is 137347080577163115.

Part 1: Sampling Loops

1.1 Implementing the Forward Process

For this part, I implemented equation A.2 in the forward function: take the cumulative product of the alphas to get alpha bar, scale the clean image by its square root, and add in a scaled noise component:
x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where eps ~ N(0, I).
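As a sanity check, the forward step can be sketched in NumPy (the project uses torch; the beta schedule below is a toy DDPM-style schedule I made up for illustration, not the model's actual one):

```python
import numpy as np

# toy DDPM-style schedule, for illustration only (the real one comes from the model)
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def forward(x0, t, rng):
    """Equation A.2: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))                # stand-in for a clean image
xt, eps = forward(x0, 750, rng)     # heavily noised copy
```

Because the equation is linear, knowing the exact eps lets you invert it and recover x_0, which is the idea section 1.3 builds on.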

Campanile Noise at 250 Campanile Noise at 500 Campanile Noise at 750

1.2 Classical Denoising

I ran torchvision.transforms.functional.gaussian_blur over the noisy image, with a square kernel of size kernel_size by kernel_size and a sigma of 1.0.
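A minimal NumPy equivalent of that separable Gaussian blur (assuming edge padding; the report itself used the torchvision call) looks like:

```python
import numpy as np

def gaussian_kernel1d(kernel_size, sigma=1.0):
    # symmetric 1-D Gaussian, normalized to sum to 1
    x = np.arange(kernel_size) - (kernel_size - 1) / 2.0
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, kernel_size=5, sigma=1.0):
    """Separable Gaussian blur, matching torchvision's kernel_size/sigma arguments."""
    k = gaussian_kernel1d(kernel_size, sigma)
    pad = kernel_size // 2
    out = np.pad(img, pad, mode="edge")
    # convolve every row, then every column, with the 1-D kernel
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, out)
    return out
```

Blurring averages out some of the Gaussian noise, but it also destroys image detail, which is why the results below look smeared rather than restored.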

Noisy 250

Noisy 250

Gaussian 250

Gaussian 250

Noisy 500

Noisy 500

Gaussian 500

Gaussian 500

Noisy 750

Noisy 750

Gaussian 750

Gaussian 750

1.3 One-Step Denoising

One-step denoising, if I understand it correctly and didn't just follow directions, takes the noise the UNet estimates was added to the image and removes it in a single step. It's not quite the same thing as removing the true noise: the UNet only produces an estimate, so what gets subtracted is the predicted noise, and the result is an approximation of the original image.
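Concretely, this just inverts the forward equation from 1.1 using the noise estimate; a small sketch (abar_t would come from the model's schedule):

```python
import numpy as np

def one_step_denoise(xt, eps_hat, abar_t):
    """Invert the forward equation with the UNet's noise estimate:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t).
    If eps_hat were the exact noise that was added, this recovers x0 exactly;
    in practice it is only an estimate, so the result is a blurry guess."""
    return (xt - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
```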

Noisy 250

Noisy 250

One-Step Denoised 250

One-Step Denoised 250

Noisy 500

Noisy 500

One-Step Denoised 500

One-Step Denoised 500

Noisy 750

Noisy 750

One-Step Denoised 750

One-Step Denoised 750

1.4 Iterative Denoising

So iterative denoising: noise is first added to the image with the forward process, and then the UNet is run over a strided list of timesteps. At each timestep it predicts the noise, we form an estimate of the clean image, and we step to a slightly less noisy image, repeating until t = 0.
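The update I used follows the handout's strided DDPM formula; a deterministic NumPy sketch (x0_predict is an assumed stand-in for "run the UNet, then one-step denoise" from 1.3, the schedule is a toy one, and the small added-noise term v_sigma from the handout is left out):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)

def iterative_denoise(x, x0_predict, timesteps):
    """Strided DDPM update: blend the clean-image estimate with the current
    noisy image, stepping from timestep t down to t_prev each iteration."""
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        a = abar[t] / abar[t_prev]     # alpha for this stride
        b = 1.0 - a                    # beta for this stride
        x = (np.sqrt(abar[t_prev]) * b / (1.0 - abar[t])) * x0_predict(x, t) \
          + (np.sqrt(a) * (1.0 - abar[t_prev]) / (1.0 - abar[t])) * x
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))                  # pretend clean image
timesteps = list(range(990, -1, -30))             # 990, 960, ..., 0
x_start = np.sqrt(abar[990]) * x0 + np.sqrt(1.0 - abar[990]) * rng.standard_normal(x0.shape)
# oracle predictor that always returns the true clean image, just to exercise the loop
result = iterative_denoise(x_start, lambda x, t: x0, timesteps)
```

With an oracle predictor the loop walks the heavily noised start almost all the way back to the clean image, which is the behavior the real UNet approximates.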

Noisy 690

t = 690

Noisy 540

t = 540

Noisy 390

t = 390

Noisy 240

t = 240

Noisy 90

t = 90

Iteratively Denoised

Iteratively Denoised Campanile

One Step Denoised

One Step Denoised Campanile

Gaussian Denoised

Gaussian Blurred Campanile

1.5 Diffusion Model Sampling

This part generates images from scratch: instead of adding noise to an existing image, we set i_start = 0 and feed pure random noise from torch.randn into the iterative denoising loop.

Diffusion Sample 1 Diffusion Sample 2 Diffusion Sample 3 Diffusion Sample 4 Diffusion Sample 5

1.6 Classifier-Free Guidance

So for this part, two noise estimates are computed: a conditional one, epsilon_conditional, and an unconditional one, epsilon_unconditional. The final estimate takes the unconditional estimate and adds the difference of the conditional and unconditional, multiplied by a gamma factor (the strength of the CFG). Setting gamma above 1 noticeably improves image quality.
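The formula itself is tiny; as a sketch (names are mine, not the notebook's):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: start from the unconditional estimate and move
    gamma times along the direction the text condition adds."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

Note that gamma = 1 recovers the plain conditional estimate and gamma = 0 the unconditional one; the interesting regime is gamma > 1, which over-emphasizes the condition.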

CFG Sample 1 CFG Sample 2 CFG Sample 3 CFG Sample 4 CFG Sample 5

1.7 Image-to-Image Translation

So for this part, we add noise to the original image and then iteratively denoise with CFG. More noise means a bigger edit: the model has more room to hallucinate and be creative.


Coit Tower

Coit Tower Edit 1

Noise Level 1

Coit Tower Edit 3

Noise Level 3

Coit Tower Edit 5

Noise Level 5

Coit Tower Edit 7

Noise Level 7

Coit Tower Edit 10

Noise Level 10

Coit Tower Edit 20

Noise Level 20

Coit Tower OG

Golden Gate

Golden Gate Edit 1

Noise Level 1

Golden Gate Edit 3

Noise Level 3

Golden Gate Edit 5

Noise Level 5

Golden Gate Edit 7

Noise Level 7

Golden Gate Edit 10

Noise Level 10

Golden Gate Edit 20

Noise Level 20

Golden Gate OG

Campanile

Campanile Edit 1

Noise Level 1

Campanile Edit 3

Noise Level 3

Campanile Edit 5

Noise Level 5

Campanile Edit 7

Noise Level 7

Campanile Edit 10

Noise Level 10

Campanile Edit 20

Noise Level 20

Campanile OG

1.7.1 Editing Hand-Drawn and Web Images

The hand-drawn section is more or less the same procedure, where we try to pull the poorly drawn cursor images back onto the "a high quality photo" manifold.


Avocado (Internet Image)

Avocado Level 1

Noise Level 1

Avocado Level 3

Noise Level 3

Avocado Level 5

Noise Level 5

Avocado Level 7

Noise Level 7

Avocado Level 10

Noise Level 10

Avocado Level 20

Noise Level 20

Avocado Original

Smiley Face Drawing

Smiley Level 1

Noise Level 1

Smiley Level 3

Noise Level 3

Smiley Level 5

Noise Level 5

Smiley Level 7

Noise Level 7

Smiley Level 10

Noise Level 10

Smiley Level 20

Noise Level 20


House Drawing

House Draw 1

Noise Level 1

House Draw 3

Noise Level 3

House Draw 5

Noise Level 5

House Draw 7

Noise Level 7

House Draw 10

Noise Level 10

House Draw 20

Noise Level 20

House Original

Original Drawing

1.7.2 Inpainting

So for inpainting, there's a binary mask of 0s and 1s, where 0 marks the area to keep unchanged and 1 marks the area to inpaint/replace with new information. The mask is applied to the noisy image x_t at each timestep t to ensure that areas outside the mask stay the same.
At each timestep I followed the equation x_t <- m * x_t + (1 - m) * forward(x_orig, t), where m is the mask.
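That per-timestep masking step can be sketched like this (toy schedule again, names are mine):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def inpaint_step(xt, x_orig, mask, t, rng):
    """After each denoising step, force everything outside the mask (mask == 0)
    back to a freshly noised copy of the original image:
    x_t <- m * x_t + (1 - m) * forward(x_orig, t)."""
    abar = alphas_cumprod[t]
    noised = np.sqrt(abar) * x_orig + np.sqrt(1.0 - abar) * rng.standard_normal(x_orig.shape)
    return mask * xt + (1.0 - mask) * noised
```

Re-noising the original (rather than pasting it in clean) keeps the kept region at the same noise level as the rest of x_t, so the two regions blend at the boundary.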


Inpainted Campanile

Campanile

Inpainted Coit Tower

Coit Tower

Inpainted Golden Gate

Golden Gate

1.7.3 Text-Conditional Image-to-Image Translation

Alright, same thing as the SDEdit, but with the prompt "a rocket ship" instead: at low noise levels the output is essentially a rocket ship, and it looks more and more like the Campanile (and a few other landmarks) as the noise level increases.

Campanile

Conditional Campanile 1

Noise Level 1

Conditional Campanile 3

Noise Level 3

Conditional Campanile 5

Noise Level 5

Conditional Campanile 7

Noise Level 7

Conditional Campanile 10

Noise Level 10

Conditional Campanile 20

Noise Level 20

Conditional Campanile Original

Campanile


Coit Tower

Conditional Coit 1

Noise Level 1

Conditional Coit 3

Noise Level 3

Conditional Coit 5

Noise Level 5

Conditional Coit 7

Noise Level 7

Conditional Coit 10

Noise Level 10

Conditional Coit 20

Noise Level 20

Conditional Coit Original

Coit


Golden Gate

Conditional Golden Gate 1

Noise Level 1

Conditional Golden Gate 3

Noise Level 3

Conditional Golden Gate 5

Noise Level 5

Conditional Golden Gate 7

Noise Level 7

Conditional Golden Gate 10

Noise Level 10

Conditional Golden Gate 20

Noise Level 20

Conditional Golden Gate Original

Golden Gate

1.8 Visual Anagrams

For this part, we start with a noisy image. The noise estimate epsilon1 is computed on the right-side-up image with the first prompt p1; then the noisy image is flipped upside-down, a second estimate is computed with the second prompt p2, and that estimate is flipped back, giving epsilon2. The two are averaged, (epsilon1 + epsilon2) / 2, and this average is used to complete the reverse diffusion step.
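A sketch of the flip-and-average step (predict(x, prompt) is an assumed stand-in for the UNet's CFG'd noise estimate):

```python
import numpy as np

def anagram_noise(xt, predict, p1, p2):
    """Visual anagram noise estimate: eps1 is estimated right-side up with
    prompt p1; eps2 is estimated on the flipped image with prompt p2 and then
    flipped back, so both estimates are in the same orientation before averaging."""
    eps1 = predict(xt, p1)
    eps2 = np.flipud(predict(np.flipud(xt), p2))
    return (eps1 + eps2) / 2.0
```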

Campfire

Campfire

Old Man

Old Man

Snowy Mountain

Snowy Mountain Village

Campfire

Campfire

Mountain Village

Snowy Mountain Village

Old Man

Old Man

1.9 Hybrid Images

Factorized diffusion works similarly, with two noise estimates: epsilon1 gets a lowpass filter to retain the low frequencies, while epsilon2 gets a highpass filter to retain the high frequencies. These two get added and the sum is used as the noise estimate. I ran this a bunch of times and didn't get super great results for any of them. I think the better results come from subjects that are more abstract, or perhaps have a bit more variance in the manifold, since a lot of the time I just got a landscape with a skull slapped in the middle of it.
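The combination step, sketched with a simple box blur standing in for the Gaussian lowpass (the highpass is just "original minus lowpass"):

```python
import numpy as np

def box_lowpass(img, k=9):
    # simple separable box blur as a stand-in for the Gaussian lowpass
    kern = np.ones(k) / k
    pad = k // 2
    out = np.pad(img, pad, mode="edge")
    out = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="valid"), 1, out)
    return np.apply_along_axis(lambda c: np.convolve(c, kern, mode="valid"), 0, out)

def hybrid_noise(eps1, eps2, k=9):
    """Factorized diffusion: keep the low frequencies of eps1 and the high
    frequencies of eps2, then sum them into one noise estimate."""
    return box_lowpass(eps1, k) + (eps2 - box_lowpass(eps2, k))
```

The result steers the reverse process so the image reads as one prompt from far away (low frequencies) and the other up close (high frequencies).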

Skull Waterfall

Skull Waterfall

Old Man

Coast Snowy

Pencil Rocket

Pencil Rocket

Part B: Training a Denoising UNet

So for this part, a denoiser was implemented as a UNet, which mostly consisted of carefully reading the diagram and documentation for the blocks and implementing them as classes: Conv blocks that change the channel dimension, DownConv which downsamples, UpConv which upsamples, Flatten which takes a 7x7 tensor down to 1x1, Unflatten which upsamples from 1x1 back to 7x7, and Concat which concatenates tensors channel-wise.
To train the model, we were given an L2 loss to optimize, where noise is applied to the training images on the fly. The noisy images are passed through the UNet for predictions, followed by an MSE calculation between the predicted clean image and the actual clean image. The gradient of the loss with respect to the model's parameters is computed, and then the parameters are updated. This is repeated over batches and epochs.
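The training recipe can be sketched with a toy linear "denoiser" in NumPy (the real model is the conv UNet described above, trained with torch; the shapes, noise level, and learning rate here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16
W = rng.standard_normal((d, d)) * 0.01     # toy stand-in for the UNet's parameters
x_clean = rng.standard_normal((256, d))    # toy stand-in for the clean training images
sigma, lr = 0.5, 0.5

losses = []
for epoch in range(20):
    # noise is applied on the fly each pass, as in the project
    x_noisy = x_clean + sigma * rng.standard_normal(x_clean.shape)
    pred = x_noisy @ W.T                   # "denoised" prediction
    err = pred - x_clean
    losses.append((err**2).mean())         # MSE between prediction and clean data
    grad = 2.0 * err.T @ x_noisy / x_clean.shape[0] / d   # dLoss/dW by hand
    W -= lr * grad                         # gradient step (Adam in the real project)
```

The same loop structure (noise on the fly, forward pass, MSE against the clean target, gradient step) is what the UNet training does, just with convolutions and an optimizer instead of a hand-derived gradient.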

Figure 3 Epoch 1

Epoch 1

Epoch 5

Epoch 5

Training Loss Curve

Varying Noise Levels