CS 180 Project 5: Diffusion Models

By [Your Name]

Overview

This project explores the implementation and applications of diffusion models, from basic noise addition to advanced image manipulation techniques. Using a random seed of 180, we investigated various aspects of diffusion models including denoising, sampling, and image translation.

Setup

Initial experiments with different inference steps demonstrated the model's capabilities. Looking at the three sets of images with different inference steps (5, 20, and 50), we can observe that with more inference steps, the generated images become significantly clearer and more detailed. The progression shows that with 5 steps the images appear rough and noisy, but as we increase to 20 and then 50 steps, the images gain better definition, smoother transitions, and more refined details, particularly visible in the winter landscape scenes, portrait photos, and rocket illustrations

5 Steps

5 steps inference 5 steps inference 5 steps inference

20 Steps

20 steps inference 20 steps inference 20 steps inference

50 Steps

50 steps inference 50 steps inference 50 steps inference

1.1 Forward Process

Implemented the forward process using the equation:

\[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon \text{ where } \epsilon \sim N(0,1) \]

The forward function systematically corrupts a clean image by adding controlled amounts of Gaussian noise while simultaneously scaling down the original image's influence. For the Campanile image, we can observe this corruption process across three critical timesteps: At t=250, while the tower's basic structure remains visible, random colored noise begins to interfere with the image clarity, particularly noticeable in the sky and around the tower's edges. Moving to t=500, the degradation becomes more severe - the Campanile's distinctive shape is significantly obscured, with the noise patterns creating a colorful static that makes architectural details difficult to discern. Finally, at t=750, the image has almost completely succumbed to random noise, with only the faintest ghost-like outline of the tower barely visible through a dense field of multicolored pixels. This progression demonstrates how the forward process gradually transforms a clear, structured image into increasingly random noise patterns, effectively "forgetting" the original image's information in a controlled manner.

Original Image
Original Image of the Campanile
Original Image
Noise Level 250
Original Image
Noise Level 500
Original Image
Noise Level 750

1.2 Classical Denoising

Applied Gaussian blur filtering to denoise images at timesteps [250, 500, 750]. Results showed limitations of classical methods in handling complex noise patterns.

Original Image
t=250
Original Image
t=500
Original Image
t=750
Original Image
Gaussian Blur Denoising at t=250
Original Image
Gaussian Blur Denoising at t=500
Original Image
Gaussian Blur Denoising at t=750

1.3 One-Step Denoising

Utilized the pretrained UNet model to estimate and remove noise in a single step. Results demonstrated significant improvement over classical methods.

Original Image
t=250
Original Image
t=500
Original Image
t=750
Original Image
One Step Denoising at t=250
Original Image
One Step Denoising at t=500
Original Image
One Step Denoising at t=750

1.4 Iterative Denoising

The iterative denoising process demonstrates how diffusion models can effectively restore images through multiple steps, starting from highly noisy states. Using a stride of 30 timesteps from 990 down to 0, we can observe the gradual improvement in image quality as the model progressively removes noise. When starting at timestep, the process shows remarkable improvement over single-step denoising or classical Gaussian blurring methods. Each denoising iteration applies a careful balance of noise removal and signal preservation, guided by the model's learned understanding of natural images. The results clearly show how multiple smaller denoising steps produce significantly better results than attempting to denoise in a single large step, with the final image retaining more detail and natural characteristics of the original image.

Original Image
Noisy Campanile at t=690
Original Image
Noisy Campanile at t=540
Original Image
Noisy Campanile at t=390
Original Image
Noisy Campanile at t=240
Original Image
Noisy Campanile at t=90
Original Image
Iteratively Denoised Campanile
Original Image
Original Image
Original Image
Gaussian Blurred Campanile

1.5 Diffusion Model Sampling

The model, guided by the prompt "a high quality photo," gradually transforms random noise into a coherent image by repeatedly estimating and removing noise, using learned patterns from its training data. Each step in the denoising process brings the image closer to the natural image manifold while maintaining consistency with the text prompt. This demonstrates how diffusion models can create meaningful structure from pure randomness through a series of small, controlled steps that gradually increase the signal-to-noise ratio until a clear image emerges.

Original Image
Sample 1
Original Image
Sample 2
Original Image
Sample 3
Original Image
Sample 4
Original Image
Sample 5

1.6 Classifier-Free Guidance

Classifier-Free Guidance (CFG) enhances image generation by combining two different noise estimates: a conditional estimate based on the given prompt ("a high quality photo") and an unconditional estimate using an empty prompt (""). The process uses a scaling factor (cfg_scale) to blend these estimates according to the formula: noise_est = noise_est_uncond + cfg_scale * (noise_est_cond - noise_est_uncond). When cfg_scale > 1, we amplify the difference between conditional and unconditional estimates, which tends to produce higher quality images that more strongly align with the prompt. The iterative_denoise_cfg function implements this by running the UNet model twice at each denoising step - once with the prompt embeddings and once with null embeddings - then combining their predictions using the CFG formula before proceeding with the denoising step. This dual prediction approach helps the model generate images that are both high quality and more faithful to the desired prompt. We used a cfg_scale of 7 for these images below:

Original Image
Sample 1
Original Image
Sample 2
Original Image
Sample 3
Original Image
Sample 4
Original Image
Sample 5

1.7 Image Translation and Conditioning

Explored SDEdit-style image editing with varying noise levels [1, 3, 5, 7, 10, 20] and text conditioning.

Original Image
Campanile at i_start=1
Original Image
Campanile at i_start=3
Original Image
Campanile at i_start=5
Original Image
Campanile at i_start=7
Original Image
Campanile at i_start=10
Original Image
Campanile at i_start=20
Original Image
Campanile (original)
Original Image
Hand-drawn Image 1 at i_start=1
Original Image
Hand-drawn Image 1 at i_start=3
Original Image
Hand-drawn Image 1 at i_start=5
Original Image
Hand-drawn Image 1 at i_start=7
Original Image
Hand-drawn Image 1 at i_start=10
Original Image
Hand-drawn Image 1 at i_start=20
Original Image
Hand-drawn Image 1 (original)
Original Image
Hand-drawn Image 1 at i_start=1
Original Image
Hand-drawn Image 2 at i_start=3
Original Image
Hand-drawn Image 2 at i_start=5
Original Image
Hand-drawn Image 2 at i_start=7
Original Image
Hand-drawn Image 2 at i_start=10
Original Image
Hand-drawn Image 2 at i_start=20
Original Image
Hand-drawn Image 2 (original)
Original Image
Hand-drawn Image 1 at i_start=1
Original Image
Web Image at i_start=3
Original Image
Web Image at i_start=5
Original Image
Web Image at i_start=7
Original Image
Web Image at i_start=10
Original Image
Web Image at i_start=20
Original Image
Web Image (original)

Now we explored inpainting. Inpainting with diffusion models allows us to selectively modify parts of an image while preserving the rest. The process involves using a binary mask to specify which areas to edit (mask=1) and which to preserve (mask=0). At each denoising step, we generate new content for the masked regions while keeping the unmasked regions identical to the original image, adding the appropriate amount of noise for each timestep. This creates a seamless blend between the preserved and generated parts of the image. Below are the results of applying inpainting to our images:

Original Image
Campanile
Original Image
Mask for Campanile
Original Image
Area to replace
Original Image
Campanile Inpainted
Original Image
Qutb Minar
Original Image
Mask for Qutb
Original Image
Area to replace
Original Image
Qutb Inpainted
Original Image
Taj Mahal
Original Image
Mask for Campanile
Original Image
Area to replace
Original Image
Taj Inpainted

We also tried applying text-conditioned image to image translation. For the images of the campanile we used the prompt "a rocket ship" and for the Skyscraper Image we used "a lithograph of a waterfall." Below are the results:

Original Image
A rocket ship at noise level 1
Original Image
A rocket ship at noise level 3
Original Image
A rocket ship at noise level 5
Original Image
A rocket ship at noise level 7
Original Image
A rocket ship at noise level 10
Original Image
A rocket ship at noise level 20
Original Image
Campanile
Original Image
A lithograph of a waterfall at noise level 1
Original Image
A lithograph of a waterfall at noise level 3
Original Image
A lithograph of a waterfall at noise level 5
Original Image
A lithograph of a waterfall at noise level 7
Original Image
A lithograph of a waterfall at noise level 10
Original Image
A lithograph of a waterfall at noise level 20
Original Image
Skyscraper
Original Image
A rocket ship at noise level 1
Original Image
A rocket ship at noise level 3
Original Image
A rocket ship at noise level 5
Original Image
A rocket ship at noise level 7
Original Image
A rocket ship at noise level 10
Original Image
A rocket ship at noise level 20
Original Image
Qutb Minar

Visual Anagrams

Visual anagrams using diffusion models present an intriguing approach to creating dual-perspective images by cleverly combining noise estimates from two different prompts. Initially, I attempted to create these illusions by simply averaging the noise estimates from both orientations - one for the upright image ("an oil painting of people around a campfire") and one for the inverted image ("an oil painting of an old man"). However, this straightforward averaging approach produced results where neither perspective was clearly visible. Through experimentation, I discovered that adjusting the ratio of the noise estimates, giving slightly more weight (0.6) to the current orientation's noise estimate and less weight (0.4) to the flipped orientation's estimate, produced much more convincing results. This weighted approach helped maintain the clarity of each perspective while still preserving the dual-nature of the image, allowing viewers to see distinctly different scenes depending on the orientation.

Original Image
An Oil Painting of an Old Man (weight 0.7)
Original Image
An Oil Painting of People around a Campfire (weight 0.3)
Original Image
An Oil Painting of a Hot Air Balloon (weight 0.5)
Original Image
An Oil Painting of an Onion (weight 0.5)
Original Image
A lithograph of a skull (weight 0.5)
Original Image
A lithograph of a waterfall (weight 0.5)

Hybrid Images

In this section, we implemented Factorized Diffusion to create hybrid images by manipulating frequency components of noise estimates from different text prompts. Using a Gaussian blur with kernel size 33 and sigma 2, we separated noise estimates into low and high-frequency components. For example, in our skull/waterfall hybrid, we combined the low frequencies from the skull noise estimate with the high frequencies from the waterfall noise estimate. This created an image that appears as a skull when viewed from afar (where low frequencies dominate) and transforms into a waterfall when viewed up close (where high frequencies become visible). We also experimented with other pairs like lithograph of red panda and skull and watercolor painting of seahorse and flowers, demonstrating how this technique can create compelling dual-perception images by leveraging the frequency-based characteristics of human vision..

Original Image
Hybrid Image of the lithograph of a red panda and skull
Original Image
Hybrid Image of the lithograph of a skull and a waterfall
Original Image
Hybrid image of a watercolor painting of a seahorse and flowers

Part B: Building Diffusion Models from Scratch

In this part, we implement diffusion models from the ground up using the MNIST dataset, focusing on building and training various UNet architectures for denoising tasks.

1.1 Single-Step Denoising UNet

Implemented a UNet architecture with the following components:

  • Conv2d layers for maintaining resolution
  • DownConv blocks for 2x downsampling
  • UpConv blocks for 2x upsampling
  • Flatten/Unflatten operations for 7x7 tensor processing
  • Channel-wise concatenation using torch.cat()
UNet Architecture
UNet Architecture for Single-Step Denoising
UNet Architecture
Standard UNet Operations

1.2 Training the Denoiser

Implemented denoising using the L2 loss:

\[ L = E_{z,x}||D_\theta(z) - x||^2 \]
Noise Levels
Varying levels of noise on MNIST digits (σ=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0])
Training Loss
Training Loss Over Time
Training Loss
Results on digits from the test set after 1 epoch of training
Training Loss
Results on digits from the test set after 5 epochs of training

Out of Distribution Testing: our denoiser is trained on sigma = 0.5, Let's see how the denoiser performs on different sigmas that it wasn't trained for.

Training Loss
Results on digits from the test set with varying noise levels.

As we can see it does predict relatively well for other values of σ

3. Time-Conditioned UNet

Enhanced the UNet architecture with time conditioning using FCBlocks:

Time-Conditioned UNet
Time-Conditioned UNet Architecture
Time-Conditioned UNet
FCBlock for conditioning
5 Epoch Samples
Samples after 5 Epochs
20 Epoch Samples
Samples after 20 Epochs
20 Epoch Samples
Training Loss

4. Class-Conditioned UNet

We begin by adding class conditioning the the UNet structure and training the model for improved digit generation. The algorithm we used is shown in the figure below:

Time-Conditioned UNet
Algorithm B.3. Training class-conditioned UNet
5 Epoch Samples
Samples after 1 Epoch
5 Epoch Samples
Samples after 5 Epoch
20 Epoch Samples
Samples after 15 Epochs
20 Epoch Samples
Samples after 20 Epochs
20 Epoch Samples
Training Loss

Reflection

This project has provided a comprehensive exploration of diffusion models and their creative applications. We started with understanding the fundamental concepts of forward and reverse diffusion processes, implementing denoising techniques using both simple Gaussian blur and sophisticated UNet-based approaches. Through classifier-free guidance, we learned how to control image generation using text prompts, demonstrating the power of multimodal AI systems. The project then advanced to more creative applications: SDEdit showed us how to modify existing images with varying degrees of transformation, while inpainting revealed the model's ability to seamlessly fill in masked regions. The culmination in visual anagrams and hybrid images showcased how diffusion models can be manipulated to create sophisticated optical illusions. Throughout these experiments, we gained practical experience in handling GPU memory constraints, managing tensor operations, and fine-tuning parameters like noise levels and weights to achieve desired results. This project has demonstrated both the technical capabilities and creative potential of diffusion models in modern AI-driven image generation and manipulation. Moreover, The implementation highlighted critical aspects of training dynamics, such as the importance of step-by-step loss tracking, effective learning rate scheduling, and finding the right balance between batch size and model capacity. This practical experience with DDPM implementation enhanced our understanding of both the theoretical foundations and practical considerations necessary for building effective diffusion models.