
How Stable Diffusion Works

Understand the diffusion process, the key components (U-Net, VAE, CLIP), and why working in latent space makes it all possible on consumer hardware.

The Core Idea: Diffusion

Diffusion models learn by destroying and reconstructing images. During training, a fixed forward process gradually adds noise to images until they become pure static. The model then learns to reverse that process — starting from random noise and, step by step, recovering a clean image.

The Diffusion Process
Forward Process (Training):
Clean Image -> Add Noise -> More Noise -> ... -> Pure Noise
   step 0        step 1       step 2          step T

Reverse Process (Generation):
Pure Noise -> Remove Noise -> Less Noise -> ... -> Clean Image
   step T      step T-1       step T-2         step 0

The model learns to predict and remove noise at each step,
guided by the text prompt you provide.
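The forward process above can be sketched in a few lines of NumPy. This is a toy on a 1-D "image" with a simple linear noise schedule (illustrative values, not the exact schedule Stable Diffusion uses); the closed-form expression lets training jump straight to any step t instead of noising one step at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = np.array([1.0, -0.5, 0.25, 0.75])    # toy clean "image"
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # per-step noise variance
alpha_bars = np.cumprod(1.0 - betas)      # cumulative signal retention

def add_noise(x0, t, noise):
    """Forward process: jump straight to step t in closed form."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

noise = rng.standard_normal(x0.shape)
x_early = add_noise(x0, 10, noise)    # still mostly signal
x_late = add_noise(x0, 999, noise)    # almost pure noise

# Early steps keep most of the image; by step T it is essentially static.
print(np.sqrt(alpha_bars[10]))   # close to 1: signal dominates
print(np.sqrt(alpha_bars[999]))  # close to 0: noise dominates
```

The signal coefficient shrinks smoothly from ~1 toward ~0, which is exactly the "Clean Image → ... → Pure Noise" arrow in the diagram above.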

Key Components

1. CLIP Text Encoder

CLIP (Contrastive Language-Image Pre-training) converts your text prompt into a numerical representation (embedding) that the model can understand. It was trained on roughly 400 million image-text pairs to learn the relationship between words and visual concepts.

2. U-Net (Noise Predictor)

The U-Net is the core of the model. At each denoising step, it takes the current noisy image and the text embedding, and predicts the noise to remove. It has an encoder-decoder architecture with skip connections that preserve fine details.
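The skip-connection idea can be shown with a shape-only sketch. A real U-Net uses learned convolutions and attention; here the encoder just downsamples, saving each resolution, and the decoder upsamples and concatenates the saved features back in — that concatenation is how fine detail survives the bottleneck.

```python
import numpy as np

def avg_pool2(x):               # downsample H and W by 2
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):               # nearest-neighbour upsample by 2
    return x.repeat(2, axis=1).repeat(2, axis=2)

latent = np.zeros((4, 64, 64))  # SD 1.x latent shape (channels, H, W)

# Encoder: downsample, saving each resolution for the skip path.
skips = []
x = latent
for _ in range(2):
    skips.append(x)
    x = avg_pool2(x)            # (4, 32, 32), then (4, 16, 16)

# Decoder: upsample and concatenate the matching skip along channels.
for skip in reversed(skips):
    x = upsample2(x)
    x = np.concatenate([x, skip], axis=0)  # skip carries fine detail across

print(x.shape)  # spatial size restored; channels grew from the skips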

3. VAE (Variational Autoencoder)

The VAE compresses images from pixel space (512x512x3 = 786,432 values) to a much smaller latent space (64x64x4 = 16,384 values) — a 48x compression. This is what makes Stable Diffusion run on consumer GPUs.
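The compression ratio quoted above is just arithmetic on the two tensor shapes:

```python
# Pixel space vs. latent space for a 512x512 image in SD 1.x:
pixel_values = 512 * 512 * 3      # 786,432 values per image
latent_values = 64 * 64 * 4       # 16,384 values per latent
print(pixel_values // latent_values)  # 48x fewer values to denoise
```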

The Generation Pipeline

Step-by-Step Generation
Step 1: Text Encoding
  Your prompt -> CLIP -> Text embeddings (77x768 matrix)

Step 2: Initial Noise
  Generate random noise in latent space (64x64x4)
  The seed number determines this starting noise

Step 3: Iterative Denoising (20-50 steps)
  For each step:
    U-Net predicts noise in current latent
    Scheduler removes predicted noise
    Text embeddings guide what emerges
    CFG scale controls how much to follow the prompt

Step 4: VAE Decoding
  Denoised latent (64x64x4) -> VAE Decoder -> Image (512x512x3)
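The four steps above fit into one short loop. This is a skeleton with placeholder stubs standing in for the real CLIP, U-Net, scheduler, and VAE — the function names here are illustrative, not library APIs — but the shapes and control flow match the pipeline described.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed fixes the starting noise

# Illustrative stubs (not real model calls):
def encode_text(prompt):
    return np.zeros((77, 768))              # CLIP text embeddings

def unet_predict_noise(latent, embeddings, t):
    return 0.1 * latent                     # stub noise prediction

def scheduler_step(latent, predicted_noise):
    return latent - predicted_noise         # remove predicted noise

def vae_decode(latent):
    return np.zeros((512, 512, 3))          # decoded RGB image

# Step 1: text encoding
embeddings = encode_text("a cat on a windowsill")
# Step 2: initial noise in latent space
latent = rng.standard_normal((4, 64, 64))
# Step 3: iterative denoising
for t in range(25, 0, -1):
    noise_pred = unet_predict_noise(latent, embeddings, t)
    latent = scheduler_step(latent, noise_pred)
# Step 4: VAE decoding
image = vae_decode(latent)
print(image.shape)  # (512, 512, 3)
```

Note that everything before Step 4 happens at 64x64x4 — the VAE decoder is the only place the full-resolution image exists.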

Latent Space: The Key Innovation

The word "Latent" in Latent Diffusion Model is critical. Instead of denoising in full pixel space (which would require enormous compute), the entire diffusion process happens in a compressed latent space. This is why Stable Diffusion can run on an 8GB GPU while producing 512x512 images.

Classifier-Free Guidance (CFG)

The CFG scale (typically 7-12) controls how closely the model follows your prompt. At each denoising step, the model generates two predictions: one with your prompt and one without. The difference is amplified by the CFG scale:

  • CFG 1: No amplification — the prompt has only weak influence; creative but unpredictable
  • CFG 7: Good balance of creativity and prompt adherence (recommended default)
  • CFG 15+: Very literal interpretation, can become oversaturated or distorted
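The standard guidance formula behind these settings is uncond + scale x (cond - uncond): the difference between the two predictions is scaled up. A small numeric example (the vectors here are made-up stand-ins for full latent-sized noise predictions):

```python
import numpy as np

pred_uncond = np.array([0.2, -0.1, 0.4])   # prediction without the prompt
pred_cond = np.array([0.5, 0.1, 0.3])      # prediction with the prompt

def apply_cfg(uncond, cond, scale):
    """Classifier-free guidance: amplify the prompt's contribution."""
    return uncond + scale * (cond - uncond)

print(apply_cfg(pred_uncond, pred_cond, 1.0))   # scale 1: just the conditional prediction
print(apply_cfg(pred_uncond, pred_cond, 7.0))   # scale 7: difference amplified 7x
```

At scale 1 the result equals the conditional prediction; higher scales push the output further in the prompt's direction, which is why very high values can oversaturate.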
💡 Key takeaway: Stable Diffusion works by starting from noise and iteratively removing it, guided by your text prompt. The latent space compression is what makes it fast enough to run on consumer hardware.

What's Next?

Now that you understand how the model works, the next lesson covers prompt crafting — how to write effective prompts that produce the images you envision.