Generative Models
10 interview questions on the models behind DALL-E, Stable Diffusion, ChatGPT, and modern AI art. Increasingly common in interviews as generative AI becomes central to the industry.
Q1: Explain how GANs work. What is the minimax game between generator and discriminator?
A GAN consists of two networks trained adversarially:
Generator G: Takes random noise z ~ N(0, 1) and produces fake data G(z). Goal: fool the discriminator into thinking generated data is real.
Discriminator D: Takes either real data x or fake data G(z) and predicts real (1) or fake (0). Goal: correctly distinguish real from fake.
Minimax objective: min_G max_D [ E[log D(x)] + E[log(1 - D(G(z)))] ]. The discriminator tries to maximize (correctly classify), while the generator tries to minimize (fool the discriminator). At the Nash equilibrium, the generator produces data indistinguishable from real data, and the discriminator outputs 0.5 for everything.
Training alternation: 1) Update D on a batch of real and fake data. 2) Update G so that D classifies its fakes as real. In practice, G minimizes -log(D(G(z))) (the non-saturating loss) rather than minimizing log(1 - D(G(z))), which avoids vanishing gradients early in training when D confidently rejects fakes.
```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=1, img_size=28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, img_channels * img_size * img_size), nn.Tanh()
        )
        self.img_shape = (img_channels, img_size, img_size)

    def forward(self, z):
        return self.net(z).view(-1, *self.img_shape)

class Discriminator(nn.Module):
    def __init__(self, img_channels=1, img_size=28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(img_channels * img_size * img_size, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, img):
        return self.net(img)

# One alternating training step: update D, then update G
def train_gan_step(G, D, real_imgs, latent_dim, opt_G, opt_D):
    batch_size = real_imgs.size(0)
    real_labels = torch.ones(batch_size, 1, device=real_imgs.device)
    fake_labels = torch.zeros(batch_size, 1, device=real_imgs.device)
    criterion = nn.BCELoss()
    # Train Discriminator (detach fakes so gradients do not flow into G)
    z = torch.randn(batch_size, latent_dim, device=real_imgs.device)
    fake_imgs = G(z).detach()
    d_loss = criterion(D(real_imgs), real_labels) + criterion(D(fake_imgs), fake_labels)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # Train Generator: label fakes as real (non-saturating loss)
    z = torch.randn(batch_size, latent_dim, device=real_imgs.device)
    fake_imgs = G(z)
    g_loss = criterion(D(fake_imgs), real_labels)  # Fool the discriminator
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return g_loss.item(), d_loss.item()
```
Q2: What is mode collapse in GANs? How do you detect and mitigate it?
Mode collapse occurs when the generator produces only a small subset of the data distribution's modes. For example, a GAN trained on MNIST digits might only generate 3s and 7s, ignoring other digits. The generator finds a few outputs that fool the discriminator and exploits them.
Detection:
- Generate a large batch and visually inspect for diversity
- Measure the standard deviation of generated samples (low = collapsed)
- Compute class distributions if labels are available
- Monitor the discriminator loss — if it goes to zero, it has won and the generator may be collapsing
Mitigations:
- Wasserstein GAN (WGAN): Uses the Wasserstein distance instead of JS divergence, providing meaningful gradients even when the distributions do not overlap. Often the most effective single change.
- Spectral normalization: Constrains the Lipschitz constant of the discriminator, stabilizing training.
- Minibatch discrimination: The discriminator looks at batches of samples, not individual ones, detecting lack of diversity.
- Unrolled GANs: The generator optimizes against several future discriminator updates, making it harder to exploit the current discriminator's blind spots and discouraging mode collapse.
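As a concrete illustration of the WGAN direction, here is a minimal sketch of a WGAN-GP critic loss (the gradient-penalty variant, not part of the original WGAN weight-clipping scheme); `critic` is assumed to be any network that maps a batch to unbounded scalar scores:

```python
import torch

def wgan_gp_critic_loss(critic, real, fake, gp_weight=10.0):
    # Wasserstein estimate: the critic should score real samples high and fakes low
    w_loss = critic(fake).mean() - critic(real).mean()
    # Gradient penalty on random interpolates enforces the 1-Lipschitz constraint
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return w_loss + gp_weight * gp
```

Note there is no sigmoid on the critic output: the Wasserstein objective needs unbounded scores, which is part of why its gradients stay informative.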
Q3: How do Variational Autoencoders (VAEs) work? What is the reparameterization trick?
VAE architecture: Encoder maps input x to a distribution q(z|x) = N(mu, sigma^2) in latent space. Decoder maps sampled z back to reconstructed x. Trained to minimize reconstruction loss + KL divergence between q(z|x) and prior p(z) = N(0, 1).
Loss: L = E[||x - decoder(z)||^2] + KL(q(z|x) || p(z)). The first term ensures good reconstructions. The second term regularizes the latent space to be smooth and continuous, enabling generation by sampling z ~ N(0, 1).
Reparameterization trick: We cannot backpropagate through the sampling operation z ~ N(mu, sigma^2). Instead, sample epsilon ~ N(0, 1) and compute z = mu + sigma * epsilon. Now mu and sigma receive gradients through the deterministic computation, while the stochasticity comes from epsilon.
VAE vs GAN: VAEs produce blurrier samples but have stable training, a meaningful latent space (interpolation works smoothly), and an explicit likelihood objective. GANs produce sharper samples but suffer from training instability and mode collapse.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)  # Sample from N(0, 1)
        return mu + eps * std        # z = mu + sigma * epsilon

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    recon_loss = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_loss
```
Q4: How do diffusion models work? Walk through the forward and reverse processes.
Forward process (adding noise): Gradually add Gaussian noise to a clean image over T steps (typically T=1000). At each step t: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1-alpha_t) * epsilon, where epsilon ~ N(0, I). After many steps, x_T is approximately pure Gaussian noise. This is a fixed process (no learnable parameters).
Reverse process (denoising): A neural network (typically a U-Net) learns to reverse each noising step. Given noisy image x_t and timestep t, predict the noise epsilon that was added: epsilon_theta(x_t, t). The network is trained with a simple MSE loss: L = ||epsilon - epsilon_theta(x_t, t)||^2.
Sampling (generation): Start from pure noise x_T ~ N(0, I). For each step from T down to 1, predict the noise and partially remove it: x_{t-1} = (1/sqrt(alpha_t)) * (x_t - (beta_t/sqrt(1-alpha_bar_t)) * epsilon_theta(x_t, t)) + sigma_t * z, where beta_t = 1 - alpha_t, alpha_bar_t is the cumulative product of the alphas up to t, and z ~ N(0, I) is fresh noise (omitted at the final step).
Why diffusion models work better than GANs: Stable training (no adversarial game), no mode collapse, explicit likelihood objective, and the denoising formulation is well-suited for U-Net architectures that already excel at image-to-image tasks.
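The training loss described above fits in a few lines, because composing the per-step forward updates gives a closed form for x_t in terms of alpha_bar_t. The linear beta schedule and the `model(x_t, t)` signature below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ddpm_train_loss(model, x0, T=1000):
    # Linear beta schedule (assumed for illustration); alpha_bar is the running product of alphas
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.size(0),))          # a random timestep per sample
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps    # closed-form forward process: noise x0 in one shot
    return F.mse_loss(model(x_t, t), eps)           # simple noise-prediction MSE
```

The one-shot noising is what makes training cheap: no need to simulate all t forward steps per example.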
Q5: What is classifier-free guidance? How does it improve conditional generation?
Problem: Conditional diffusion models (e.g., text-to-image) need to balance sample quality with diversity. Without guidance, samples are diverse but may not match the conditioning well.
Classifier-free guidance: During training, randomly drop the conditioning (e.g., text prompt) with probability p (e.g., 10%). The model learns both conditional epsilon_theta(x_t, t, c) and unconditional epsilon_theta(x_t, t) generation.
At inference: epsilon_guided = epsilon_unconditional + w * (epsilon_conditional - epsilon_unconditional), where w is the guidance scale (typically 7-15 for text-to-image). Higher w produces samples that more closely match the conditioning but with less diversity.
Why "classifier-free": The original approach (classifier guidance) required training a separate classifier on noisy images. Classifier-free guidance achieves the same effect without a separate classifier, by training the diffusion model itself to work both conditionally and unconditionally.
Q6: How do autoregressive models generate data? Compare with diffusion models.
Autoregressive models factor the joint probability as a product of conditionals: p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... Each element is generated one at a time, conditioned on all previous elements. Examples: GPT (text tokens), PixelCNN (image pixels), WaveNet (audio samples).
| Aspect | Autoregressive | Diffusion |
|---|---|---|
| Generation | Sequential (one token at a time) | Iterative (denoise full image T times) |
| Speed | O(n) steps for n tokens | O(T) steps (T=20-1000) for entire image |
| Likelihood | Exact log-likelihood | Variational lower bound (ELBO) |
| Best at | Text, code, sequential data | Images, video, audio |
| Architecture | Transformer decoder (GPT) | U-Net or DiT (Diffusion Transformer) |
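The sequential factorization translates directly into a sampling loop. A minimal sketch, assuming `model` maps token ids of shape (batch, length) to logits of shape (batch, length, vocab):

```python
import torch

def sample_autoregressive(model, prompt, n_new, temperature=1.0):
    tokens = prompt.clone()
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :] / temperature  # distribution over the next token only
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)               # sample one token id
        tokens = torch.cat([tokens, nxt], dim=1)        # append, then recondition on everything
    return tokens
```

The loop makes the O(n) cost visible: every new token requires a fresh forward pass over the growing sequence (mitigated in practice by KV caching).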
Q7: What is FID (Frechet Inception Distance) and how does it evaluate generative models?
FID measures the distance between the distribution of generated images and real images in a feature space (the penultimate layer of InceptionV3).
Computation: 1) Extract Inception features for both real and generated images. 2) Fit Gaussian distributions to each set: (mu_r, Sigma_r) and (mu_g, Sigma_g). 3) FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*(Sigma_r * Sigma_g)^(1/2)).
Interpretation: Lower FID = better. FID=0 means identical distributions. Good image models achieve FID < 10 on standard benchmarks.
Limitations: Relies on InceptionV3 features (may miss textures it does not represent well), needs at least ~10K samples for stable estimates, only captures first two moments (mean and covariance) of the distributions.
Inception Score (IS): Measures quality (confident predictions) and diversity (uniform class distribution) using IS = exp(E[KL(p(y|x) || p(y))]). Higher is better. Less commonly used than FID because it does not compare against real data.
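The three computation steps above can be sketched directly (using scipy for the matrix square root); `feats_real` and `feats_gen` are assumed to be (n_samples, feat_dim) arrays of Inception activations:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):            # numerical error can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))
```

Identical feature sets give FID ~ 0, and shifting one set's mean inflates the score through the ||mu_r - mu_g||^2 term.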
Q8: What is latent diffusion (Stable Diffusion)? Why work in latent space?
Problem with pixel-space diffusion: Running diffusion on 512x512x3 images is extremely expensive. Each denoising step processes 786K-dimensional vectors.
Latent diffusion: First train a VAE to compress images into a much smaller latent space (e.g., 512x512 → 64x64x4 = 16K dimensions, a 50x reduction). Then run the diffusion process in this latent space. Finally, decode the denoised latents back to pixel space.
Stable Diffusion architecture:
- VAE encoder/decoder: Compresses/decompresses images (trained separately)
- U-Net: Denoises in latent space, conditioned on timestep and text embeddings
- Text encoder (CLIP): Encodes text prompts, injected via cross-attention layers in the U-Net
Benefits: 10-100x cheaper to train and sample than pixel-space diffusion, while maintaining high quality. This is what made text-to-image generation practical.
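End to end, sampling looks like the sketch below: the entire denoising loop runs on small latents and the VAE decoder is called once at the end. The `unet` and `vae_decode` callables, the linear schedule, and the DDIM-style deterministic update are all illustrative assumptions, not the exact Stable Diffusion sampler:

```python
import torch

def latent_diffusion_sample(unet, vae_decode, text_emb, steps=50, latent_shape=(1, 4, 64, 64)):
    betas = torch.linspace(1e-4, 0.02, steps)       # assumed linear schedule
    alpha_bar = torch.cumprod(1 - betas, dim=0)
    z = torch.randn(latent_shape)                   # start from latent-space noise, not pixels
    for t in reversed(range(steps)):
        eps = unet(z, t, text_emb)                  # noise prediction, conditioned on text
        # DDIM-style deterministic update: estimate the clean latent, re-noise to step t-1
        z0 = (z - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        z = ab_prev.sqrt() * z0 + (1 - ab_prev).sqrt() * eps
    return vae_decode(z)                            # a single decode back to pixel space
```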
Q9: Compare GANs, VAEs, diffusion models, and autoregressive models. When would you choose each?
| Model | Strengths | Weaknesses | Best For |
|---|---|---|---|
| GANs | Fast sampling, sharp images | Training instability, mode collapse, no likelihood | Real-time image synthesis, style transfer, super-resolution |
| VAEs | Stable training, smooth latent space, explicit ELBO | Blurry samples, posterior collapse | Latent space interpolation, anomaly detection, representation learning |
| Diffusion | Best image quality, stable training, no mode collapse | Slow sampling (many denoising steps), expensive training | Text-to-image, inpainting, video generation, audio synthesis |
| Autoregressive | Exact likelihood, flexible conditioning, scales well | Sequential generation (slow for images), exposure bias | Text generation (GPT), code completion, speech synthesis |
Current trends (2024-2025): Diffusion dominates image/video generation. Autoregressive (Transformers) dominates text. Flow matching is emerging as a simpler alternative to diffusion. Some models combine approaches (autoregressive tokens that control a diffusion process).
Q10: What is flow matching and how does it relate to diffusion models?
Flow matching is a simpler framework for generative modeling that constructs a continuous path (flow) between noise and data distributions. Instead of the noising/denoising formulation of diffusion, flow matching directly learns a velocity field that transforms noise into data.
Key idea: Define a path x_t = (1-t) * epsilon + t * x from noise epsilon (at t=0) to data x (at t=1). Train a network v_theta(x_t, t) to predict the velocity dx_t/dt = x - epsilon along this path.
Advantages over diffusion:
- Simpler formulation (no noise schedule to design)
- Straighter paths enable fewer sampling steps
- Easier to understand and implement
- Works well with ODE solvers for deterministic sampling
Used in: Stable Diffusion 3, Flux, and other recent image generation models. Increasingly replacing the DDPM-style diffusion formulation.
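The path and velocity target above translate into a very short training loss, and sampling is just ODE integration (plain Euler steps here); the `v_model(x_t, t)` signature is an assumption:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_model, x):
    eps = torch.randn_like(x)                            # noise endpoint (t=0)
    t = torch.rand(x.size(0), *([1] * (x.dim() - 1)))    # random time in [0, 1]
    x_t = (1 - t) * eps + t * x                          # straight path from noise to data
    return F.mse_loss(v_model(x_t, t), x - eps)          # regress the constant velocity x - eps

def euler_sample(v_model, shape, steps=20):
    x = torch.randn(shape)                               # start at the noise end (t=0)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1), i * dt)
        x = x + dt * v_model(x, t)                       # one deterministic Euler step
    return x
```

Because the target paths are straight, a well-trained velocity field tolerates far fewer integration steps than DDPM-style sampling.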
Key Takeaways
- GANs use adversarial training (minimax game); mode collapse is their main failure mode; WGAN and spectral norm help
- VAEs learn a smooth latent space via the reparameterization trick; the ELBO = reconstruction + KL divergence
- Diffusion models add noise then learn to denoise; stable training, no mode collapse, best image quality
- Classifier-free guidance improves conditional generation by training both conditional and unconditional
- Latent diffusion (Stable Diffusion) runs diffusion in compressed latent space for 50x efficiency
- FID measures distribution distance in Inception feature space; lower is better
- Flow matching is a simpler alternative to diffusion gaining adoption in latest models
Lilly Tech Systems