Generative Models
10 interview questions on the models behind DALL-E, Stable Diffusion, ChatGPT, and modern AI art. Increasingly common in interviews as generative AI becomes central to the industry.
Q1: Explain how GANs work. What is the minimax game between generator and discriminator?
A GAN consists of two networks trained adversarially:
Generator G: Takes random noise z ~ N(0, 1) and produces fake data G(z). Goal: fool the discriminator into thinking generated data is real.
Discriminator D: Takes either real data x or fake data G(z) and predicts real (1) or fake (0). Goal: correctly distinguish real from fake.
Minimax objective: min_G max_D [ E[log D(x)] + E[log(1 - D(G(z)))] ]. The discriminator tries to maximize (correctly classify), while the generator tries to minimize (fool the discriminator). At the Nash equilibrium, the generator produces data indistinguishable from real data, and the discriminator outputs 0.5 for everything.
Training alternation: 1) Update D on a batch of real and fake data. 2) Update G so that D classifies its fakes as real. In practice, G minimizes -log(D(G(z))) (the non-saturating loss) rather than minimizing log(1 - D(G(z))), which avoids vanishing gradients early in training when D confidently rejects fakes.
```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_channels=1, img_size=28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, img_channels * img_size * img_size), nn.Tanh()
        )
        self.img_shape = (img_channels, img_size, img_size)

    def forward(self, z):
        return self.net(z).view(-1, *self.img_shape)

class Discriminator(nn.Module):
    def __init__(self, img_channels=1, img_size=28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(img_channels * img_size * img_size, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, img):
        return self.net(img)

# One alternating training step: update D, then update G
def train_gan_step(G, D, real_imgs, latent_dim, opt_G, opt_D):
    batch_size = real_imgs.size(0)
    real_labels = torch.ones(batch_size, 1, device=real_imgs.device)
    fake_labels = torch.zeros(batch_size, 1, device=real_imgs.device)
    criterion = nn.BCELoss()
    # Train Discriminator (detach fakes so gradients do not flow into G)
    z = torch.randn(batch_size, latent_dim, device=real_imgs.device)
    fake_imgs = G(z).detach()
    d_loss = criterion(D(real_imgs), real_labels) + criterion(D(fake_imgs), fake_labels)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # Train Generator: label fakes as real (non-saturating loss)
    z = torch.randn(batch_size, latent_dim, device=real_imgs.device)
    fake_imgs = G(z)
    g_loss = criterion(D(fake_imgs), real_labels)  # Fool the discriminator
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return g_loss.item(), d_loss.item()
```
Q2: What is mode collapse in GANs? How do you detect and mitigate it?
Mode collapse occurs when the generator produces only a small subset of the data distribution's modes. For example, a GAN trained on MNIST digits might only generate 3s and 7s, ignoring other digits. The generator finds a few outputs that fool the discriminator and exploits them.
Detection:
- Generate a large batch and visually inspect for diversity
- Measure the standard deviation of generated samples (low = collapsed)
- Compute class distributions if labels are available
- Monitor the discriminator loss — if it goes to zero, it has won and the generator may be collapsing
Mitigations:
- Wasserstein GAN (WGAN): Uses the Wasserstein distance instead of JS divergence, providing meaningful gradients even when the distributions do not overlap. Often the most effective single change.
- Spectral normalization: Constrains the Lipschitz constant of the discriminator, stabilizing training.
- Minibatch discrimination: The discriminator looks at batches of samples, not individual ones, detecting lack of diversity.
- Unrolled GANs: The generator optimizes against several future discriminator updates, making it harder to exploit the current discriminator's blind spots and discouraging mode collapse.
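As a concrete illustration of the WGAN direction, here is a minimal sketch of a WGAN-GP critic loss (the gradient-penalty variant, not part of the original WGAN weight-clipping scheme); `critic` is assumed to be any network that maps a batch to unbounded scalar scores:

```python
import torch

def wgan_gp_critic_loss(critic, real, fake, gp_weight=10.0):
    # Wasserstein estimate: the critic should score real samples high and fakes low
    w_loss = critic(fake).mean() - critic(real).mean()
    # Gradient penalty on random interpolates enforces the 1-Lipschitz constraint
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return w_loss + gp_weight * gp
```

Note there is no sigmoid on the critic output: the Wasserstein objective needs unbounded scores, which is part of why its gradients stay informative.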
Q3: How do Variational Autoencoders (VAEs) work? What is the reparameterization trick?
VAE architecture: Encoder maps input x to a distribution q(z|x) = N(mu, sigma^2) in latent space. Decoder maps sampled z back to reconstructed x. Trained to minimize reconstruction loss + KL divergence between q(z|x) and prior p(z) = N(0, 1).
Loss: L = E[||x - decoder(z)||^2] + KL(q(z|x) || p(z)). The first term ensures good reconstructions. The second term regularizes the latent space to be smooth and continuous, enabling generation by sampling z ~ N(0, 1).
Reparameterization trick: We cannot backpropagate through the sampling operation z ~ N(mu, sigma^2). Instead, sample epsilon ~ N(0, 1) and compute z = mu + sigma * epsilon. Now mu and sigma receive gradients through the deterministic computation, while the stochasticity comes from epsilon.
VAE vs GAN: VAEs produce blurrier samples but have stable training, a meaningful latent space (interpolation works smoothly), and an explicit likelihood objective. GANs produce sharper samples but suffer from training instability and mode collapse.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)  # Sample from N(0, 1)
        return mu + eps * std        # z = mu + sigma * epsilon

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar

def vae_loss(recon_x, x, mu, logvar):
    recon_loss = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_loss
```
Q4: How do diffusion models work? Walk through the forward and reverse processes.
Forward process (adding noise): Gradually add Gaussian noise to a clean image over T steps (typically T=1000). At each step t: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1-alpha_t) * epsilon, where epsilon ~ N(0, I). After many steps, x_T is approximately pure Gaussian noise. This is a fixed process (no learnable parameters).
Reverse process (denoising): A neural network (typically a U-Net) learns to reverse each noising step. Given noisy image x_t and timestep t, predict the noise epsilon that was added: epsilon_theta(x_t, t). The network is trained with a simple MSE loss: L = ||epsilon - epsilon_theta(x_t, t)||^2.
Sampling (generation): Start from pure noise x_T ~ N(0, I). For each step from T down to 1, predict the noise and partially remove it: x_{t-1} = (1/sqrt(alpha_t)) * (x_t - (beta_t/sqrt(1-alpha_bar_t)) * epsilon_theta(x_t, t)) + sigma_t * z, where beta_t = 1 - alpha_t, alpha_bar_t is the cumulative product of the alphas up to t, and z ~ N(0, I) is fresh noise (omitted at the final step).
Why diffusion models work better than GANs: Stable training (no adversarial game), no mode collapse, explicit likelihood objective, and the denoising formulation is well-suited for U-Net architectures that already excel at image-to-image tasks.
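The training loss described above fits in a few lines, because composing the per-step forward updates gives a closed form for x_t in terms of alpha_bar_t. The linear beta schedule and the `model(x_t, t)` signature below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ddpm_train_loss(model, x0, T=1000):
    # Linear beta schedule (assumed for illustration); alpha_bar is the running product of alphas
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.size(0),))          # a random timestep per sample
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps    # closed-form forward process: noise x0 in one shot
    return F.mse_loss(model(x_t, t), eps)           # simple noise-prediction MSE
```

The one-shot noising is what makes training cheap: no need to simulate all t forward steps per example.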
Q5: What is classifier-free guidance? How does it improve conditional generation?
Problem: Conditional diffusion models (e.g., text-to-image) need to balance sample quality with diversity. Without guidance, samples are diverse but may not match the conditioning well.
Classifier-free guidance: During training, randomly drop the conditioning (e.g., text prompt) with probability p (e.g., 10%). The model learns both conditional epsilon_theta(x_t, t, c) and unconditional epsilon_theta(x_t, t) generation.
At inference: epsilon_guided = epsilon_unconditional + w * (epsilon_conditional - epsilon_unconditional), where w is the guidance scale (typically 7-15 for text-to-image). Higher w produces samples that more closely match the conditioning but with less diversity.
Why "classifier-free": The original approach (classifier guidance) required training a separate classifier on noisy images. Classifier-free guidance achieves the same effect without a separate classifier, by training the diffusion model itself to work both conditionally and unconditionally.
Q6: How do autoregressive models generate data? Compare with diffusion models.
Autoregressive models factor the joint probability as a product of conditionals: p(x) = p(x_1) * p(x_2|x_1) * p(x_3|x_1,x_2) * ... Each element is generated one at a time, conditioned on all previous elements. Examples: GPT (text tokens), PixelCNN (image pixels), WaveNet (audio samples).
| Aspect | Autoregressive | Diffusion |
|---|---|---|
| Generation | Sequential (one token at a time) | Iterative (denoise full image T times) |
| Speed | O(n) steps for n tokens | O(T) steps (T=20-1000) for entire image |
| Likelihood | Exact log-likelihood | Variational lower bound (ELBO) |
| Best at | Text, code, sequential data | Images, video, audio |
| Architecture | Transformer decoder (GPT) | U-Net or DiT (Diffusion Transformer) |
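The sequential factorization translates directly into a sampling loop. A minimal sketch, assuming `model` maps token ids of shape (batch, length) to logits of shape (batch, length, vocab):

```python
import torch

def sample_autoregressive(model, prompt, n_new, temperature=1.0):
    tokens = prompt.clone()
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :] / temperature  # distribution over the next token only
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)               # sample one token id
        tokens = torch.cat([tokens, nxt], dim=1)        # append, then recondition on everything
    return tokens
```

The loop makes the O(n) cost visible: every new token requires a fresh forward pass over the growing sequence (mitigated in practice by KV caching).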
Q7: What is FID (Frechet Inception Distance) and how does it evaluate generative models?
FID measures the distance between the distribution of generated images and real images in a feature space (the penultimate layer of InceptionV3).
Computation: 1) Extract Inception features for both real and generated images. 2) Fit Gaussian distributions to each set: (mu_r, Sigma_r) and (mu_g, Sigma_g). 3) FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*(Sigma_r * Sigma_g)^(1/2)).
Interpretation: Lower FID = better. FID=0 means identical distributions. Good image models achieve FID < 10 on standard benchmarks.
Limitations: Relies on InceptionV3 features (may miss textures it does not represent well), needs at least ~10K samples for stable estimates, only captures first two moments (mean and covariance) of the distributions.
Inception Score (IS): Measures quality (confident predictions) and diversity (uniform class distribution) using IS = exp(E[KL(p(y|x) || p(y))]). Higher is better. Less commonly used than FID because it does not compare against real data.
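The three computation steps above can be sketched directly (using scipy for the matrix square root); `feats_real` and `feats_gen` are assumed to be (n_samples, feat_dim) arrays of Inception activations:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):            # numerical error can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))
```

Identical feature sets give FID ~ 0, and shifting one set's mean inflates the score through the ||mu_r - mu_g||^2 term.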
Q8: What is latent diffusion (Stable Diffusion)? Why work in latent space?
Problem with pixel-space diffusion: Running diffusion on 512x512x3 images is extremely expensive. Each denoising step processes 786K-dimensional vectors.
Latent diffusion: First train a VAE to compress images into a much smaller latent space (e.g., 512x512 → 64x64x4 = 16K dimensions, a 50x reduction). Then run the diffusion process in this latent space. Finally, decode the denoised latents back to pixel space.
Stable Diffusion architecture:
- VAE encoder/decoder: Compresses/decompresses images (trained separately)
- U-Net: Denoises in latent space, conditioned on timestep and text embeddings
- Text encoder (CLIP): Encodes text prompts, injected via cross-attention layers in the U-Net
Benefits: 10-100x cheaper to train and sample than pixel-space diffusion, while maintaining high quality. This is what made text-to-image generation practical.
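End to end, sampling looks like the sketch below: the entire denoising loop runs on small latents and the VAE decoder is called once at the end. The `unet` and `vae_decode` callables, the linear schedule, and the DDIM-style deterministic update are all illustrative assumptions, not the exact Stable Diffusion sampler:

```python
import torch

def latent_diffusion_sample(unet, vae_decode, text_emb, steps=50, latent_shape=(1, 4, 64, 64)):
    betas = torch.linspace(1e-4, 0.02, steps)       # assumed linear schedule
    alpha_bar = torch.cumprod(1 - betas, dim=0)
    z = torch.randn(latent_shape)                   # start from latent-space noise, not pixels
    for t in reversed(range(steps)):
        eps = unet(z, t, text_emb)                  # noise prediction, conditioned on text
        # DDIM-style deterministic update: estimate the clean latent, re-noise to step t-1
        z0 = (z - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        z = ab_prev.sqrt() * z0 + (1 - ab_prev).sqrt() * eps
    return vae_decode(z)                            # a single decode back to pixel space
```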
Q9: Compare GANs, VAEs, diffusion models, and autoregressive models. When would you choose each?
| Model | Strengths | Weaknesses | Best For |
|---|---|---|---|
| GANs | Fast sampling, sharp images | Training instability, mode collapse, no likelihood | Real-time image synthesis, style transfer, super-resolution |
| VAEs | Stable training, smooth latent space, explicit ELBO | Blurry samples, posterior collapse | Latent space interpolation, anomaly detection, representation learning |
| Diffusion | Best image quality, stable training, no mode collapse | Slow sampling (many denoising steps), expensive training | Text-to-image, inpainting, video generation, audio synthesis |
| Autoregressive | Exact likelihood, flexible conditioning, scales well | Sequential generation (slow for images), exposure bias | Text generation (GPT), code completion, speech synthesis |
Current trends (2024-2025): Diffusion dominates image/video generation. Autoregressive (Transformers) dominates text. Flow matching is emerging as a simpler alternative to diffusion. Some models combine approaches (autoregressive tokens that control a diffusion process).
Q10: What is flow matching and how does it relate to diffusion models?
Flow matching is a simpler framework for generative modeling that constructs a continuous path (flow) between noise and data distributions. Instead of the noising/denoising formulation of diffusion, flow matching directly learns a velocity field that transforms noise into data.
Key idea: Define a path x_t = (1-t) * epsilon + t * x from noise epsilon (at t=0) to data x (at t=1). Train a network v_theta(x_t, t) to predict the velocity dx_t/dt = x - epsilon along this path.
Advantages over diffusion:
- Simpler formulation (no noise schedule to design)
- Straighter paths enable fewer sampling steps
- Easier to understand and implement
- Works well with ODE solvers for deterministic sampling
Used in: Stable Diffusion 3, Flux, and other recent image generation models. Increasingly replacing the DDPM-style diffusion formulation.
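The path and velocity target above translate into a very short training loss, and sampling is just ODE integration (plain Euler steps here); the `v_model(x_t, t)` signature is an assumption:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_model, x):
    eps = torch.randn_like(x)                            # noise endpoint (t=0)
    t = torch.rand(x.size(0), *([1] * (x.dim() - 1)))    # random time in [0, 1]
    x_t = (1 - t) * eps + t * x                          # straight path from noise to data
    return F.mse_loss(v_model(x_t, t), x - eps)          # regress the constant velocity x - eps

def euler_sample(v_model, shape, steps=20):
    x = torch.randn(shape)                               # start at the noise end (t=0)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1), i * dt)
        x = x + dt * v_model(x, t)                       # one deterministic Euler step
    return x
```

Because the target paths are straight, a well-trained velocity field tolerates far fewer integration steps than DDPM-style sampling.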
Key Takeaways
- GANs use adversarial training (minimax game); mode collapse is their main failure mode; WGAN and spectral norm help
- VAEs learn a smooth latent space via the reparameterization trick; the ELBO = reconstruction + KL divergence
- Diffusion models add noise then learn to denoise; stable training, no mode collapse, best image quality
- Classifier-free guidance improves conditional generation by training both conditional and unconditional
- Latent diffusion (Stable Diffusion) runs diffusion in compressed latent space for 50x efficiency
- FID measures distribution distance in Inception feature space; lower is better
- Flow matching is a simpler alternative to diffusion gaining adoption in latest models
Lilly Tech Systems