How Deepfakes Work
Understanding deepfake generation techniques is essential for building effective detectors. Each generation method leaves different artifacts that detectors can exploit. This lesson covers the major approaches behind modern deepfakes.
Autoencoder-Based Face Swap
The original deepfake method uses a shared encoder with separate decoders for each face. The encoder learns a shared facial representation, while the decoders reconstruct each person's face:
Training
Train a shared encoder and two decoders, one per identity, on face images of person A and person B. Each decoder learns to reconstruct only its own person's face, while the shared encoder forces both faces into a common latent space.
Swapping
To make person A look like person B: encode person A's face with the shared encoder, then decode with person B's decoder. The result has person B's appearance with person A's expression and pose.
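The training and swapping paths can be sketched structurally as follows. This is a minimal sketch, assuming toy linear layers with random, untrained weights and faces flattened into plain lists of numbers; real systems use convolutional networks trained with a reconstruction loss. The class `FaceSwapAutoencoder` and its method names are hypothetical, chosen only for illustration.

```python
import random


def random_matrix(rows, cols, rng):
    """Small random weights standing in for a trained layer (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class FaceSwapAutoencoder:
    """Hypothetical toy model: one shared encoder, one decoder per identity."""

    def __init__(self, input_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        # The encoder is shared, so both identities map into one latent space.
        self.encoder = random_matrix(latent_dim, input_dim, rng)
        # Each identity gets its own decoder.
        self.decoders = {
            "A": random_matrix(input_dim, latent_dim, rng),
            "B": random_matrix(input_dim, latent_dim, rng),
        }

    def encode(self, face):
        return matvec(self.encoder, face)

    def decode(self, latent, identity):
        return matvec(self.decoders[identity], latent)

    def reconstruct(self, face, identity):
        # Training path: a face is decoded by its own identity's decoder.
        return self.decode(self.encode(face), identity)

    def swap(self, source_face, target_identity):
        # Swap path: same encoder, but the other identity's decoder.
        return self.decode(self.encode(source_face), target_identity)


model = FaceSwapAutoencoder(input_dim=16, latent_dim=4)
face_a = [0.5] * 16                 # placeholder for a flattened face crop
swapped = model.swap(face_a, "B")   # person A's latent code, person B's decoder
```

The only difference between reconstruction and swapping is which decoder receives the latent code, which is why the shared encoder is central to the method.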
Blending
The generated face is blended back into the original frame using face alignment and color matching.
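The blending step can be illustrated on a single row of grayscale pixel values. This is a minimal sketch, assuming mean and standard deviation matching as one simple form of color matching and a soft alpha mask for compositing; face alignment is omitted, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def match_color(generated, target):
    """Shift and scale generated pixels to match the target's mean and spread."""
    g_mean, g_std = mean(generated), pstdev(generated)
    t_mean, t_std = mean(target), pstdev(target)
    scale = t_std / g_std if g_std > 0 else 1.0
    return [(value - g_mean) * scale + t_mean for value in generated]


def alpha_blend(generated, original, mask):
    """Composite per pixel: mask 1.0 keeps the generated face, 0.0 the frame."""
    return [m * g + (1.0 - m) * o for g, o, m in zip(generated, original, mask)]


original = [0.50, 0.55, 0.60, 0.65, 0.70]   # pixels from the target frame
generated = [0.20, 0.30, 0.40, 0.50, 0.60]  # darker output from the decoder
mask = [0.0, 0.5, 1.0, 0.5, 0.0]            # feathered mask, soft at the edges

matched = match_color(generated, original)
blended = alpha_blend(matched, original, mask)
```

When the mask edge or the color statistics are slightly off, the result shows the blending boundaries and color mismatch that detectors look for in autoencoder swaps.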
GAN-Based Generation
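Generative Adversarial Networks (GANs) train a generator to produce images and a discriminator to tell them apart from real ones. This competition pushes the generator toward more realistic output, often of higher quality than autoencoder swaps: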
- StyleGAN: Generates photorealistic faces from scratch. Used for "this person does not exist" type fakes.
- CycleGAN: Translates between face domains without paired training data. Useful for face reenactment.
- StarGAN: Multi-domain face attribute transfer, such as changing age, gender, expression, or hairstyle.
- FSGAN: Face swapping and reenactment that works on any face without subject-specific training.
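The generator-discriminator competition can be made concrete with the two loss functions involved. This is a minimal sketch, assuming the standard binary cross-entropy GAN objective with the non-saturating generator loss; the specific models listed above use their own variants. The inputs `d_real` and `d_fake` are assumed to be the discriminator's probability that an image is real, and the function names are hypothetical.

```python
import math

EPS = 1e-12  # keeps log() away from zero


def discriminator_loss(d_real, d_fake):
    """Low when real images score near 1 and generated images score near 0."""
    return -(math.log(max(d_real, EPS)) + math.log(max(1.0 - d_fake, EPS)))


def generator_loss(d_fake):
    """Low when the discriminator scores generated images near 1 (fooled)."""
    return -math.log(max(d_fake, EPS))


# A confident, correct discriminator has a lower loss than an unsure one.
confident = discriminator_loss(d_real=0.9, d_fake=0.1)
unsure = discriminator_loss(d_real=0.5, d_fake=0.5)

# The generator's loss falls as its output fools the discriminator more often.
fooling = generator_loss(d_fake=0.9)
caught = generator_loss(d_fake=0.1)
```

The two losses pull in opposite directions on `d_fake`, which is the competition described above.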
Diffusion Model-Based Generation
Diffusion models now match or exceed GANs in image quality on many benchmarks and are increasingly used for deepfake creation:
- Stable Diffusion inpainting: Replaces a face by masking the face region and regenerating it with a specific identity.
- ControlNet: Conditions generation on face landmarks, preserving pose while changing identity.
- IP-Adapter: Transfers identity from a reference image to a generated image.
- InstantID / PhotoMaker: Single-image identity transfer with high fidelity.
Lip Sync Deepfakes
A specialized category where the mouth region is modified to match different audio:
- Wav2Lip: Given audio and a face video, generates realistic lip movements that match the audio
- Video Rewrite: Earlier approach that warps mouth regions to match target phonemes
- Detection challenge: Only a small region (the mouth) is modified, which makes detection harder than for full face swaps.
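One way to reason about lip-audio consistency is to check whether mouth movement tracks audio loudness over time. This is a minimal sketch of a simple heuristic, assuming two per-frame signals that are already extracted and aligned: `mouth_opening` (for example, the distance between lip landmarks) and `audio_energy` (for example, per-frame loudness). Both names and the sample values are hypothetical, and practical detectors typically rely on learned audio-visual features rather than a raw correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Illustrative per-frame values, not real measurements.
audio_energy = [0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.8, 0.1]
mouth_synced = [0.0, 0.6, 0.7, 0.1, 0.0, 0.5, 0.6, 0.0]  # opens when audio is loud
mouth_off = [0.6, 0.0, 0.1, 0.7, 0.6, 0.1, 0.0, 0.6]     # opens when audio is quiet

synced_score = pearson(mouth_synced, audio_energy)
off_score = pearson(mouth_off, audio_energy)
```

A high score suggests the mouth follows the audio, while a low or negative score flags a mismatch worth closer inspection. A well-made lip sync fake is designed to score high on exactly this measure, so the heuristic is better suited to catching dubbed or mismatched audio than to catching a good Wav2Lip output.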
Common Artifacts by Generation Method
| Method | Common Artifacts | Detection Approach |
|---|---|---|
| Autoencoder | Blending boundaries, color mismatch, blur | Edge analysis, color consistency |
| GAN | Spectral artifacts, checkerboard patterns, asymmetry | Frequency analysis, symmetry checks |
| Diffusion | Subtle texture inconsistencies, identity bleed | Texture analysis, ML classifiers |
| Lip sync | Mouth boundary artifacts, teeth rendering issues | Lip-audio sync analysis |
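The frequency analysis entry for GANs can be illustrated on a one-dimensional signal, such as a single row of pixels. This is a minimal sketch, assuming a plain discrete Fourier transform and a hypothetical `high_frequency_ratio` score; real detectors typically work on two-dimensional spectra of whole images and feed the result to a classifier.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the one-sided discrete Fourier transform."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def high_frequency_ratio(signal, cutoff=0.5):
    """Share of non-DC spectral power above the cutoff fraction of the band."""
    power = [m * m for m in dft_magnitudes(signal)[1:]]  # skip the DC term
    total = sum(power)
    if total == 0:
        return 0.0
    start = int(len(power) * cutoff)
    return sum(power[start:]) / total


# A smooth row of pixels: one slow sine wave across 32 samples.
smooth = [math.sin(2 * math.pi * t / 32) for t in range(32)]
# The same row with an alternating +/- pattern, mimicking a checkerboard artifact.
checkerboard = [s + 0.3 * (-1) ** t for t, s in enumerate(smooth)]

smooth_ratio = high_frequency_ratio(smooth)
checker_ratio = high_frequency_ratio(checkerboard)
```

The alternating pattern concentrates power at the highest frequency, so `checker_ratio` is far larger than `smooth_ratio`. Any threshold for flagging an image would need to be calibrated on real and generated data.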