Advanced Computer Vision Topics
Beyond classification, detection, and segmentation lies a world of generative models, video understanding, 3D vision, and multimodal systems that push the boundaries of what machines can see and create.
Generative Models for Images
GANs (Generative Adversarial Networks)
GANs consist of two neural networks competing against each other: a generator that creates fake images, and a discriminator that tries to tell real images from fakes. Through this adversarial training, the generator learns to produce increasingly realistic images.
- StyleGAN: Generates photorealistic faces with control over style attributes (age, hair, expression)
- CycleGAN: Translates images between domains (photos to paintings, horses to zebras)
- Pix2Pix: Paired image-to-image translation (sketches to photos, day to night)
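The adversarial loop described above can be sketched in a few lines of PyTorch on toy 1-D data. This is an illustrative sketch, not a real image GAN: the network sizes, learning rates, and "real" data cluster are all made up.

```python
import torch
import torch.nn as nn

# Toy GAN sketch: the generator maps noise to samples,
# the discriminator scores real vs. fake samples.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(32, 2) * 0.5 + 2.0   # made-up "real" data cluster
noise = torch.randn(32, 8)

# Discriminator step: push D(real) toward 1 and D(fake) toward 0.
fake = G(noise).detach()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator so D(G(z)) moves toward 1.
g_loss = bce(D(G(noise)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
print(f"d_loss={d_loss.item():.3f}  g_loss={g_loss.item():.3f}")
```

Real GANs alternate these two steps over many batches; the equilibrium, if training is stable, is a generator whose samples the discriminator can no longer distinguish from real data.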
Diffusion Models
Diffusion models learn to generate images by gradually removing noise. They start with pure noise and iteratively denoise it into a coherent image, guided by text prompts or other conditions. They have largely supplanted GANs for image generation due to their superior quality and stability.
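The forward (noising) half of this process is simple to write down; a minimal NumPy sketch with a typical linear noise schedule (the schedule values are common choices, not taken from any specific model):

```python
import numpy as np

# Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
# A trained model learns to predict eps so it can run this in reverse.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal retention

x0 = rng.standard_normal((8, 8))          # stand-in for an image

def noise_to_step(x0, t):
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

x_early, x_late = noise_to_step(x0, 10), noise_to_step(x0, T - 1)
# Early steps stay close to the image; by the final step almost no
# signal remains (alpha_bar is near zero), leaving nearly pure noise.
print(np.corrcoef(x0.ravel(), x_early.ravel())[0, 1], alpha_bar[T - 1])
```

Generation runs this in reverse: starting from pure noise, the model's noise prediction is subtracted step by step until a clean image remains.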
VAEs (Variational Autoencoders)
VAEs learn a compressed latent representation of images and can generate new images by sampling from the learned latent space. They are used in the latent space of many diffusion models.
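Two ideas make VAE training work: the reparameterization trick, which keeps sampling differentiable, and a KL-divergence term that keeps the latent space close to a standard normal. A minimal sketch (the `mu` and `log_var` values stand in for encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2, 0.0, 1.0])       # hypothetical encoder mean
log_var = np.array([-1.0, -0.5, 0.0, -2.0])  # hypothetical encoder log-variance

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
# so gradients can flow through mu and log_var despite the sampling.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL divergence of N(mu, sigma^2) from N(0, I): the VAE regularizer.
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(z.shape, kl)
```

Because the latent space is regularized toward N(0, I), decoding a fresh sample `z ~ N(0, I)` yields a new image rather than garbage.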
Image Generation
| Model | Creator | Key Feature |
|---|---|---|
| Stable Diffusion | Stability AI | Open-source, runs locally, highly customizable with LoRA and ControlNet |
| DALL-E | OpenAI | Strong text understanding, inpainting and outpainting capabilities |
| Midjourney | Midjourney | Exceptional aesthetic quality, strong artistic style generation |
| Imagen / Gemini | Google | High photorealism and text rendering in images |
Video Understanding
Video adds the temporal dimension to computer vision:
- Action Recognition: Classifying activities in video clips (running, cooking, dancing). Models: SlowFast, Video Swin Transformer.
- Object Tracking: Following objects across video frames. Approaches include SORT, DeepSORT, and ByteTrack.
- Video Segmentation: Tracking and segmenting objects frame by frame. SAM 2 extends SAM to video.
- Video Generation: Creating video from text descriptions (Sora, Runway Gen-3).
- Temporal Understanding: Understanding events, causality, and narrative in video sequences.
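The association step at the heart of SORT-style tracking can be sketched with plain IoU matching. This is deliberately simplified: real trackers add a Kalman filter for motion prediction and Hungarian matching instead of this greedy loop, and the boxes here are made up.

```python
# Match new-frame detections to existing tracks by bounding-box overlap.
def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

tracks = {1: (10, 10, 50, 50), 2: (100, 100, 140, 140)}
detections = [(12, 11, 52, 49), (200, 200, 240, 240)]

matches, unmatched = [], []
for det in detections:
    best = max(tracks, key=lambda tid: iou(tracks[tid], det))
    if iou(tracks[best], det) > 0.3:
        matches.append((best, det))   # same object: update the track
    else:
        unmatched.append(det)         # no overlap: spawn a new track
print(matches, unmatched)
```

Here the first detection overlaps track 1 heavily and is matched to it, while the second overlaps nothing and would start a new track.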
3D Computer Vision
- Depth Estimation: Predicting depth from a single image (monocular) or multiple images (stereo). Models like MiDaS and Depth Anything.
- Point Cloud Processing: Working with 3D point data from LiDAR sensors. PointNet and its successors process raw 3D coordinates.
- 3D Reconstruction: Building 3D models from 2D images. Neural Radiance Fields (NeRF) and Gaussian Splatting create photorealistic 3D scenes from photos.
- Multi-view Geometry: Using multiple camera views to understand 3D structure (Structure from Motion, SLAM).
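These tasks connect through basic camera geometry: a depth map plus camera intrinsics is enough to back-project every pixel into a 3D point cloud using the pinhole model, X = (u - cx)·Z/fx and Y = (v - cy)·Z/fy. A sketch with made-up intrinsics and a synthetic depth map:

```python
import numpy as np

fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0   # hypothetical camera intrinsics
depth = np.full((480, 640), 2.0)              # synthetic depth: flat wall, 2 m away

# Pixel grid: v is the row (y) index, u is the column (x) index.
v, u = np.indices(depth.shape)
z = depth
x = (u - cx) * z / fx
y = (v - cy) * z / fy
points = np.stack([x, y, z], axis=-1).reshape(-1, 3)

print(points.shape)   # one 3D point per pixel
```

The pixel at the principal point (cx, cy) maps to (0, 0, Z) on the camera's optical axis, which is a quick sanity check for the intrinsics.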
Pose Estimation
Detecting and tracking human body keypoints (joints, limbs) in images and video:
- 2D Pose: Estimating joint positions in the image plane. Models: OpenPose, MediaPipe, HRNet.
- 3D Pose: Estimating 3D joint positions from 2D images.
- Applications: Sports analytics, fitness tracking, sign language recognition, animation, and motion capture.
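Many of these applications reduce pose output to joint angles. A small sketch of computing the angle at a joint from three 2D keypoints (the coordinates here are invented; real keypoints would come from a model like MediaPipe):

```python
import numpy as np

def joint_angle(a, b, c):
    # Angle at keypoint b, in degrees, between rays b->a and b->c.
    ba = np.asarray(a) - np.asarray(b)
    bc = np.asarray(c) - np.asarray(b)
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical shoulder-elbow-wrist triple forming a right angle.
shoulder, elbow, wrist = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
print(joint_angle(shoulder, elbow, wrist))   # 90.0
```

Tracking such angles across frames is the basis of rep counting in fitness apps and form analysis in sports analytics.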
OCR (Optical Character Recognition)
Extracting text from images and documents:
- Traditional OCR: Tesseract, the most widely used open-source OCR engine
- Scene Text Detection: Finding text in natural images (street signs, product labels). Models: EAST, CRAFT.
- Document AI: Understanding document structure, extracting tables, forms, and key-value pairs. Tools: LayoutLM, Donut.
- Handwriting Recognition: Reading handwritten text, a particularly challenging variant.
```python
import easyocr

# Initialize reader (downloads models on first run)
reader = easyocr.Reader(['en'])

# Read text from image
results = reader.readtext("sign.jpg")
for bbox, text, confidence in results:
    print(f"Text: '{text}' (confidence: {confidence:.2f})")
```
Multi-Modal Vision-Language Models
Models that combine visual and textual understanding:
| Model | Capability |
|---|---|
| CLIP (OpenAI) | Learns visual concepts from natural language descriptions. Enables zero-shot image classification by comparing image and text embeddings. |
| LLaVA | Vision-language model that can discuss images, answer visual questions, and follow multimodal instructions. |
| GPT-4V / Claude Vision | Large language models with vision capabilities. Can analyze, describe, and reason about images. |
| Florence-2 | Microsoft's unified vision model handling captioning, detection, segmentation, and OCR. |
```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
labels = ["a cat", "a dog", "a bird", "a car"]

# Zero-shot classification: embed the image and each label,
# then softmax over image-text similarity scores.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)[0]

for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.3f}")
```