# Image Segmentation
Segmentation goes beyond bounding boxes to classify every pixel in an image, creating precise boundaries around objects and regions.
## Types of Segmentation
| Type | Output | Distinguishes Instances? | Example |
|---|---|---|---|
| Semantic | Class label for every pixel | No — all cars are "car" | Road scene parsing (road, sidewalk, building, sky) |
| Instance | Class + unique ID per object | Yes — car-1, car-2, car-3 | Counting objects, tracking individuals |
| Panoptic | Combines semantic + instance | Yes for "things," no for "stuff" | Full scene understanding (sky=stuff, car=thing) |
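The distinction between semantic and instance output is easiest to see on a toy label map. A minimal sketch with hypothetical labels, where a scene contains two separate cars:

```python
import numpy as np

# Toy 4x4 scene containing two separate "car" objects (hypothetical labels)
semantic = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
])  # semantic: every car pixel gets the same class label 1

instance = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
])  # instance: car-1 and car-2 get distinct IDs

print(np.unique(semantic))  # [0 1]   -> one undifferentiated "car" class
print(np.unique(instance))  # [0 1 2] -> two separate car instances
```

Panoptic output would combine both views: unique IDs for countable "things" like the cars, plus class labels for amorphous "stuff" like road and sky.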
## U-Net Architecture
U-Net was originally designed for medical image segmentation and has become one of the most influential segmentation architectures. Its key innovation is the encoder-decoder structure with skip connections:
- Encoder (contracting path): Captures context through downsampling convolutions and pooling
- Decoder (expanding path): Recovers spatial information through upsampling and transposed convolutions
- Skip connections: Connect corresponding encoder and decoder layers, preserving fine-grained spatial details
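The three pieces above can be sketched as a toy two-level U-Net. This is an illustrative miniature, not the original architecture (which stacks several levels and doubles channels at each one); the point is the `torch.cat` that implements the skip connection:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy two-level U-Net illustrating the encoder/decoder/skip pattern."""
    def __init__(self, in_ch=3, num_classes=5):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                        # contracting path
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expanding path
        # The decoder sees upsampled features concatenated with the skip,
        # hence 16 + 16 = 32 input channels.
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        skip = self.enc(x)                    # high-resolution features
        x = self.bottleneck(self.down(skip))  # context at half resolution
        x = self.up(x)                        # recover spatial size
        x = torch.cat([x, skip], dim=1)       # skip connection
        return self.head(self.dec(x))

out = TinyUNet()(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 5, 64, 64])
```

Without the skip connection, the decoder would have to reconstruct fine boundaries purely from the downsampled bottleneck, which is why U-Net masks are noticeably sharper than plain encoder-decoder outputs.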
```python
import torch
import segmentation_models_pytorch as smp

# Using segmentation_models_pytorch for an off-the-shelf U-Net
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,
    classes=5,  # number of segmentation classes
)

# Forward pass
x = torch.randn(1, 3, 256, 256)  # batch of 1 image
output = model(x)
print(output.shape)  # torch.Size([1, 5, 256, 256])
# Each pixel has 5 class logits (apply softmax for probabilities)
```
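The model outputs raw logits, one per class per pixel. A per-pixel class map comes from an argmax over the class dimension; a minimal sketch using a random tensor in place of the model output:

```python
import torch

logits = torch.randn(1, 5, 256, 256)  # stand-in for the model output above
pred = logits.argmax(dim=1)           # per-pixel class index in [0, 5)
print(pred.shape)  # torch.Size([1, 256, 256])
```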
## Mask R-CNN
Mask R-CNN extends Faster R-CNN by adding a segmentation mask branch alongside the existing bounding box and classification branches. For each detected object, it predicts both a bounding box and a pixel-level mask.
```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # confidence threshold

predictor = DefaultPredictor(cfg)
img = cv2.imread("photo.jpg")
outputs = predictor(img)

# outputs["instances"] contains boxes, classes, scores, and masks
masks = outputs["instances"].pred_masks
print(f"Found {len(masks)} instances")
```
## DeepLab
DeepLab uses atrous (dilated) convolutions and Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context without losing resolution. DeepLabV3+ adds a decoder module for sharper boundaries.
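A dilated convolution enlarges the receptive field without downsampling, and ASPP runs several of them in parallel at different rates. A simplified sketch of the idea (the real ASPP also includes a 1x1 branch, image-level pooling, and batch norm):

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Simplified Atrous Spatial Pyramid Pooling: parallel 3x3 convs with
    different dilation rates, fused by a 1x1 projection."""
    def __init__(self, in_ch=64, out_ch=64, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding = dilation keeps the spatial size unchanged for 3x3 kernels
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(1, 64, 32, 32)  # stand-in for a backbone feature map
out = MiniASPP()(feat)
print(out.shape)  # torch.Size([1, 64, 32, 32]) — resolution preserved
```

Each branch sees the same feature map at a different effective scale (a rate-18 branch covers a 37x37 window with only 9 weights), which is how DeepLab captures both local detail and wide context at full resolution.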
## SAM (Segment Anything Model)
SAM by Meta AI is a foundation model for segmentation. Trained on 11 million images with 1 billion masks, it can segment any object in any image with various types of prompts:
- Point prompts: Click on an object to segment it
- Box prompts: Draw a bounding box around the object
- Text prompts: Describe what to segment (not in the base model; supported via extensions such as Grounded-SAM)
- Automatic mode: Segment everything in the image
```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Set image (SAM expects RGB; OpenCV loads BGR)
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Segment with a point prompt
input_point = np.array([[500, 375]])  # click coordinates (x, y)
input_label = np.array([1])           # 1 = foreground, 0 = background
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # return multiple candidate masks
)
print(f"Generated {len(masks)} masks")
print(f"Best mask score: {scores.max():.3f}")
```
## Medical Image Segmentation
Medical imaging is one of the most impactful applications of segmentation:
- Tumor detection: Segmenting tumors in MRI and CT scans
- Organ segmentation: Delineating organs for surgical planning
- Cell segmentation: Identifying and counting cells in microscopy images
- Retinal analysis: Segmenting blood vessels and lesions in fundus images
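Medical segmentation results are commonly evaluated (and models often trained) with the Dice coefficient, which measures overlap between the predicted and ground-truth masks. A minimal NumPy sketch on toy masks:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks; 1.0 = perfect overlap."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = np.zeros((8, 8)); pred[2:6, 2:6] = 1      # predicted region (16 px)
target = np.zeros((8, 8)); target[3:7, 3:7] = 1  # ground truth (16 px, 9 overlap)
print(dice_coefficient(pred, target))  # 2*9 / (16 + 16) = 0.5625
```

Dice is preferred over plain pixel accuracy here because structures like tumors occupy a tiny fraction of the scan, so a model predicting "all background" can score high accuracy while segmenting nothing.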