Intermediate

Image Segmentation

Segmentation goes beyond bounding boxes to classify every pixel in an image, creating precise boundaries around objects and regions.

Types of Segmentation

  • Semantic: a class label for every pixel; does not distinguish instances (all cars are "car"). Example: road scene parsing (road, sidewalk, building, sky)
  • Instance: a class plus a unique ID per object; distinguishes instances (car-1, car-2, car-3). Example: counting objects, tracking individuals
  • Panoptic: combines semantic and instance; distinguishes "things" but not "stuff". Example: full scene understanding (sky = stuff, car = thing)
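
The three output formats can be sketched with NumPy arrays. The shapes and values below are illustrative stand-ins, not output from any particular library:

```python
import numpy as np

H, W = 4, 6  # tiny image for illustration

# Semantic segmentation: one class ID per pixel (0 = road, 1 = car)
semantic = np.zeros((H, W), dtype=np.int64)
semantic[1:3, 2:5] = 1  # a region of "car" pixels

# Instance segmentation: one boolean mask per detected object,
# plus a class label for each instance
instance_masks = np.zeros((2, H, W), dtype=bool)
instance_masks[0, 1:3, 2:4] = True   # car-1
instance_masks[1, 1:3, 4:5] = True   # car-2
instance_classes = np.array([1, 1])  # both instances are class "car"

# Panoptic segmentation: a segment ID per pixel plus a table
# mapping each segment ID to (class, is_thing)
panoptic = np.zeros((H, W), dtype=np.int64)  # segment 0 = road (stuff)
panoptic[1:3, 2:4] = 1  # segment 1 = car-1 (thing)
panoptic[1:3, 4:5] = 2  # segment 2 = car-2 (thing)
segments = {0: ("road", False), 1: ("car", True), 2: ("car", True)}

print(semantic.shape, instance_masks.shape)  # (4, 6) (2, 4, 6)
```

Note how semantic output is a single (H, W) map, while instance output is a stack of per-object masks: that stack is exactly what detectors like Mask R-CNN return.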

U-Net Architecture

U-Net was originally designed for medical image segmentation and has become one of the most influential segmentation architectures. Its key innovation is the encoder-decoder structure with skip connections:

  • Encoder (contracting path): Captures context through downsampling convolutions and pooling
  • Decoder (expanding path): Recovers spatial information through upsampling and transposed convolutions
  • Skip connections: Connect corresponding encoder and decoder layers, preserving fine-grained spatial details

Python - Simple U-Net (PyTorch)
import torch
import torch.nn as nn
import segmentation_models_pytorch as smp

# Using segmentation_models_pytorch for easy U-Net
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,
    classes=5  # Number of segmentation classes
)

# Forward pass
x = torch.randn(1, 3, 256, 256)  # Batch of 1 image
output = model(x)
print(output.shape)  # torch.Size([1, 5, 256, 256])
# Each pixel gets 5 raw class scores (logits), not probabilities
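
The raw output is logits; a softmax over the class dimension turns them into per-pixel probabilities, and an argmax gives the final class map. A minimal sketch, using random logits as a stand-in for real model output:

```python
import torch

logits = torch.randn(1, 5, 256, 256)  # stand-in for model(x)

probs = logits.softmax(dim=1)  # per-pixel class probabilities
pred = logits.argmax(dim=1)    # (1, 256, 256) map of class IDs

print(pred.shape)  # torch.Size([1, 256, 256])
```

The `pred` tensor is what you would colorize to visualize the segmentation.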

Mask R-CNN

Mask R-CNN extends Faster R-CNN by adding a segmentation mask branch alongside the existing bounding box and classification branches. For each detected object, it predicts both a bounding box and a pixel-level mask.

Python - Mask R-CNN with Detectron2
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
import cv2

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
img = cv2.imread("photo.jpg")
outputs = predictor(img)

# outputs["instances"] contains boxes, classes, scores, and masks
masks = outputs["instances"].pred_masks
print(f"Found {len(masks)} instances")
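Here `pred_masks` is a boolean tensor of shape (N, H, W), one mask per instance. A small sketch of typical post-processing (per-instance areas, picking the largest object, merging masks), using synthetic masks in place of real Detectron2 output:

```python
import numpy as np

# Synthetic stand-in for outputs["instances"].pred_masks.cpu().numpy()
masks = np.zeros((3, 100, 100), dtype=bool)
masks[0, 10:30, 10:30] = True   # 400-pixel instance
masks[1, 50:90, 50:90] = True   # 1600-pixel instance
masks[2, 0:5, 0:5] = True       # 25-pixel instance

areas = masks.sum(axis=(1, 2))  # pixel count per instance: 400, 1600, 25
largest = int(areas.argmax())   # index of the biggest object -> 1
combined = masks.any(axis=0)    # union of all instance masks, shape (H, W)

print(largest)  # 1
```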

DeepLab

DeepLab uses atrous (dilated) convolutions and Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context without losing resolution. DeepLabV3+ adds a decoder module for sharper boundaries.
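
A dilated 3×3 convolution sees wider context at the same parameter cost and, with matching padding, keeps the spatial resolution. The sketch below is a simplified ASPP-style module (parallel dilated branches, concatenated and fused), not the exact DeepLab implementation; the class name and dilation rates are illustrative:

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates,
    concatenated and fused by a 1x1 conv (simplified ASPP sketch)."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12)):
        super().__init__()
        # padding = dilation keeps output size equal to input size
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

aspp = MiniASPP(64, 32)
x = torch.randn(1, 64, 32, 32)
out = aspp(x)
print(out.shape)  # torch.Size([1, 32, 32, 32])
```

Each branch covers a different effective receptive field (a rate-6 3×3 kernel spans 13 pixels), which is how DeepLab captures multi-scale context without downsampling.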

SAM (Segment Anything Model)

SAM by Meta AI is a foundation model for segmentation. Trained on the SA-1B dataset of 11 million images and over 1 billion masks, it can segment objects in virtually any image given various types of prompts:

  • Point prompts: Click on an object to segment it
  • Box prompts: Draw a bounding box around the object
  • Text prompts: Describe what to segment (supported through extensions such as Grounded-SAM rather than the base model)
  • Automatic mode: Segment everything in the image

Python - SAM Usage
from segment_anything import SamPredictor, sam_model_registry
import numpy as np
import cv2

# Load SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Set image
image = cv2.imread("photo.jpg")
predictor.set_image(image)

# Segment with point prompt
input_point = np.array([[500, 375]])  # Click coordinates
input_label = np.array([1])           # 1 = foreground

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True
)

print(f"Generated {len(masks)} masks")
print(f"Best mask score: {scores.max():.3f}")

Medical Image Segmentation

Medical imaging is one of the most impactful applications of segmentation:

  • Tumor detection: Segmenting tumors in MRI and CT scans
  • Organ segmentation: Delineating organs for surgical planning
  • Cell segmentation: Identifying and counting cells in microscopy images
  • Retinal analysis: Segmenting blood vessels and lesions in fundus images
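
Medical segmentation is usually evaluated, and often trained, with overlap metrics such as the Dice coefficient, 2|A∩B| / (|A| + |B|). A minimal NumPy version for binary masks (the epsilon guards against empty masks):

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

a = np.zeros((8, 8)); a[2:6, 2:6] = 1   # 16-pixel square
b = np.zeros((8, 8)); b[4:8, 2:6] = 1   # 16 pixels, half overlapping a

print(dice(a, a))  # 1.0 (perfect overlap)
print(dice(a, b))  # 0.5 (8-pixel intersection, 32 pixels total)
```

Dice is preferred over plain pixel accuracy here because the structures of interest (tumors, vessels) often occupy a tiny fraction of the image.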
💡
Note: Medical image segmentation requires extra care with data privacy, annotation quality (typically done by medical professionals), and thorough validation before clinical deployment.
Key takeaway: Segmentation provides pixel-level understanding of images. U-Net remains a strong baseline, Mask R-CNN handles instance segmentation, and SAM is a game-changer as a foundation model that can segment anything. Choose the right type (semantic, instance, panoptic) based on your application needs.