Intermediate

Image Segmentation

Segmentation goes beyond bounding boxes to classify every pixel in an image, creating precise boundaries around objects and regions.

Types of Segmentation

  • Semantic: a class label for every pixel; does not distinguish instances (all cars are "car"). Example: road scene parsing (road, sidewalk, building, sky)
  • Instance: a class plus a unique ID per object; distinguishes instances (car-1, car-2, car-3). Example: counting objects, tracking individuals
  • Panoptic: combines semantic and instance; distinguishes "things" but not "stuff". Example: full scene understanding (sky = stuff, car = thing)
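
The three output formats can be sketched with NumPy arrays. The shapes and values below are illustrative stand-ins, not output from any particular library:

```python
import numpy as np

H, W = 4, 6  # tiny image for illustration

# Semantic segmentation: one class ID per pixel (0 = road, 1 = car)
semantic = np.zeros((H, W), dtype=np.int64)
semantic[1:3, 2:5] = 1  # a region of "car" pixels

# Instance segmentation: one boolean mask per detected object,
# plus a class label for each instance
instance_masks = np.zeros((2, H, W), dtype=bool)
instance_masks[0, 1:3, 2:4] = True   # car-1
instance_masks[1, 1:3, 4:5] = True   # car-2
instance_classes = np.array([1, 1])  # both instances are class "car"

# Panoptic segmentation: a segment ID per pixel plus a table
# mapping each segment ID to (class, is_thing)
panoptic = np.zeros((H, W), dtype=np.int64)  # segment 0 = road (stuff)
panoptic[1:3, 2:4] = 1  # segment 1 = car-1 (thing)
panoptic[1:3, 4:5] = 2  # segment 2 = car-2 (thing)
segments = {0: ("road", False), 1: ("car", True), 2: ("car", True)}

print(semantic.shape, instance_masks.shape)  # (4, 6) (2, 4, 6)
```

Note how semantic output is a single (H, W) map, while instance output is a stack of per-object masks: that stack is exactly what detectors like Mask R-CNN return.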

U-Net Architecture

U-Net was originally designed for medical image segmentation and has become one of the most influential segmentation architectures. Its key innovation is the encoder-decoder structure with skip connections:

  • Encoder (contracting path): Captures context through downsampling convolutions and pooling
  • Decoder (expanding path): Recovers spatial information through upsampling and transposed convolutions
  • Skip connections: Connect corresponding encoder and decoder layers, preserving fine-grained spatial details

Python - Simple U-Net (PyTorch)
import torch
import torch.nn as nn
import segmentation_models_pytorch as smp

# Using segmentation_models_pytorch for easy U-Net
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,
    classes=5  # Number of segmentation classes
)

# Forward pass
x = torch.randn(1, 3, 256, 256)  # Batch of 1 image
output = model(x)
print(output.shape)  # torch.Size([1, 5, 256, 256])
# Each pixel gets 5 raw class scores (logits), not probabilities
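
The raw output is logits; a softmax over the class dimension turns them into per-pixel probabilities, and an argmax gives the final class map. A minimal sketch, using random logits as a stand-in for real model output:

```python
import torch

logits = torch.randn(1, 5, 256, 256)  # stand-in for model(x)

probs = logits.softmax(dim=1)  # per-pixel class probabilities
pred = logits.argmax(dim=1)    # (1, 256, 256) map of class IDs

print(pred.shape)  # torch.Size([1, 256, 256])
```

The `pred` tensor is what you would colorize to visualize the segmentation.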

Mask R-CNN

Mask R-CNN extends Faster R-CNN by adding a segmentation mask branch alongside the existing bounding box and classification branches. For each detected object, it predicts both a bounding box and a pixel-level mask.

Python - Mask R-CNN with Detectron2
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
import cv2

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
img = cv2.imread("photo.jpg")
outputs = predictor(img)

# outputs["instances"] contains boxes, classes, scores, and masks
masks = outputs["instances"].pred_masks
print(f"Found {len(masks)} instances")
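Here `pred_masks` is a boolean tensor of shape (N, H, W), one mask per instance. A small sketch of typical post-processing (per-instance areas, picking the largest object, merging masks), using synthetic masks in place of real Detectron2 output:

```python
import numpy as np

# Synthetic stand-in for outputs["instances"].pred_masks.cpu().numpy()
masks = np.zeros((3, 100, 100), dtype=bool)
masks[0, 10:30, 10:30] = True   # 400-pixel instance
masks[1, 50:90, 50:90] = True   # 1600-pixel instance
masks[2, 0:5, 0:5] = True       # 25-pixel instance

areas = masks.sum(axis=(1, 2))  # pixel count per instance: 400, 1600, 25
largest = int(areas.argmax())   # index of the biggest object -> 1
combined = masks.any(axis=0)    # union of all instance masks, shape (H, W)

print(largest)  # 1
```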

DeepLab

DeepLab uses atrous (dilated) convolutions and Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context without losing resolution. DeepLabV3+ adds a decoder module for sharper boundaries.
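
A dilated 3×3 convolution sees wider context at the same parameter cost and, with matching padding, keeps the spatial resolution. The sketch below is a simplified ASPP-style module (parallel dilated branches, concatenated and fused), not the exact DeepLab implementation; the class name and dilation rates are illustrative:

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates,
    concatenated and fused by a 1x1 conv (simplified ASPP sketch)."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12)):
        super().__init__()
        # padding = dilation keeps output size equal to input size
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

aspp = MiniASPP(64, 32)
x = torch.randn(1, 64, 32, 32)
out = aspp(x)
print(out.shape)  # torch.Size([1, 32, 32, 32])
```

Each branch covers a different effective receptive field (a rate-6 3×3 kernel spans 13 pixels), which is how DeepLab captures multi-scale context without downsampling.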

SAM (Segment Anything Model)

SAM by Meta AI is a foundation model for segmentation. Trained on the SA-1B dataset of 11 million images and over 1 billion masks, it can segment objects in virtually any image given various types of prompts:

  • Point prompts: Click on an object to segment it
  • Box prompts: Draw a bounding box around the object
  • Text prompts: Describe what to segment (supported through extensions such as Grounded-SAM rather than the base model)
  • Automatic mode: Segment everything in the image

Python - SAM Usage
from segment_anything import SamPredictor, sam_model_registry
import numpy as np
import cv2

# Load SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Set image
image = cv2.imread("photo.jpg")
predictor.set_image(image)

# Segment with point prompt
input_point = np.array([[500, 375]])  # Click coordinates
input_label = np.array([1])           # 1 = foreground

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True
)

print(f"Generated {len(masks)} masks")
print(f"Best mask score: {scores.max():.3f}")

Medical Image Segmentation

Medical imaging is one of the most impactful applications of segmentation:

  • Tumor detection: Segmenting tumors in MRI and CT scans
  • Organ segmentation: Delineating organs for surgical planning
  • Cell segmentation: Identifying and counting cells in microscopy images
  • Retinal analysis: Segmenting blood vessels and lesions in fundus images
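
Medical segmentation is usually evaluated, and often trained, with overlap metrics such as the Dice coefficient, 2|A∩B| / (|A| + |B|). A minimal NumPy version for binary masks (the epsilon guards against empty masks):

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

a = np.zeros((8, 8)); a[2:6, 2:6] = 1   # 16-pixel square
b = np.zeros((8, 8)); b[4:8, 2:6] = 1   # 16 pixels, half overlapping a

print(dice(a, a))  # 1.0 (perfect overlap)
print(dice(a, b))  # 0.5 (8-pixel intersection, 32 pixels total)
```

Dice is preferred over plain pixel accuracy here because the structures of interest (tumors, vessels) often occupy a tiny fraction of the image.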
💡
Note: Medical image segmentation requires extra care with data privacy, annotation quality (typically done by medical professionals), and thorough validation before clinical deployment.
Key takeaway: Segmentation provides pixel-level understanding of images. U-Net remains a strong baseline, Mask R-CNN handles instance segmentation, and SAM is a game-changer as a foundation model that can segment anything. Choose the right type (semantic, instance, panoptic) based on your application needs.