Segmentation Questions
These 10 questions cover image segmentation concepts critical for CV roles in medical imaging, autonomous driving, robotics, and augmented reality. Segmentation requires pixel-level understanding, making it one of the most technically demanding CV tasks.
Q1: What is the difference between semantic, instance, and panoptic segmentation?
| Type | What It Does | Example Output | Key Models |
|---|---|---|---|
| Semantic Segmentation | Assigns a class label to every pixel. Does NOT distinguish between instances of the same class | All "person" pixels get the same label, all "car" pixels get the same label | FCN, DeepLab, SegFormer |
| Instance Segmentation | Detects each object instance and provides a pixel mask for each. Only for "thing" classes (countable objects) | Person 1 gets mask A, Person 2 gets mask B. Background is unlabeled | Mask R-CNN, YOLACT, SOLOv2 |
| Panoptic Segmentation | Unifies both: every pixel gets a class label AND instance ID for "thing" classes. "Stuff" classes (sky, road) get only class labels | Person 1 (ID=1), Person 2 (ID=2), sky (no instance), road (no instance) | Panoptic FPN, MaskFormer, Mask2Former |
"Things" vs "stuff": Things are countable objects (person, car, dog). Stuff is amorphous regions (sky, grass, road). Panoptic segmentation distinguishes instances for things but not stuff.
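One common way to represent a panoptic map in code is to pack the class label and instance ID into a single integer per pixel (the divisor of 1000 below follows the COCO panoptic convention, but is an assumption here; datasets vary):

```python
import numpy as np

# Sketch of a packed panoptic label map: each pixel stores
# class_id * LABEL_DIVISOR + instance_id, so "stuff" pixels share
# instance_id 0 while "thing" pixels get unique per-instance IDs.
LABEL_DIVISOR = 1000  # illustrative divisor; check your dataset's convention

def encode_panoptic(class_map, instance_map):
    """Pack class and instance maps into one integer map."""
    return class_map * LABEL_DIVISOR + instance_map

def decode_panoptic(panoptic_map):
    """Recover (class_map, instance_map) from a packed map."""
    return panoptic_map // LABEL_DIVISOR, panoptic_map % LABEL_DIVISOR

# Two "person" pixels (class 1, instances 1 and 2) and two "road"
# pixels (class 23, no instance ID since road is "stuff").
class_map = np.array([[1, 1], [23, 23]])
instance_map = np.array([[1, 2], [0, 0]])
pan = encode_panoptic(class_map, instance_map)   # [[1001, 1002], [23000, 23000]]
```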
Modern unification: Mask2Former (2022) uses a single architecture for all three tasks. It treats every segment (thing or stuff) as a masked attention query, achieving state-of-the-art on semantic, instance, and panoptic benchmarks with the same model.
Q2: Explain the U-Net architecture and why it works well for medical image segmentation.
U-Net is an encoder-decoder architecture with skip connections that form a U-shape:
- Encoder (contracting path): Series of 3x3 conv + ReLU + 2x2 max pool blocks. Doubles channels and halves spatial resolution at each level. Captures context and "what" is in the image.
- Decoder (expanding path): 2x2 transposed convolutions (or bilinear upsample + 1x1 conv) to increase spatial resolution. Followed by 3x3 conv blocks.
- Skip connections: Concatenate encoder features with decoder features at matching resolutions. This is the critical innovation — it provides fine-grained spatial detail from early layers to the decoder.
Why it excels for medical imaging:
- Works with limited data: Skip connections enable learning from very small datasets (hundreds of images) because they preserve spatial information that would otherwise be lost through pooling.
- Precise boundaries: The concatenation of encoder features provides pixel-level localization, critical for medical applications where tumor boundaries must be exact.
- Heavy augmentation compatibility: U-Net was designed with elastic deformations for data augmentation, which is particularly effective for medical tissues.
```python
import torch
import torch.nn as nn

class UNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        # Encoder
        self.enc1 = self._block(in_channels, 64)
        self.enc2 = self._block(64, 128)
        self.enc3 = self._block(128, 256)
        self.enc4 = self._block(256, 512)
        self.pool = nn.MaxPool2d(2)
        # Bottleneck
        self.bottleneck = self._block(512, 1024)
        # Decoder
        self.up4 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
        self.dec4 = self._block(1024, 512)  # 512 + 512 from skip
        self.up3 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
        self.dec3 = self._block(512, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.dec2 = self._block(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = self._block(128, 64)
        self.out = nn.Conv2d(64, num_classes, kernel_size=1)

    def _block(self, in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Encoder
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        # Bottleneck
        b = self.bottleneck(self.pool(e4))
        # Decoder with skip connections
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out(d1)  # (B, num_classes, H, W)
```
Q3: How does Mask R-CNN extend Faster R-CNN for instance segmentation?
Mask R-CNN adds a parallel mask prediction branch to Faster R-CNN's existing classification and box regression branches:
- Backbone + FPN: Same as Faster R-CNN — ResNet + FPN produces multi-scale features
- RPN: Same — generates region proposals
- RoI Align: Critical upgrade from RoI Pooling. Uses bilinear interpolation instead of quantization, eliminating misalignment between the RoI and extracted features. This is essential for pixel-accurate masks.
- Mask branch: A small FCN (4 conv layers + 1 deconv layer) that outputs a 28x28 binary mask per class for each RoI. The mask branch predicts K masks (one per class), and the classification branch selects which mask to use.
Key design decisions:
- Decoupled mask and classification: The mask branch predicts masks independently for each class (no inter-class competition). This avoids the problem of masks bleeding between classes.
- Loss: Multi-task loss = L_cls + L_box + L_mask. Mask loss is binary cross-entropy applied only to the mask for the predicted class (not all K masks).
- RoI Align matters: Replacing RoI Pooling with RoI Align improved mask AP by ~3 points, showing that pixel-level alignment is crucial for segmentation.
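The key idea behind RoI Align — bilinear sampling at exact fractional coordinates, with no quantization — can be sketched with `grid_sample` (a toy one-sample-per-bin version for clarity, not the production `torchvision.ops.roi_align`):

```python
import torch
import torch.nn.functional as F

def roi_align_simple(feat, box, output_size=7):
    """Toy RoI Align: bilinearly sample output_size x output_size points
    inside `box` = (x1, y1, x2, y2) in feature-map coordinates.
    One sampling point per bin (sampling_ratio=1) for simplicity."""
    _, _, H, W = feat.shape
    x1, y1, x2, y2 = box
    # Bin centers at fractional coordinates -- no rounding, unlike RoI Pooling
    xs = x1 + (torch.arange(output_size) + 0.5) * (x2 - x1) / output_size
    ys = y1 + (torch.arange(output_size) + 0.5) * (y2 - y1) / output_size
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    # Normalize to [-1, 1] for grid_sample (align_corners=True convention)
    grid = torch.stack([2 * gx / (W - 1) - 1, 2 * gy / (H - 1) - 1], dim=-1)
    return F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)

# 4x4 feature map with feat[0, 0, y, x] = 4*y + x
feat = torch.arange(16.0).view(1, 1, 4, 4)
pooled = roi_align_simple(feat, (0.5, 0.5, 2.5, 2.5), output_size=2)
# bin centers land exactly on pixels (1,1), (2,1), (1,2), (2,2)
# -> pooled[0, 0] == [[5., 6.], [9., 10.]]
```

RoI Pooling would snap those fractional bin boundaries to integers first, which is exactly the misalignment RoI Align removes.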
Variants: PointRend adds a refinement module that treats mask edge pixels as a point cloud for higher-resolution boundaries. Cascade Mask R-CNN uses multiple refinement stages for better box and mask quality.
Q4: What loss functions are used for segmentation? Compare Dice loss vs cross-entropy.
| Loss | Formula | Strengths | Weaknesses |
|---|---|---|---|
| Cross-Entropy | -sum(y * log(p)) per pixel | Standard, well-understood, smooth gradients | Dominated by majority class in imbalanced data (e.g., 95% background) |
| Dice Loss | 1 - 2\|P intersection G\| / (\|P\| + \|G\|) | Directly optimizes overlap. Handles class imbalance naturally since it measures relative overlap, not absolute pixel counts | Noisy gradients for very small regions. Can be unstable early in training |
| Focal Loss | -alpha * (1-p)^gamma * log(p) | Down-weights easy pixels. Useful for small, hard-to-segment objects | Extra hyperparameters (alpha, gamma) to tune |
| Tversky Loss | Generalization of Dice with separate FP/FN weights | Control precision-recall trade-off by penalizing FP and FN differently | More hyperparameters. Needs domain knowledge to set FP/FN weights |
Best practice: Combine cross-entropy and Dice loss: L = CE + Dice. Cross-entropy provides stable per-pixel gradients while Dice ensures the model optimizes for overlap. This combination works well across most segmentation tasks.
```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, smooth=1.0):
    """Dice loss for binary segmentation.

    Args:
        pred: (B, 1, H, W) sigmoid output
        target: (B, 1, H, W) binary mask
    """
    pred = pred.flatten(1)      # (B, H*W)
    target = target.flatten(1)
    intersection = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    dice = (2.0 * intersection + smooth) / (union + smooth)
    return 1.0 - dice.mean()

def combined_loss(pred_logits, target):
    """Combined CE + Dice loss."""
    pred_prob = torch.sigmoid(pred_logits)
    ce = F.binary_cross_entropy_with_logits(pred_logits, target)
    dice = dice_loss(pred_prob, target)
    return ce + dice
```
Q5: What is DeepLab and how does atrous (dilated) convolution help segmentation?
DeepLab is a family of semantic segmentation models that use atrous (dilated) convolutions and Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context without losing spatial resolution.
Atrous convolution: Inserts gaps (zeros) between kernel elements, expanding the receptive field without increasing parameters or reducing resolution. A 3x3 kernel with dilation rate 2 has an effective receptive field of 5x5 but only 9 parameters.
ASPP (Atrous Spatial Pyramid Pooling): Applies parallel atrous convolutions at multiple dilation rates (e.g., 6, 12, 18) plus a global average pooling branch. Concatenates results to capture features at multiple scales simultaneously.
Evolution:
- DeepLabV1: Atrous convolutions + CRF post-processing for boundary refinement
- DeepLabV2: Added ASPP for multi-scale context
- DeepLabV3: Improved ASPP with batch norm and image-level features. Removed CRF
- DeepLabV3+: Added an encoder-decoder structure with a simple decoder that uses low-level features for sharper boundaries. This is the most widely used version
Why atrous convolutions beat pooling for segmentation: Pooling reduces resolution, which is destructive for dense prediction. Atrous convolutions increase receptive field while maintaining resolution, preserving fine spatial detail needed for accurate pixel labels.
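A minimal ASPP module along these lines might look as follows (the rates match those quoted above; channel widths and the single-conv branches are illustrative simplifications of the paper's config):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of Atrous Spatial Pyramid Pooling: parallel atrous convs at
    several dilation rates plus a global-average-pooling branch,
    concatenated and fused by a 1x1 conv."""
    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +  # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.pool = nn.Sequential(           # image-level features
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1)
        )
        self.fuse = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        g = self.pool(x)                     # (B, out_ch, 1, 1)
        feats.append(nn.functional.interpolate(
            g, size=x.shape[-2:], mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))
```

Note that every branch preserves the input's spatial resolution (padding equals dilation for the 3x3 branches), which is exactly the property pooling-based context aggregation lacks.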
Q6: What metrics are used to evaluate segmentation models?
| Metric | Formula | Use Case |
|---|---|---|
| Pixel Accuracy | Correct pixels / Total pixels | Simplest metric. Misleading when classes are imbalanced (90% accuracy by predicting all background) |
| mIoU (mean IoU) | Average IoU across all classes: IoU per class = TP / (TP + FP + FN) | The standard metric for semantic segmentation. Used on ADE20K, Cityscapes, PASCAL VOC |
| Dice Score / F1 | 2*TP / (2*TP + FP + FN) | Medical imaging standard. Equivalent to F1 score. More forgiving than IoU for small objects |
| AP (mask) | Average Precision using mask IoU | Instance segmentation (COCO). Same as detection AP but uses mask overlap instead of box overlap |
| PQ (Panoptic Quality) | SQ * RQ where SQ=segmentation quality, RQ=recognition quality | Panoptic segmentation. Decomposes into detection quality (did you find it?) and segmentation quality (how well did you segment it?) |
| Boundary F1 | F1 score computed on boundary pixels only | Evaluates edge quality specifically, important for applications requiring precise contours |
Relationship between Dice and IoU: Dice = 2*IoU / (1 + IoU). Dice is always higher than IoU for the same prediction. A Dice of 0.90 corresponds to an IoU of 0.818. When comparing models, make sure you are using the same metric.
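That relationship is easy to verify numerically on binary masks (the helper below is illustrative; counts are computed directly from the masks):

```python
import torch

def iou_and_dice(pred, target, eps=1e-7):
    """Compute IoU and Dice score for a pair of binary (bool) masks."""
    inter = (pred & target).sum().item()
    union = (pred | target).sum().item()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum().item() + target.sum().item() + eps)
    return iou, dice

pred = torch.zeros(4, 4, dtype=torch.bool)
pred[:2, :2] = True           # predicts 4 pixels
target = torch.zeros(4, 4, dtype=torch.bool)
target[:2, :] = True          # ground truth covers 8 pixels
iou, dice = iou_and_dice(pred, target)
# inter = 4, union = 8 -> IoU = 0.5, Dice = 2/3, and 2*IoU/(1+IoU) = 2/3
```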
Q7: How does the Segment Anything Model (SAM) work and why is it significant?
SAM (Meta, 2023) is a foundation model for segmentation that can segment any object in any image given a prompt (point, box, text, or mask).
Architecture:
- Image Encoder: ViT-H (huge) pretrained with MAE. Runs once per image to produce image embeddings. This is the expensive part (~0.15s per image on GPU).
- Prompt Encoder: Encodes user prompts (points, boxes, masks) into embedding space. Lightweight.
- Mask Decoder: Transformer decoder that takes image embeddings + prompt embeddings and outputs segmentation masks. Very fast (~50ms). Outputs 3 masks at different granularities (whole object, part, subpart) with confidence scores.
Training: Trained on SA-1B dataset (11M images, 1.1B masks) using a data engine that iteratively collected masks using the model itself (model-in-the-loop annotation).
Significance:
- Zero-shot transfer: Works on any image domain (medical, satellite, microscopy) without fine-tuning
- Interactive segmentation: Users provide clicks/boxes, SAM segments instantly. Replaces manual annotation tools
- Foundation model paradigm: Like GPT for language, SAM demonstrates that pretraining on massive data creates generalizable vision capabilities
SAM 2 (2024): Extends to video segmentation with a memory mechanism that tracks objects across frames. Uses streaming architecture for real-time performance.
Q8: What is the difference between transposed convolution and bilinear upsampling?
| Aspect | Transposed Convolution | Bilinear Upsampling |
|---|---|---|
| How it works | Learnable upsampling: inserts zeros between input pixels, then applies convolution | Fixed interpolation: computes output pixels as weighted average of 4 nearest input pixels |
| Parameters | Learnable kernel weights (same count as corresponding conv layer) | Zero parameters (purely geometric) |
| Artifacts | Can produce checkerboard artifacts due to uneven overlap of the transposed kernel | Smooth output, no artifacts |
| Best practice | Avoid alone. If used, follow with a regular conv to fix artifacts | Use bilinear upsample followed by a 3x3 conv for learnable refinement |
Recommendation: Use `nn.Upsample(scale_factor=2, mode='bilinear')` + `nn.Conv2d(in_ch, out_ch, 3, padding=1)` instead of `nn.ConvTranspose2d`. This avoids checkerboard artifacts and is standard in modern architectures (DeepLabV3+, SegFormer).
PixelShuffle (sub-pixel convolution): A third option that rearranges channels into spatial dimensions. Used in super-resolution (ESPCN). Efficient and artifact-free: `nn.Conv2d(in_ch, out_ch * r^2, 3)` + `nn.PixelShuffle(r)`.
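The three options side by side, as a shape-level sketch (channel sizes are arbitrary; all map 16x16 to 32x32):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)

# 1. Transposed conv: learnable, but prone to checkerboard artifacts
deconv = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)

# 2. Bilinear upsample + conv: the recommended artifact-free combination
up_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 32, 3, padding=1),
)

# 3. PixelShuffle: conv expands channels by r^2, then rearranges to space
shuffle = nn.Sequential(
    nn.Conv2d(64, 32 * 4, 3, padding=1),  # r=2, so out_ch * r^2 = 128
    nn.PixelShuffle(2),
)

for m in (deconv, up_conv, shuffle):
    assert m(x).shape == (1, 32, 32, 32)  # all three double the resolution
```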
Q9: How do you handle class imbalance in segmentation (e.g., small tumors in large images)?
Class imbalance in segmentation is extreme: in medical imaging, a tumor might occupy 1% of pixels. In autonomous driving, a pedestrian might be 0.1% of the image. Standard cross-entropy loss will ignore these small regions.
Strategies (combine multiple):
- Dice loss or Tversky loss: Region-based losses that measure overlap proportion, inherently handling imbalance. For very small objects, use Tversky loss with higher FN penalty (beta=0.7).
- Weighted cross-entropy: Inverse frequency weighting: `weight_c = 1 / freq_c`. Or median frequency balancing: `weight_c = median_freq / freq_c`.
- Focal loss for segmentation: `-(1-p)^gamma * log(p)` per pixel. Down-weights easy background pixels, focuses on hard boundary pixels.
- Patch-based training: Extract patches centered on foreground regions during training. Ensures each batch contains meaningful positive samples. Used in nnU-Net for medical segmentation.
- Deep supervision: Add auxiliary loss at intermediate decoder levels. Helps small object features survive through the network.
- Online hard example mining (OHEM): Only backpropagate through the k% hardest pixels per image. Focuses learning on misclassified boundary regions.
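The Tversky loss mentioned above can be sketched as follows (binary case with sigmoid probabilities as input; the `alpha`/`beta` defaults reflect the higher-FN-penalty setting discussed above):

```python
import torch

def tversky_loss(pred, target, alpha=0.3, beta=0.7, smooth=1.0):
    """Tversky loss sketch: generalizes Dice with separate FP/FN weights.
    beta > alpha penalizes false negatives more, which helps tiny
    foreground objects (e.g. small tumors) avoid being missed entirely.
    With alpha = beta = 0.5 this reduces to Dice loss.

    pred:   (B, 1, H, W) probabilities (after sigmoid)
    target: (B, 1, H, W) binary mask
    """
    pred, target = pred.flatten(1), target.flatten(1)
    tp = (pred * target).sum(dim=1)          # soft true positives
    fp = (pred * (1 - target)).sum(dim=1)    # soft false positives
    fn = ((1 - pred) * target).sum(dim=1)    # soft false negatives
    tversky = (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)
    return 1.0 - tversky.mean()
```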
nnU-Net approach: The winning framework for medical segmentation automatically selects loss (Dice + CE), patch size based on object size, and architecture based on dataset properties. It demonstrates that good engineering practices often matter more than novel architectures.
Q10: Explain SegFormer and how transformers are applied to segmentation.
SegFormer (2021, Xie et al.) is a transformer-based segmentation architecture that achieves state-of-the-art results with a simple and efficient design.
Architecture:
- Hierarchical Transformer Encoder (Mix Transformer - MiT): Produces multi-scale features at 1/4, 1/8, 1/16, 1/32 resolution (like a CNN backbone). Uses overlapping patch embeddings (unlike ViT's non-overlapping patches) and efficient self-attention with spatial reduction (reduces K and V spatial dimensions).
- Lightweight All-MLP Decoder: Takes multi-scale features from the encoder, projects each to a common channel dimension with a linear layer, upsamples all to 1/4 resolution, concatenates, and applies a final linear classifier. No convolutions, no complex decoder.
Key innovations:
- Efficient self-attention: Reduces the key/value spatial dimensions by a factor R (e.g., R=64 for early layers). Reduces complexity from O(N^2) to O(N^2/R).
- No positional encoding: Uses 3x3 depthwise convolutions in the FFN instead of positional encodings, which provides positional information while being flexible to different input sizes.
- Simple decoder: Proves that a powerful encoder makes an expensive decoder unnecessary. The all-MLP decoder has minimal parameters.
Results: SegFormer-B5 achieves 82.4 mIoU on ADE20K. SegFormer-B0 achieves 76.2 mIoU with only 3.8M parameters, making it suitable for edge deployment.
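The efficient self-attention idea can be sketched as follows (a simplified stand-in built on `nn.MultiheadAttention`, not SegFormer's exact module; here a strided conv shrinks K and V spatially, cutting the key/value sequence length by the stride squared):

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Sketch of SegFormer-style spatial-reduction attention: queries come
    from the full-resolution feature map, while keys/values come from a
    conv-downsampled copy, so attention cost drops roughly by sr^2."""
    def __init__(self, dim=64, heads=2, sr=4):
        super().__init__()
        self.reduce = nn.Conv2d(dim, dim, kernel_size=sr, stride=sr)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)     # (B, H*W, C) full-length queries
        kv = self.reduce(x).flatten(2).transpose(1, 2)  # shorter K/V sequence
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, C, H, W)
```

The output keeps the input's spatial resolution; only the attention's inner sequence is shortened, which is what makes high-resolution early stages affordable.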