Intermediate

Segmentation Questions

These 10 questions cover image segmentation concepts critical for CV roles in medical imaging, autonomous driving, robotics, and augmented reality. Segmentation requires pixel-level understanding, making it one of the most technically demanding CV tasks.

Q1: What is the difference between semantic, instance, and panoptic segmentation?

💡
Model Answer:
  • Semantic segmentation: Assigns a class label to every pixel; does NOT distinguish between instances of the same class. Example output: all "person" pixels get the same label, all "car" pixels get the same label. Key models: FCN, DeepLab, SegFormer.
  • Instance segmentation: Detects each object instance and provides a pixel mask for it; covers only "thing" classes (countable objects). Example output: Person 1 gets mask A, Person 2 gets mask B; background is unlabeled. Key models: Mask R-CNN, YOLACT, SOLOv2.
  • Panoptic segmentation: Unifies both: every pixel gets a class label, and pixels of "thing" classes also get an instance ID; "stuff" classes (sky, road) get only a class label. Example output: Person 1 (ID=1), Person 2 (ID=2), sky (no instance), road (no instance). Key models: Panoptic FPN, MaskFormer, Mask2Former.

"Things" vs "stuff": Things are countable objects (person, car, dog). Stuff is amorphous regions (sky, grass, road). Panoptic segmentation distinguishes instances for things but not stuff.

Modern unification: Mask2Former (2022) uses a single architecture for all three tasks. It treats every segment (thing or stuff) as a masked attention query, achieving state-of-the-art on semantic, instance, and panoptic benchmarks with the same model.

Q2: Explain the U-Net architecture and why it works well for medical image segmentation.

💡
Model Answer:

U-Net is an encoder-decoder architecture with skip connections that form a U-shape:

  • Encoder (contracting path): Series of 3x3 conv + ReLU + 2x2 max pool blocks. Doubles channels and halves spatial resolution at each level. Captures context and "what" is in the image.
  • Decoder (expanding path): 2x2 transposed convolutions (or bilinear upsample + 1x1 conv) to increase spatial resolution. Followed by 3x3 conv blocks.
  • Skip connections: Concatenate encoder features with decoder features at matching resolutions. This is the critical innovation — it provides fine-grained spatial detail from early layers to the decoder.

Why it excels for medical imaging:

  • Works with limited data: Skip connections enable learning from very small datasets (hundreds of images) because they preserve spatial information that would otherwise be lost through pooling.
  • Precise boundaries: The concatenation of encoder features provides pixel-level localization, critical for medical applications where tumor boundaries must be exact.
  • Heavy augmentation compatibility: U-Net was designed with elastic deformations for data augmentation, which is particularly effective for medical tissues.
A minimal U-Net implementation in PyTorch:

import torch
import torch.nn as nn

class UNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        # Encoder
        self.enc1 = self._block(in_channels, 64)
        self.enc2 = self._block(64, 128)
        self.enc3 = self._block(128, 256)
        self.enc4 = self._block(256, 512)
        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bottleneck = self._block(512, 1024)

        # Decoder
        self.up4 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
        self.dec4 = self._block(1024, 512)  # 512 + 512 from skip
        self.up3 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
        self.dec3 = self._block(512, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.dec2 = self._block(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = self._block(128, 64)

        self.out = nn.Conv2d(64, num_classes, kernel_size=1)

    def _block(self, in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Encoder
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))

        # Bottleneck
        b = self.bottleneck(self.pool(e4))

        # Decoder with skip connections
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))

        return self.out(d1)  # (B, num_classes, H, W)

Q3: How does Mask R-CNN extend Faster R-CNN for instance segmentation?

💡
Model Answer:

Mask R-CNN adds a parallel mask prediction branch to Faster R-CNN's existing classification and box regression branches:

  1. Backbone + FPN: Same as Faster R-CNN — ResNet + FPN produces multi-scale features
  2. RPN: Same — generates region proposals
  3. RoI Align: Critical upgrade from RoI Pooling. Uses bilinear interpolation instead of quantization, eliminating misalignment between the RoI and extracted features. This is essential for pixel-accurate masks.
  4. Mask branch: A small FCN (4 conv layers + 1 deconv layer) that outputs a 28x28 binary mask per class for each RoI. The mask branch predicts K masks (one per class), and the classification branch selects which mask to use.

Key design decisions:

  • Decoupled mask and classification: The mask branch predicts masks independently for each class (no inter-class competition). This avoids the problem of masks bleeding between classes.
  • Loss: Multi-task loss = L_cls + L_box + L_mask. Mask loss is binary cross-entropy applied only to the mask for the predicted class (not all K masks).
  • RoI Align matters: Replacing RoI Pooling with RoI Align improved mask AP by ~3 points, showing that pixel-level alignment is crucial for segmentation.

Variants: PointRend adds a refinement module that treats mask edge pixels as a point cloud for higher-resolution boundaries. Cascade Mask R-CNN uses multiple refinement stages for better box and mask quality.

Q4: What loss functions are used for segmentation? Compare Dice loss vs cross-entropy.

💡
Model Answer:
  • Cross-entropy: -sum(y * log(p)) per pixel. Strengths: standard, well understood, smooth gradients. Weaknesses: dominated by the majority class in imbalanced data (e.g., 95% background).
  • Dice loss: 1 - 2*|P intersection G| / (|P| + |G|). Strengths: directly optimizes overlap and handles class imbalance naturally, since it measures relative overlap rather than absolute pixel counts. Weaknesses: noisy gradients for very small regions; can be unstable early in training.
  • Focal loss: -alpha * (1-p)^gamma * log(p). Strengths: down-weights easy pixels; useful for small, hard-to-segment objects. Weaknesses: extra hyperparameters (alpha, gamma) to tune.
  • Tversky loss: a generalization of Dice with separate FP/FN weights. Strengths: controls the precision-recall trade-off by penalizing FP and FN differently. Weaknesses: more hyperparameters; needs domain knowledge to set the FP/FN weights.

Best practice: Combine cross-entropy and Dice loss: L = CE + Dice. Cross-entropy provides stable per-pixel gradients while Dice ensures the model optimizes for overlap. This combination works well across most segmentation tasks.

import torch
import torch.nn.functional as F

def dice_loss(pred, target, smooth=1.0):
    """Dice loss for binary segmentation.
    Args:
        pred: (B, 1, H, W) sigmoid output
        target: (B, 1, H, W) binary mask
    """
    pred = pred.flatten(1)   # (B, H*W)
    target = target.flatten(1)

    intersection = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)

    dice = (2.0 * intersection + smooth) / (union + smooth)
    return 1.0 - dice.mean()

def combined_loss(pred_logits, target):
    """Combined CE + Dice loss."""
    pred_prob = torch.sigmoid(pred_logits)
    ce = F.binary_cross_entropy_with_logits(pred_logits, target)
    dice = dice_loss(pred_prob, target)
    return ce + dice

Q5: What is DeepLab and how does atrous (dilated) convolution help segmentation?

💡
Model Answer:

DeepLab is a family of semantic segmentation models that use atrous (dilated) convolutions and Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context without losing spatial resolution.

Atrous convolution: Inserts gaps (zeros) between kernel elements, expanding the receptive field without increasing parameters or reducing resolution. A 3x3 kernel with dilation rate 2 has an effective receptive field of 5x5 but only 9 parameters.

ASPP (Atrous Spatial Pyramid Pooling): Applies parallel atrous convolutions at multiple dilation rates (e.g., 6, 12, 18) plus a global average pooling branch. Concatenates results to capture features at multiple scales simultaneously.

Evolution:

  • DeepLabV1: Atrous convolutions + CRF post-processing for boundary refinement
  • DeepLabV2: Added ASPP for multi-scale context
  • DeepLabV3: Improved ASPP with batch norm and image-level features. Removed CRF
  • DeepLabV3+: Added an encoder-decoder structure with a simple decoder that uses low-level features for sharper boundaries. This is the most widely used version

Why atrous convolutions beat pooling for segmentation: Pooling reduces resolution, which is destructive for dense prediction. Atrous convolutions increase receptive field while maintaining resolution, preserving fine spatial detail needed for accurate pixel labels.
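The receptive-field and resolution claims above can be checked directly in PyTorch; a sketch with made-up tensor sizes, including a simplified ASPP-style head (not the exact DeepLab module, which also has 1x1 conv and global-pooling branches):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)

# Standard 3x3 conv vs. dilated 3x3 conv: identical parameter count,
# larger receptive field, and (with matching padding) identical output size
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)               # RF 3x3
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)     # RF 5x5

print(standard(x).shape, atrous(x).shape)  # both torch.Size([1, 64, 65, 65])

# ASPP-style parallel branches at multiple dilation rates, concatenated
aspp_branches = nn.ModuleList([
    nn.Conv2d(64, 64, kernel_size=3, padding=r, dilation=r) for r in (6, 12, 18)
])
multi_scale = torch.cat([branch(x) for branch in aspp_branches], dim=1)
print(multi_scale.shape)  # torch.Size([1, 192, 65, 65])
```

With padding equal to the dilation rate, the output keeps the input's spatial size, so multi-scale context is gathered without any downsampling.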

Q6: What metrics are used to evaluate segmentation models?

💡
Model Answer:
  • Pixel accuracy: correct pixels / total pixels. The simplest metric, but misleading when classes are imbalanced (90% accuracy is possible by predicting all background).
  • mIoU (mean IoU): average of per-class IoU, where IoU per class = TP / (TP + FP + FN). The standard metric for semantic segmentation, used on ADE20K, Cityscapes, and PASCAL VOC.
  • Dice score / F1: 2*TP / (2*TP + FP + FN). The medical imaging standard; equivalent to the F1 score and more forgiving than IoU for small objects.
  • AP (mask): average precision computed with mask IoU. The instance segmentation metric (COCO); the same as detection AP but with mask overlap instead of box overlap.
  • PQ (panoptic quality): PQ = SQ * RQ, where SQ is segmentation quality and RQ is recognition quality. The panoptic segmentation metric; decomposes into detection quality (did you find it?) and segmentation quality (how well did you segment it?).
  • Boundary F1: F1 score computed on boundary pixels only. Evaluates edge quality specifically, important for applications requiring precise contours.

Relationship between Dice and IoU: Dice = 2*IoU / (1 + IoU), so Dice is always at least as high as IoU for the same prediction (they are equal only at 0 and 1). A Dice of 0.90 corresponds to an IoU of about 0.818. When comparing models, make sure you are using the same metric.
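A minimal mIoU computation matching the per-class formula above (a sketch; benchmark implementations typically also support an ignore index for unlabeled pixels):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes
    that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both pred and target
            ious.append(tp / denom)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
# Class 0: TP=3, FP=0, FN=1 -> IoU 0.75; class 1: TP=4, FP=1, FN=0 -> IoU 0.80
print(mean_iou(pred, target, num_classes=2))  # 0.775
```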

Q7: How does the Segment Anything Model (SAM) work and why is it significant?

💡
Model Answer:

SAM (Meta, 2023) is a foundation model for segmentation that can segment any object in any image given a prompt (point, box, text, or mask).

Architecture:

  • Image Encoder: ViT-H (huge) pretrained with MAE. Runs once per image to produce image embeddings. This is the expensive part (~0.15s per image on GPU).
  • Prompt Encoder: Encodes user prompts (points, boxes, masks) into embedding space. Lightweight.
  • Mask Decoder: Transformer decoder that takes image embeddings + prompt embeddings and outputs segmentation masks. Very fast (~50ms). Outputs 3 masks at different granularities (whole object, part, subpart) with confidence scores.

Training: Trained on SA-1B dataset (11M images, 1.1B masks) using a data engine that iteratively collected masks using the model itself (model-in-the-loop annotation).

Significance:

  • Zero-shot transfer: Works on any image domain (medical, satellite, microscopy) without fine-tuning
  • Interactive segmentation: Users provide clicks/boxes, SAM segments instantly. Replaces manual annotation tools
  • Foundation model paradigm: Like GPT for language, SAM demonstrates that pretraining on massive data creates generalizable vision capabilities

SAM 2 (2024): Extends to video segmentation with a memory mechanism that tracks objects across frames. Uses streaming architecture for real-time performance.

Q8: What is the difference between transposed convolution and bilinear upsampling?

💡
Model Answer:
  • How it works: transposed convolution is learnable upsampling that inserts zeros between input pixels and then applies a convolution; bilinear upsampling is fixed interpolation that computes each output pixel as a weighted average of the 4 nearest input pixels.
  • Parameters: transposed convolution has learnable kernel weights (the same count as the corresponding conv layer); bilinear upsampling has zero parameters (it is purely geometric).
  • Artifacts: transposed convolution can produce checkerboard artifacts due to uneven overlap of the transposed kernel; bilinear upsampling gives smooth output with no artifacts.
  • Best practice: avoid transposed convolution alone (if used, follow it with a regular conv to fix artifacts); prefer bilinear upsampling followed by a 3x3 conv for learnable refinement.

Recommendation: Use nn.Upsample(scale_factor=2, mode='bilinear') + nn.Conv2d(in_ch, out_ch, 3, padding=1) instead of nn.ConvTranspose2d. This avoids checkerboard artifacts and is standard in modern architectures (DeepLabV3+, SegFormer).

PixelShuffle (sub-pixel convolution): A third option that rearranges channels into spatial dimensions. Used in super-resolution (ESPCN). Efficient and artifact-free: nn.Conv2d(in_ch, out_ch * r^2, 3) + nn.PixelShuffle(r).
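The three upsampling options can be compared side by side; a sketch with made-up channel counts, each doubling spatial resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)

# Option 1: transposed convolution (learnable, but prone to checkerboard)
deconv = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)

# Option 2: bilinear upsample + 3x3 conv (the recommended default)
up_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)

# Option 3: sub-pixel convolution with upscale factor r=2:
# conv produces out_ch * r^2 channels, PixelShuffle rearranges them spatially
subpixel = nn.Sequential(
    nn.Conv2d(64, 32 * 2**2, kernel_size=3, padding=1),
    nn.PixelShuffle(2),
)

for module in (deconv, up_conv, subpixel):
    print(module(x).shape)  # each: torch.Size([1, 32, 32, 32])
```

All three map (1, 64, 16, 16) to (1, 32, 32, 32); they differ only in parameter count and artifact behavior, as summarized above.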

Q9: How do you handle class imbalance in segmentation (e.g., small tumors in large images)?

💡
Model Answer:

Class imbalance in segmentation is extreme: in medical imaging, a tumor might occupy 1% of pixels. In autonomous driving, a pedestrian might be 0.1% of the image. Standard cross-entropy loss will ignore these small regions.

Strategies (combine multiple):

  1. Dice loss or Tversky loss: Region-based losses that measure overlap proportion, inherently handling imbalance. For very small objects, use Tversky loss with higher FN penalty (beta=0.7).
  2. Weighted cross-entropy: Inverse frequency weighting: weight_c = 1 / freq_c. Or median frequency balancing: weight_c = median_freq / freq_c.
  3. Focal loss for segmentation: -(1-p)^gamma * log(p) per pixel. Down-weights easy background pixels, focuses on hard boundary pixels.
  4. Patch-based training: Extract patches centered on foreground regions during training. Ensures each batch contains meaningful positive samples. Used in nnU-Net for medical segmentation.
  5. Deep supervision: Add auxiliary loss at intermediate decoder levels. Helps small object features survive through the network.
  6. Online hard example mining (OHEM): Only backpropagate through the k% hardest pixels per image. Focuses learning on misclassified boundary regions.

nnU-Net approach: The winning framework for medical segmentation automatically selects loss (Dice + CE), patch size based on object size, and architecture based on dataset properties. It demonstrates that good engineering practices often matter more than novel architectures.
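The Tversky loss from strategy 1 is a small modification of the dice_loss shown in Q4; a sketch for binary segmentation using the beta=0.7 setting mentioned above:

```python
import torch

def tversky_loss(pred, target, alpha=0.3, beta=0.7, smooth=1.0):
    """Tversky loss for binary segmentation.
    alpha weights false positives, beta weights false negatives;
    beta > alpha pushes the model toward recall (fewer missed foreground
    pixels). alpha = beta = 0.5 recovers the Dice loss.
    Args:
        pred: (B, 1, H, W) sigmoid output
        target: (B, 1, H, W) binary mask
    """
    pred = pred.flatten(1)
    target = target.flatten(1)
    tp = (pred * target).sum(dim=1)
    fp = (pred * (1 - target)).sum(dim=1)
    fn = ((1 - pred) * target).sum(dim=1)
    tversky = (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)
    return 1.0 - tversky.mean()
```

Raising beta makes each missed tumor pixel cost more than each false alarm, which is usually the right trade-off when the foreground is tiny.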

Q10: Explain SegFormer and how transformers are applied to segmentation.

💡
Model Answer:

SegFormer (2021, Xie et al.) is a transformer-based segmentation architecture that achieves state-of-the-art results with a simple and efficient design.

Architecture:

  • Hierarchical Transformer Encoder (Mix Transformer - MiT): Produces multi-scale features at 1/4, 1/8, 1/16, 1/32 resolution (like a CNN backbone). Uses overlapping patch embeddings (unlike ViT's non-overlapping patches) and efficient self-attention with spatial reduction (reduces K and V spatial dimensions).
  • Lightweight All-MLP Decoder: Takes multi-scale features from the encoder, projects each to a common channel dimension with a linear layer, upsamples all to 1/4 resolution, concatenates, and applies a final linear classifier. No convolutions, no complex decoder.

Key innovations:

  • Efficient self-attention: Reduces the key/value spatial dimensions by a factor R (e.g., R=64 for early layers). Reduces complexity from O(N^2) to O(N^2/R).
  • No positional encoding: Uses 3x3 depthwise convolutions in the FFN instead of positional encodings, which provides positional information while being flexible to different input sizes.
  • Simple decoder: Proves that a powerful encoder makes an expensive decoder unnecessary. The all-MLP decoder has minimal parameters.
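The spatial-reduction idea can be sketched with PyTorch's built-in MultiheadAttention (a simplified illustration, not the reference SegFormer code; here sr_ratio is the per-axis reduction, so the key/value token count shrinks by sr_ratio^2):

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention with spatially reduced keys/values, a sketch of
    the SegFormer mechanism."""
    def __init__(self, dim, num_heads=1, sr_ratio=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Strided conv shrinks the K/V token grid by sr_ratio per axis,
        # cutting attention cost from O(N^2) to O(N^2 / sr_ratio^2)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        # x: (B, N, C) with N = h * w tokens
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, N / sr_ratio^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)  # queries keep full resolution
        return out

x = torch.randn(2, 64 * 64, 32)  # 64x64 token grid, 32 channels
attn = EfficientSelfAttention(dim=32, sr_ratio=8)
print(attn(x, 64, 64).shape)  # torch.Size([2, 4096, 32])
```

Queries stay at full resolution so the output remains a dense per-token feature map; only the attended-to set is compressed, which is why the savings come essentially for free at segmentation-scale inputs.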

Results: SegFormer-B5 achieves 51.8 mIoU on ADE20K and 84.0 on Cityscapes. SegFormer-B0 reaches 37.4 mIoU on ADE20K with only 3.8M parameters, making it suitable for edge deployment.