Segmentation Questions
These 10 questions cover image segmentation concepts critical for CV roles in medical imaging, autonomous driving, robotics, and augmented reality. Segmentation requires pixel-level understanding, making it one of the most technically demanding CV tasks.
Q1: What is the difference between semantic, instance, and panoptic segmentation?
| Type | What It Does | Example Output | Key Models |
|---|---|---|---|
| Semantic Segmentation | Assigns a class label to every pixel. Does NOT distinguish between instances of the same class | All "person" pixels get the same label, all "car" pixels get the same label | FCN, DeepLab, SegFormer |
| Instance Segmentation | Detects each object instance and provides a pixel mask for each. Only for "thing" classes (countable objects) | Person 1 gets mask A, Person 2 gets mask B. Background is unlabeled | Mask R-CNN, YOLACT, SOLOv2 |
| Panoptic Segmentation | Unifies both: every pixel gets a class label AND instance ID for "thing" classes. "Stuff" classes (sky, road) get only class labels | Person 1 (ID=1), Person 2 (ID=2), sky (no instance), road (no instance) | Panoptic FPN, MaskFormer, Mask2Former |
"Things" vs "stuff": Things are countable objects (person, car, dog). Stuff is amorphous regions (sky, grass, road). Panoptic segmentation distinguishes instances for things but not stuff.
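One common way to represent a panoptic map in code is to pack the class label and instance ID into a single integer per pixel (the divisor of 1000 below follows the COCO panoptic convention, but is an assumption here; datasets vary):

```python
import numpy as np

# Sketch of a packed panoptic label map: each pixel stores
# class_id * LABEL_DIVISOR + instance_id, so "stuff" pixels share
# instance_id 0 while "thing" pixels get unique per-instance IDs.
LABEL_DIVISOR = 1000  # illustrative divisor; check your dataset's convention

def encode_panoptic(class_map, instance_map):
    """Pack class and instance maps into one integer map."""
    return class_map * LABEL_DIVISOR + instance_map

def decode_panoptic(panoptic_map):
    """Recover (class_map, instance_map) from a packed map."""
    return panoptic_map // LABEL_DIVISOR, panoptic_map % LABEL_DIVISOR

# Two "person" pixels (class 1, instances 1 and 2) and two "road"
# pixels (class 23, no instance ID since road is "stuff").
class_map = np.array([[1, 1], [23, 23]])
instance_map = np.array([[1, 2], [0, 0]])
pan = encode_panoptic(class_map, instance_map)   # [[1001, 1002], [23000, 23000]]
```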
Modern unification: Mask2Former (2022) uses a single architecture for all three tasks. It treats every segment (thing or stuff) as a masked attention query, achieving state-of-the-art on semantic, instance, and panoptic benchmarks with the same model.
Q2: Explain the U-Net architecture and why it works well for medical image segmentation.
U-Net is an encoder-decoder architecture with skip connections that form a U-shape:
- Encoder (contracting path): Series of 3x3 conv + ReLU + 2x2 max pool blocks. Doubles channels and halves spatial resolution at each level. Captures context and "what" is in the image.
- Decoder (expanding path): 2x2 transposed convolutions (or bilinear upsample + 1x1 conv) to increase spatial resolution. Followed by 3x3 conv blocks.
- Skip connections: Concatenate encoder features with decoder features at matching resolutions. This is the critical innovation — it provides fine-grained spatial detail from early layers to the decoder.
Why it excels for medical imaging:
- Works with limited data: Skip connections enable learning from very small datasets (hundreds of images) because they preserve spatial information that would otherwise be lost through pooling.
- Precise boundaries: The concatenation of encoder features provides pixel-level localization, critical for medical applications where tumor boundaries must be exact.
- Heavy augmentation compatibility: U-Net was designed with elastic deformations for data augmentation, which is particularly effective for medical tissues.
```python
import torch
import torch.nn as nn

class UNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        # Encoder
        self.enc1 = self._block(in_channels, 64)
        self.enc2 = self._block(64, 128)
        self.enc3 = self._block(128, 256)
        self.enc4 = self._block(256, 512)
        self.pool = nn.MaxPool2d(2)
        # Bottleneck
        self.bottleneck = self._block(512, 1024)
        # Decoder
        self.up4 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
        self.dec4 = self._block(1024, 512)  # 512 + 512 from skip
        self.up3 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
        self.dec3 = self._block(512, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.dec2 = self._block(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = self._block(128, 64)
        self.out = nn.Conv2d(64, num_classes, kernel_size=1)

    def _block(self, in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Encoder
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        # Bottleneck
        b = self.bottleneck(self.pool(e4))
        # Decoder with skip connections
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out(d1)  # (B, num_classes, H, W)
```
Q3: How does Mask R-CNN extend Faster R-CNN for instance segmentation?
Mask R-CNN adds a parallel mask prediction branch to Faster R-CNN's existing classification and box regression branches:
- Backbone + FPN: Same as Faster R-CNN — ResNet + FPN produces multi-scale features
- RPN: Same — generates region proposals
- RoI Align: Critical upgrade from RoI Pooling. Uses bilinear interpolation instead of quantization, eliminating misalignment between the RoI and extracted features. This is essential for pixel-accurate masks.
- Mask branch: A small FCN (4 conv layers + 1 deconv layer) that outputs a 28x28 binary mask per class for each RoI. The mask branch predicts K masks (one per class), and the classification branch selects which mask to use.
Key design decisions:
- Decoupled mask and classification: The mask branch predicts masks independently for each class (no inter-class competition). This avoids the problem of masks bleeding between classes.
- Loss: Multi-task loss = L_cls + L_box + L_mask. Mask loss is binary cross-entropy applied only to the mask for the predicted class (not all K masks).
- RoI Align matters: Replacing RoI Pooling with RoI Align improved mask AP by ~3 points, showing that pixel-level alignment is crucial for segmentation.
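The key idea behind RoI Align — bilinear sampling at exact fractional coordinates, with no quantization — can be sketched with `grid_sample` (a toy one-sample-per-bin version for clarity, not the production `torchvision.ops.roi_align`):

```python
import torch
import torch.nn.functional as F

def roi_align_simple(feat, box, output_size=7):
    """Toy RoI Align: bilinearly sample output_size x output_size points
    inside `box` = (x1, y1, x2, y2) in feature-map coordinates.
    One sampling point per bin (sampling_ratio=1) for simplicity."""
    _, _, H, W = feat.shape
    x1, y1, x2, y2 = box
    # Bin centers at fractional coordinates -- no rounding, unlike RoI Pooling
    xs = x1 + (torch.arange(output_size) + 0.5) * (x2 - x1) / output_size
    ys = y1 + (torch.arange(output_size) + 0.5) * (y2 - y1) / output_size
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    # Normalize to [-1, 1] for grid_sample (align_corners=True convention)
    grid = torch.stack([2 * gx / (W - 1) - 1, 2 * gy / (H - 1) - 1], dim=-1)
    return F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)

# 4x4 feature map with feat[0, 0, y, x] = 4*y + x
feat = torch.arange(16.0).view(1, 1, 4, 4)
pooled = roi_align_simple(feat, (0.5, 0.5, 2.5, 2.5), output_size=2)
# bin centers land exactly on pixels (1,1), (2,1), (1,2), (2,2)
# -> pooled[0, 0] == [[5., 6.], [9., 10.]]
```

RoI Pooling would snap those fractional bin boundaries to integers first, which is exactly the misalignment RoI Align removes.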
Variants: PointRend adds a refinement module that treats mask edge pixels as a point cloud for higher-resolution boundaries. Cascade Mask R-CNN uses multiple refinement stages for better box and mask quality.
Q4: What loss functions are used for segmentation? Compare Dice loss vs cross-entropy.
| Loss | Formula | Strengths | Weaknesses |
|---|---|---|---|
| Cross-Entropy | -sum(y * log(p)) per pixel | Standard, well-understood, smooth gradients | Dominated by majority class in imbalanced data (e.g., 95% background) |
| Dice Loss | 1 - 2\|P intersection G\| / (\|P\| + \|G\|) | Directly optimizes overlap. Handles class imbalance naturally since it measures relative overlap, not absolute pixel counts | Noisy gradients for very small regions. Can be unstable early in training |
| Focal Loss | -alpha * (1-p)^gamma * log(p) | Down-weights easy pixels. Useful for small, hard-to-segment objects | Extra hyperparameters (alpha, gamma) to tune |
| Tversky Loss | Generalization of Dice with separate FP/FN weights | Control precision-recall trade-off by penalizing FP and FN differently | More hyperparameters. Needs domain knowledge to set FP/FN weights |
Best practice: Combine cross-entropy and Dice loss: L = CE + Dice. Cross-entropy provides stable per-pixel gradients while Dice ensures the model optimizes for overlap. This combination works well across most segmentation tasks.
```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, smooth=1.0):
    """Dice loss for binary segmentation.

    Args:
        pred: (B, 1, H, W) sigmoid output
        target: (B, 1, H, W) binary mask
    """
    pred = pred.flatten(1)      # (B, H*W)
    target = target.flatten(1)
    intersection = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    dice = (2.0 * intersection + smooth) / (union + smooth)
    return 1.0 - dice.mean()

def combined_loss(pred_logits, target):
    """Combined CE + Dice loss."""
    pred_prob = torch.sigmoid(pred_logits)
    ce = F.binary_cross_entropy_with_logits(pred_logits, target)
    dice = dice_loss(pred_prob, target)
    return ce + dice
```
Q5: What is DeepLab and how does atrous (dilated) convolution help segmentation?
DeepLab is a family of semantic segmentation models that use atrous (dilated) convolutions and Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context without losing spatial resolution.
Atrous convolution: Inserts gaps (zeros) between kernel elements, expanding the receptive field without increasing parameters or reducing resolution. A 3x3 kernel with dilation rate 2 has an effective receptive field of 5x5 but only 9 parameters.
ASPP (Atrous Spatial Pyramid Pooling): Applies parallel atrous convolutions at multiple dilation rates (e.g., 6, 12, 18) plus a global average pooling branch. Concatenates results to capture features at multiple scales simultaneously.
Evolution:
- DeepLabV1: Atrous convolutions + CRF post-processing for boundary refinement
- DeepLabV2: Added ASPP for multi-scale context
- DeepLabV3: Improved ASPP with batch norm and image-level features. Removed CRF
- DeepLabV3+: Added an encoder-decoder structure with a simple decoder that uses low-level features for sharper boundaries. This is the most widely used version
Why atrous convolutions beat pooling for segmentation: Pooling reduces resolution, which is destructive for dense prediction. Atrous convolutions increase receptive field while maintaining resolution, preserving fine spatial detail needed for accurate pixel labels.
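A minimal ASPP module along these lines might look as follows (the rates match those quoted above; channel widths and the single-conv branches are illustrative simplifications of the paper's config):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of Atrous Spatial Pyramid Pooling: parallel atrous convs at
    several dilation rates plus a global-average-pooling branch,
    concatenated and fused by a 1x1 conv."""
    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +  # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.pool = nn.Sequential(           # image-level features
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1)
        )
        self.fuse = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        g = self.pool(x)                     # (B, out_ch, 1, 1)
        feats.append(nn.functional.interpolate(
            g, size=x.shape[-2:], mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))
```

Note that every branch preserves the input's spatial resolution (padding equals dilation for the 3x3 branches), which is exactly the property pooling-based context aggregation lacks.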
Q6: What metrics are used to evaluate segmentation models?
| Metric | Formula | Use Case |
|---|---|---|
| Pixel Accuracy | Correct pixels / Total pixels | Simplest metric. Misleading when classes are imbalanced (90% accuracy by predicting all background) |
| mIoU (mean IoU) | Average IoU across all classes: IoU per class = TP / (TP + FP + FN) | The standard metric for semantic segmentation. Used on ADE20K, Cityscapes, PASCAL VOC |
| Dice Score / F1 | 2*TP / (2*TP + FP + FN) | Medical imaging standard. Equivalent to F1 score. More forgiving than IoU for small objects |
| AP (mask) | Average Precision using mask IoU | Instance segmentation (COCO). Same as detection AP but uses mask overlap instead of box overlap |
| PQ (Panoptic Quality) | SQ * RQ where SQ=segmentation quality, RQ=recognition quality | Panoptic segmentation. Decomposes into detection quality (did you find it?) and segmentation quality (how well did you segment it?) |
| Boundary F1 | F1 score computed on boundary pixels only | Evaluates edge quality specifically, important for applications requiring precise contours |
Relationship between Dice and IoU: Dice = 2*IoU / (1 + IoU). Dice is always higher than IoU for the same prediction. A Dice of 0.90 corresponds to an IoU of 0.818. When comparing models, make sure you are using the same metric.
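That relationship is easy to verify numerically on binary masks (the helper below is illustrative; counts are computed directly from the masks):

```python
import torch

def iou_and_dice(pred, target, eps=1e-7):
    """Compute IoU and Dice score for a pair of binary (bool) masks."""
    inter = (pred & target).sum().item()
    union = (pred | target).sum().item()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum().item() + target.sum().item() + eps)
    return iou, dice

pred = torch.zeros(4, 4, dtype=torch.bool)
pred[:2, :2] = True           # predicts 4 pixels
target = torch.zeros(4, 4, dtype=torch.bool)
target[:2, :] = True          # ground truth covers 8 pixels
iou, dice = iou_and_dice(pred, target)
# inter = 4, union = 8 -> IoU = 0.5, Dice = 2/3, and 2*IoU/(1+IoU) = 2/3
```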
Q7: How does the Segment Anything Model (SAM) work and why is it significant?
SAM (Meta, 2023) is a foundation model for segmentation that can segment any object in any image given a prompt (point, box, text, or mask).
Architecture:
- Image Encoder: ViT-H (huge) pretrained with MAE. Runs once per image to produce image embeddings. This is the expensive part (~0.15s per image on GPU).
- Prompt Encoder: Encodes user prompts (points, boxes, masks) into embedding space. Lightweight.
- Mask Decoder: Transformer decoder that takes image embeddings + prompt embeddings and outputs segmentation masks. Very fast (~50ms). Outputs 3 masks at different granularities (whole object, part, subpart) with confidence scores.
Training: Trained on SA-1B dataset (11M images, 1.1B masks) using a data engine that iteratively collected masks using the model itself (model-in-the-loop annotation).
Significance:
- Zero-shot transfer: Works on any image domain (medical, satellite, microscopy) without fine-tuning
- Interactive segmentation: Users provide clicks/boxes, SAM segments instantly. Replaces manual annotation tools
- Foundation model paradigm: Like GPT for language, SAM demonstrates that pretraining on massive data creates generalizable vision capabilities
SAM 2 (2024): Extends to video segmentation with a memory mechanism that tracks objects across frames. Uses streaming architecture for real-time performance.
Q8: What is the difference between transposed convolution and bilinear upsampling?
| Aspect | Transposed Convolution | Bilinear Upsampling |
|---|---|---|
| How it works | Learnable upsampling: inserts zeros between input pixels, then applies convolution | Fixed interpolation: computes output pixels as weighted average of 4 nearest input pixels |
| Parameters | Learnable kernel weights (same count as corresponding conv layer) | Zero parameters (purely geometric) |
| Artifacts | Can produce checkerboard artifacts due to uneven overlap of the transposed kernel | Smooth output, no artifacts |
| Best practice | Avoid alone. If used, follow with a regular conv to fix artifacts | Use bilinear upsample followed by a 3x3 conv for learnable refinement |
Recommendation: Use `nn.Upsample(scale_factor=2, mode='bilinear')` + `nn.Conv2d(in_ch, out_ch, 3, padding=1)` instead of `nn.ConvTranspose2d`. This avoids checkerboard artifacts and is standard in modern architectures (DeepLabV3+, SegFormer).
PixelShuffle (sub-pixel convolution): A third option that rearranges channels into spatial dimensions. Used in super-resolution (ESPCN). Efficient and artifact-free: `nn.Conv2d(in_ch, out_ch * r^2, 3)` + `nn.PixelShuffle(r)`.
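The three options side by side, as a shape-level sketch (channel sizes are arbitrary; all map 16x16 to 32x32):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)

# 1. Transposed conv: learnable, but prone to checkerboard artifacts
deconv = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)

# 2. Bilinear upsample + conv: the recommended artifact-free combination
up_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 32, 3, padding=1),
)

# 3. PixelShuffle: conv expands channels by r^2, then rearranges to space
shuffle = nn.Sequential(
    nn.Conv2d(64, 32 * 4, 3, padding=1),  # r=2, so out_ch * r^2 = 128
    nn.PixelShuffle(2),
)

for m in (deconv, up_conv, shuffle):
    assert m(x).shape == (1, 32, 32, 32)  # all three double the resolution
```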
Q9: How do you handle class imbalance in segmentation (e.g., small tumors in large images)?
Class imbalance in segmentation is extreme: in medical imaging, a tumor might occupy 1% of pixels. In autonomous driving, a pedestrian might be 0.1% of the image. Standard cross-entropy loss will ignore these small regions.
Strategies (combine multiple):
- Dice loss or Tversky loss: Region-based losses that measure overlap proportion, inherently handling imbalance. For very small objects, use Tversky loss with higher FN penalty (beta=0.7).
- Weighted cross-entropy: Inverse frequency weighting: `weight_c = 1 / freq_c`. Or median frequency balancing: `weight_c = median_freq / freq_c`.
- Focal loss for segmentation: `-(1-p)^gamma * log(p)` per pixel. Down-weights easy background pixels, focuses on hard boundary pixels.
- Patch-based training: Extract patches centered on foreground regions during training. Ensures each batch contains meaningful positive samples. Used in nnU-Net for medical segmentation.
- Deep supervision: Add auxiliary loss at intermediate decoder levels. Helps small object features survive through the network.
- Online hard example mining (OHEM): Only backpropagate through the k% hardest pixels per image. Focuses learning on misclassified boundary regions.
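The Tversky loss mentioned above can be sketched as follows (binary case with sigmoid probabilities as input; the `alpha`/`beta` defaults reflect the higher-FN-penalty setting discussed above):

```python
import torch

def tversky_loss(pred, target, alpha=0.3, beta=0.7, smooth=1.0):
    """Tversky loss sketch: generalizes Dice with separate FP/FN weights.
    beta > alpha penalizes false negatives more, which helps tiny
    foreground objects (e.g. small tumors) avoid being missed entirely.
    With alpha = beta = 0.5 this reduces to Dice loss.

    pred:   (B, 1, H, W) probabilities (after sigmoid)
    target: (B, 1, H, W) binary mask
    """
    pred, target = pred.flatten(1), target.flatten(1)
    tp = (pred * target).sum(dim=1)          # soft true positives
    fp = (pred * (1 - target)).sum(dim=1)    # soft false positives
    fn = ((1 - pred) * target).sum(dim=1)    # soft false negatives
    tversky = (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)
    return 1.0 - tversky.mean()
```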
nnU-Net approach: The winning framework for medical segmentation automatically selects loss (Dice + CE), patch size based on object size, and architecture based on dataset properties. It demonstrates that good engineering practices often matter more than novel architectures.
Q10: Explain SegFormer and how transformers are applied to segmentation.
SegFormer (2021, Xie et al.) is a transformer-based segmentation architecture that achieves state-of-the-art results with a simple and efficient design.
Architecture:
- Hierarchical Transformer Encoder (Mix Transformer - MiT): Produces multi-scale features at 1/4, 1/8, 1/16, 1/32 resolution (like a CNN backbone). Uses overlapping patch embeddings (unlike ViT's non-overlapping patches) and efficient self-attention with spatial reduction (reduces K and V spatial dimensions).
- Lightweight All-MLP Decoder: Takes multi-scale features from the encoder, projects each to a common channel dimension with a linear layer, upsamples all to 1/4 resolution, concatenates, and applies a final linear classifier. No convolutions, no complex decoder.
Key innovations:
- Efficient self-attention: Reduces the key/value spatial dimensions by a factor R (e.g., R=64 for early layers). Reduces complexity from O(N^2) to O(N^2/R).
- No positional encoding: Uses 3x3 depthwise convolutions in the FFN instead of positional encodings, which provides positional information while being flexible to different input sizes.
- Simple decoder: Proves that a powerful encoder makes an expensive decoder unnecessary. The all-MLP decoder has minimal parameters.
Results: SegFormer-B5 achieves 82.4 mIoU on ADE20K. SegFormer-B0 achieves 76.2 mIoU with only 3.8M parameters, making it suitable for edge deployment.
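The efficient self-attention idea can be sketched as follows (a simplified stand-in built on `nn.MultiheadAttention`, not SegFormer's exact module; here a strided conv shrinks K and V spatially, cutting the key/value sequence length by the stride squared):

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Sketch of SegFormer-style spatial-reduction attention: queries come
    from the full-resolution feature map, while keys/values come from a
    conv-downsampled copy, so attention cost drops roughly by sr^2."""
    def __init__(self, dim=64, heads=2, sr=4):
        super().__init__()
        self.reduce = nn.Conv2d(dim, dim, kernel_size=sr, stride=sr)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)     # (B, H*W, C) full-length queries
        kv = self.reduce(x).flatten(2).transpose(1, 2)  # shorter K/V sequence
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, C, H, W)
```

The output keeps the input's spatial resolution; only the attention's inner sequence is shortened, which is what makes high-resolution early stages affordable.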