Practice Questions & Tips
This final lesson brings everything together with rapid-fire questions to test your knowledge, coding challenges to practice, and strategic tips from successful CV interview candidates.
Rapid-Fire Questions
Time yourself: try to answer each in under 60 seconds. These test breadth of knowledge and quick recall — both critical for phone screens and early interview rounds.
| # | Question | Expected Answer (1–2 sentences) |
|---|---|---|
| 1 | What does a 3x3 convolution with stride 2 do to the spatial dimensions? | It halves the spatial dimensions (with appropriate padding). Output size = floor((input + 2*padding - kernel) / stride + 1). For 224x224 input: output is 112x112. |
| 2 | What is the receptive field of a neuron? | The region in the input image that affects that neuron's activation. It grows with network depth: each 3x3 conv layer adds 2 pixels to the receptive field. A 5-layer network of 3x3 convs has a receptive field of 11x11. |
| 3 | Why does ResNet use a bottleneck (1x1 → 3x3 → 1x1)? | The first 1x1 reduces channels (e.g., 256 to 64), the 3x3 operates on fewer channels (cheap), and the second 1x1 expands back. This gives the same representational power with ~4x fewer FLOPs than using two 3x3 convs on the full channel count. |
| 4 | What is the difference between IoU and Dice score? | IoU = intersection/union. Dice = 2*intersection/(sum of areas). Dice = 2*IoU/(1+IoU), so Dice ≥ IoU for the same prediction (equal only at 0 and 1). A Dice of 0.90 corresponds to an IoU of ~0.818. |
| 5 | What is NMS and why is it needed? | Non-Maximum Suppression removes duplicate detections by keeping the highest-confidence box and suppressing overlapping boxes (IoU > threshold). Needed because detectors produce multiple predictions for the same object from nearby anchor positions. |
| 6 | What is the difference between semantic and instance segmentation? | Semantic: assigns a class label to every pixel but does not distinguish between instances (all people get the same label). Instance: detects each object and provides a separate mask per instance (person 1, person 2). Instance only covers "things," not "stuff." |
| 7 | What makes depthwise separable convolutions efficient? | Factoring a standard conv into depthwise (one filter per channel, spatial) + pointwise (1x1, channel mixing). Reduces FLOPs by ~K^2 (9x for 3x3 kernels) with minimal accuracy loss. |
| 8 | How does FPN help detect small objects? | FPN creates high-resolution feature maps (1/4 or 1/8 scale) enriched with high-level semantic information via top-down connections. Small objects are detected on these fine-grained features that still "know" what they are looking at. |
| 9 | What is anchor-free detection? | Predicting objects without predefined anchor boxes. Instead, directly predict center points (CenterNet), per-pixel distances to box edges (FCOS), or object queries (DETR). Simpler, no anchor hyperparameters to tune. Used in YOLOv8. |
| 10 | Name three ways to speed up inference for a CV model. | 1) Quantization (FP32 → INT8, ~4x speedup). 2) TensorRT/ONNX Runtime graph optimization (operator fusion, kernel tuning, ~2-5x). 3) Reduce input resolution (640 → 320, ~4x speedup). Also: pruning, knowledge distillation, smaller architecture. |
| 11 | What is RoI Align and why is it better than RoI Pooling? | RoI Align uses bilinear interpolation to sample feature values at exact floating-point positions, avoiding the quantization artifacts of RoI Pooling. This gives ~2-3 mAP improvement for instance segmentation where pixel-level alignment matters. |
| 12 | What is CIoU loss? | Complete IoU loss adds center distance penalty and aspect ratio consistency to standard IoU loss: L = 1 - IoU + distance_penalty + aspect_ratio_penalty. Converges faster and produces better-localized boxes than L1 or vanilla IoU loss. |
| 13 | How does Vision Transformer (ViT) process images? | Splits image into fixed-size patches (16x16), flattens and linearly projects each patch into a token, adds positional embeddings, prepends a [CLS] token, and processes through standard transformer encoder layers. Classification uses the [CLS] output. |
| 14 | What is Focal Loss? | FL = -(1-p_t)^gamma * log(p_t). Down-weights easy/well-classified examples (where p_t is high) and focuses on hard examples. Gamma=2 is standard. Designed for one-stage detectors to handle extreme foreground-background class imbalance. |
| 15 | What is the difference between GANs and diffusion models for image generation? | GANs: adversarial training (generator vs discriminator), single-step generation, fast inference but unstable training and mode collapse. Diffusion: iterative denoising, stable training, better diversity and quality, but slower inference (20-50 steps). Diffusion models dominate since 2022. |
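Several of the answers above (questions 4, 5, and 11) hinge on IoU and NMS, and question 10 on this lesson's coding expectations. A minimal NumPy sketch of both, assuming boxes in (x1, y1, x2, y2) format (the function names `iou` and `nms` are illustrative, not a library API):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, suppress overlapping ones."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending confidence
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        if len(order) == 1:
            break
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # drop near-duplicates
    return keep

# Two near-duplicate boxes plus one distant box
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the two overlapping boxes collapse to one
```

This is the class-agnostic form; in practice NMS is run per class, or box coordinates are offset by class index so boxes of different classes never suppress each other.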
Coding Challenges
These are coding tasks of the kind you might encounter in a CV interview. Practice implementing them without referring to documentation.
Challenge 1: Implement a Convolution Operation from Scratch
```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Implement 2D convolution from scratch.

    Note: like most deep learning frameworks, this computes
    cross-correlation (the kernel is not flipped).

    Args:
        image: (H, W) numpy array
        kernel: (kH, kW) numpy array
        stride: step size
        padding: zero-padding around the image
    Returns:
        (out_H, out_W) convolved output
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant', constant_values=0)

    H, W = image.shape
    kH, kW = kernel.shape
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1

    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            # Extract the receptive field
            region = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Element-wise multiply and sum
            output[i, j] = np.sum(region * kernel)
    return output

# Test with edge detection kernel
image = np.random.rand(8, 8)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
edges = conv2d(image, sobel_x, stride=1, padding=1)
print(f"Input: {image.shape}, Output: {edges.shape}")  # Both (8, 8)
```
Challenge 2: Implement a Custom Dataset with Augmentation in PyTorch
```python
import torch
from torch.utils.data import Dataset
from torchvision import transforms as T
from PIL import Image
import os

class CustomImageDataset(Dataset):
    def __init__(self, root_dir, split="train"):
        self.root_dir = root_dir
        self.split = split

        # Collect image paths and labels
        self.samples = []
        self.class_to_idx = {}
        for idx, class_name in enumerate(sorted(os.listdir(root_dir))):
            class_dir = os.path.join(root_dir, class_name)
            if not os.path.isdir(class_dir):
                continue
            self.class_to_idx[class_name] = idx
            for img_name in os.listdir(class_dir):
                if img_name.lower().endswith(('.jpg', '.jpeg', '.png')):
                    self.samples.append((
                        os.path.join(class_dir, img_name), idx
                    ))

        # Different transforms for train vs val
        if split == "train":
            self.transform = T.Compose([
                T.RandomResizedCrop(224, scale=(0.08, 1.0)),
                T.RandomHorizontalFlip(),
                T.ColorJitter(0.4, 0.4, 0.4, 0.1),
                T.RandomGrayscale(p=0.1),
                T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
                T.ToTensor(),
                T.Normalize([0.485, 0.456, 0.406],
                            [0.229, 0.224, 0.225]),
                T.RandomErasing(p=0.25),  # operates on tensors, so after ToTensor
            ])
        else:
            self.transform = T.Compose([
                T.Resize(256),
                T.CenterCrop(224),
                T.ToTensor(),
                T.Normalize([0.485, 0.456, 0.406],
                            [0.229, 0.224, 0.225]),
            ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        image = Image.open(path).convert("RGB")
        image = self.transform(image)
        return image, label

# Usage
train_ds = CustomImageDataset("/data/images", split="train")
val_ds = CustomImageDataset("/data/images", split="val")
print(f"Train: {len(train_ds)} images, {len(train_ds.class_to_idx)} classes")
```
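To feed such a dataset into training, wrap it in a `DataLoader`. A minimal sketch, using a synthetic `TensorDataset` as a stand-in so it runs without image files on disk (batch size and worker count are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for CustomImageDataset: 100 fake 3x224x224 images, 10 classes
images = torch.randn(100, 3, 224, 224)
labels = torch.randint(0, 10, (100,))
train_ds = TensorDataset(images, labels)

train_loader = DataLoader(
    train_ds,
    batch_size=32,
    shuffle=True,     # reshuffle each epoch for training
    num_workers=0,    # >0 in practice for parallel decoding; 0 keeps this portable
    drop_last=True,   # stable batch size, which BatchNorm prefers
)

x, y = next(iter(train_loader))
print(x.shape, y.shape)  # torch.Size([32, 3, 224, 224]) torch.Size([32])
```

For a validation loader you would set `shuffle=False` and `drop_last=False` so every sample is evaluated exactly once.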
Challenge 3: Implement mAP Calculation
```python
import numpy as np

def compute_single_iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def compute_ap(recalls, precisions):
    """Compute Average Precision using all-point interpolation."""
    # Add sentinel values
    recalls = np.concatenate(([0.0], recalls, [1.0]))
    precisions = np.concatenate(([1.0], precisions, [0.0]))
    # Make precision monotonically decreasing (right to left)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Find points where recall changes
    change_points = np.where(recalls[1:] != recalls[:-1])[0] + 1
    # Sum (delta_recall * precision)
    ap = np.sum((recalls[change_points] - recalls[change_points - 1]) *
                precisions[change_points])
    return ap

def compute_map(all_detections, all_ground_truths, iou_threshold=0.5):
    """Compute mAP across all classes.

    Args:
        all_detections: dict of {class: [(image_id, confidence, box), ...]}
        all_ground_truths: dict of {class: {image_id: [box, ...], ...}}
    Returns:
        mAP value
    """
    aps = []
    for cls in all_ground_truths:
        dets = sorted(all_detections.get(cls, []),
                      key=lambda x: x[1], reverse=True)
        # Count total ground truths for this class
        n_gt = sum(len(boxes) for boxes in all_ground_truths[cls].values())
        if n_gt == 0:
            continue
        # Track which GTs have been matched
        matched = {img_id: [False] * len(boxes)
                   for img_id, boxes in all_ground_truths[cls].items()}
        tp = np.zeros(len(dets))
        fp = np.zeros(len(dets))
        for i, (img_id, conf, det_box) in enumerate(dets):
            gt_boxes = all_ground_truths[cls].get(img_id, [])
            best_iou = 0
            best_gt_idx = -1
            for j, gt_box in enumerate(gt_boxes):
                iou = compute_single_iou(det_box, gt_box)
                if iou > best_iou:
                    best_iou = iou
                    best_gt_idx = j
            # A detection is a TP only if it matches an unclaimed GT
            if (best_gt_idx >= 0 and best_iou >= iou_threshold
                    and not matched[img_id][best_gt_idx]):
                tp[i] = 1
                matched[img_id][best_gt_idx] = True
            else:
                fp[i] = 1
        tp_cumsum = np.cumsum(tp)
        fp_cumsum = np.cumsum(fp)
        recalls = tp_cumsum / n_gt
        precisions = tp_cumsum / (tp_cumsum + fp_cumsum)
        ap = compute_ap(recalls, precisions)
        aps.append(ap)
    return np.mean(aps) if aps else 0.0
```
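A quick way to sanity-check the AP logic is a tiny example you can also work by hand: three detections sorted by confidence (TP, FP, TP) against two ground truths. `compute_ap` is repeated here so the snippet runs standalone:

```python
import numpy as np

def compute_ap(recalls, precisions):
    """All-point interpolated AP (same logic as in the challenge above)."""
    recalls = np.concatenate(([0.0], recalls, [1.0]))
    precisions = np.concatenate(([1.0], precisions, [0.0]))
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    cp = np.where(recalls[1:] != recalls[:-1])[0] + 1
    return np.sum((recalls[cp] - recalls[cp - 1]) * precisions[cp])

# 3 detections sorted by confidence: TP, FP, TP; 2 ground truths total
tp = np.array([1, 0, 1])
fp = np.array([0, 1, 0])
tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
recalls = tp_cum / 2                     # [0.5, 0.5, 1.0]
precisions = tp_cum / (tp_cum + fp_cum)  # [1.0, 0.5, 0.667]
print(compute_ap(recalls, precisions))   # 0.833... = 0.5*1.0 + 0.5*(2/3)
```

By hand: the first recall jump (0 → 0.5) is credited at interpolated precision 1.0, the second (0.5 → 1.0) at 2/3, giving AP = 5/6 ≈ 0.833.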
Interview Strategy Tips
Draw Architecture Diagrams
For any architecture question (ResNet, YOLO, U-Net, Faster R-CNN), draw the diagram first. Show the data flow, skip connections, feature map dimensions, and where each loss is applied. A clear diagram communicates more than 10 minutes of verbal explanation.
Know Your Numbers
Memorize key benchmarks: ResNet-50 = 76% ImageNet top-1, 25.6M params. YOLOv8-n = 37.3 COCO mAP at 80 FPS. ViT-B = 81.8% ImageNet. U-Net dice scores for common medical tasks. Being able to cite numbers shows deep familiarity.
Discuss Trade-offs First
When asked "which model should I use?", never give one answer. Ask about latency constraints, accuracy requirements, hardware targets, and dataset size. Then present 2–3 options with trade-offs. This is the hallmark of senior-level thinking.
Show Production Awareness
Always connect model design to production reality. When discussing an architecture, mention: inference latency, model size, quantization compatibility, edge deployment feasibility, and monitoring strategy. This separates you from academic-only candidates.
Prepare Your Project Deep Dives
For every project you mention, be ready to go 3 levels deep. "I trained YOLOv8" leads to: "What input resolution? What augmentation? How did you handle small objects? What was the mAP? How did you deploy? What was P95 latency?" Have specific numbers and decisions ready.
Code Without Documentation
Practice writing PyTorch code from memory: a training loop, custom dataset, data augmentation pipeline, NMS, IoU computation, and basic architectures. Interviewers expect you to write these fluently without Googling. Time yourself: each should take under 10 minutes.
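Of the items listed, the training loop is the one most often requested verbatim. A minimal sketch, using a toy model and random tensors as stand-ins for a real architecture and dataset (all names and hyperparameters here are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: a tiny conv net and random data
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))),
    batch_size=16, shuffle=True,
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for epoch in range(2):
    total_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()          # clear stale gradients
        loss = criterion(model(x), y)  # forward pass + loss
        loss.backward()                # backprop
        optimizer.step()               # update weights
        total_loss += loss.item()
    print(f"epoch {epoch}: loss {total_loss / len(loader):.3f}")
```

In an interview, mention what a production version adds on top: a validation loop under `torch.no_grad()` with `model.eval()`, an LR scheduler, mixed precision, and checkpointing.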
Frequently Asked Questions
How many hours should I prepare for a CV interview?
Plan for 40–60 hours spread over 3–4 weeks. Break it down: 15 hours on CNN architectures and classification, 15 hours on detection/segmentation, 10 hours on advanced topics (ViT, self-supervised, GANs), and 10 hours on coding practice and mock interviews. If you work with CV daily, focus on gaps: most practitioners are weak on recent advances (ViT, diffusion models, SAM) and production deployment (TensorRT, quantization).
Do I need to know both PyTorch and TensorFlow?
Focus on PyTorch. It is the dominant framework for CV research and increasingly for production (via TorchScript, TensorRT, and ONNX export). Most interview questions and code exercises assume PyTorch. Knowing torchvision (models, transforms, datasets) and the Ultralytics YOLOv8 API is expected. TensorFlow knowledge is a bonus but rarely required unless the job description specifically mentions it.
Should I focus on classical CV (OpenCV) or deep learning?
Spend 80% of your time on deep learning (CNN architectures, detection, segmentation, vision transformers) and 20% on classical concepts (convolution, edge detection, image filtering, morphological operations). Classical CV tests your fundamentals and is still used in preprocessing pipelines. But the majority of interview time will be on deep learning approaches, especially at companies building ML-first products.
What if I am asked about a model or paper I do not know?
Be honest: "I have not read that specific paper, but based on the name, it likely addresses [problem X]." Then pivot to what you know: "In related work, I am familiar with [model Y] which solves a similar problem by..." Interviewers respect honesty and first-principles reasoning. The worst thing you can do is fake knowledge — experienced interviewers will detect it immediately and it destroys trust.
How important is math for CV interviews?
You need to understand the math behind key operations: convolution (element-wise multiply and sum), backpropagation through conv layers, IoU computation, attention mechanism (Q*K^T / sqrt(d)), loss functions (cross-entropy, Dice, focal), and evaluation metrics (mAP, mIoU). You do not need to derive everything from scratch, but you should be able to explain the intuition and compute simple examples by hand (e.g., the output of a 3x3 convolution on a 5x5 input).
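For the hand-computation example mentioned above: a 3x3 kernel slid over a 5x5 input with stride 1 and no padding gives a 3x3 output, since (5 - 3)/1 + 1 = 3. A quick NumPy check of the arithmetic (the all-ones kernel is chosen so each output is just a patch sum):

```python
import numpy as np

x = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input: 0..24
k = np.ones((3, 3))                           # 3x3 kernel of ones

out = np.zeros((3, 3))                        # (5 - 3) // 1 + 1 = 3
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(x[i:i+3, j:j+3] * k)

print(out.shape)  # (3, 3)
print(out[0, 0])  # top-left patch sum: 0+1+2+5+6+7+10+11+12 = 54.0
```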
What is the most common mistake in CV system design interviews?
The most common mistake is jumping to a model architecture without discussing the full system. Interviewers want to see: (1) data collection and labeling strategy, (2) data pipeline and preprocessing, (3) model selection with justification, (4) training infrastructure and experiment tracking, (5) evaluation methodology (offline + online), (6) deployment architecture (edge vs cloud, latency budget), (7) monitoring and retraining strategy. Many candidates skip steps 1, 6, and 7, which are exactly what production engineers care most about.
How do I prepare for the behavioral round in a CV position?
Prepare 5–6 STAR stories from CV projects. Each story should include: (1) Situation: "Our defect detection model had 73% recall, causing 27% of defects to reach customers." (2) Task: "I needed to improve recall to 95%+ without significantly increasing false positives." (3) Action: "I redesigned the data pipeline to include copy-paste augmentation for rare defect types, switched from ResNet to EfficientNet with focal loss, and implemented active learning to prioritize labeling edge cases." (4) Result: "Achieved 97.2% recall with 94% precision. Reduced escaped defects by 89%, saving $2.3M annually." Always quantify impact.
Should I mention limitations and risks in my answers?
Absolutely. Discussing limitations shows maturity and real-world experience. When discussing object detection, mention failure modes (small objects, occlusion, adversarial examples). When discussing deployment, mention model drift, calibration degradation, and edge case handling. When discussing data augmentation, mention when certain augmentations are harmful. Interviewers specifically look for candidates who understand what can go wrong, not just the happy path. This is especially important for safety-critical applications (autonomous driving, medical imaging).
Final Checklist
- Explain convolution, pooling, batch normalization, and skip connections from first principles
- Trace the architecture evolution from AlexNet to ResNet to EfficientNet to ConvNeXt
- Compare YOLO vs Faster R-CNN vs DETR: architectures, speed-accuracy trade-offs, and use cases
- Explain U-Net, Mask R-CNN, and the difference between semantic/instance/panoptic segmentation
- Discuss ViT, DINO, CLIP, and SAM: how they work and when to use them
- Implement NMS, IoU computation, a training loop, and a custom dataset in PyTorch from memory
- Design a production CV pipeline: data labeling, training, evaluation, deployment, monitoring
- Explain model quantization (PTQ vs QAT), TensorRT optimization, and edge deployment
- Discuss 3 recent developments: vision transformers, diffusion models, foundation models (SAM, DINOv2)
- Tell 3 project stories with specific metrics, decisions, and trade-offs
- Handle data augmentation design for your specific domain (medical, autonomous driving, etc.)
- Reason about trade-offs: accuracy vs latency, edge vs cloud, fine-tuning vs training from scratch
Lilly Tech Systems