Practice Questions & Tips

This final lesson brings everything together with rapid-fire questions to test your knowledge, coding challenges to practice, and strategic tips from successful CV interview candidates.

Rapid-Fire Questions

Time yourself: try to answer each in under 60 seconds. These test breadth of knowledge and quick recall — both critical for phone screens and early interview rounds.

Each answer should take only 1–2 sentences.

1. What does a 3x3 convolution with stride 2 do to the spatial dimensions?
   It halves them (with appropriate padding). Output size = floor((input + 2*padding - kernel) / stride) + 1. For a 224x224 input with padding 1, the output is 112x112.

2. What is the receptive field of a neuron?
   The region of the input image that affects that neuron's activation. It grows with network depth: each stride-1 3x3 conv layer adds 2 pixels. A 5-layer stack of 3x3 convs has a receptive field of 11x11.

3. Why does ResNet use a bottleneck (1x1 → 3x3 → 1x1)?
   The first 1x1 reduces channels (e.g., 256 to 64), the 3x3 operates on fewer channels (cheap), and the second 1x1 expands back. This gives comparable representational power with roughly 4x fewer FLOPs than two 3x3 convs on the full channel count.

4. What is the difference between IoU and Dice score?
   IoU = intersection/union. Dice = 2*intersection/(sum of areas) = 2*IoU/(1+IoU). Dice is always at least as high as IoU (equal only at 0 and 1); a Dice of 0.90 corresponds to an IoU of about 0.818.

5. What is NMS and why is it needed?
   Non-Maximum Suppression removes duplicate detections by keeping the highest-confidence box and suppressing overlapping boxes (IoU > threshold). It is needed because detectors produce multiple predictions for the same object from nearby anchor positions.

6. What is the difference between semantic and instance segmentation?
   Semantic: assigns a class label to every pixel but does not distinguish between instances (all people get the same label). Instance: detects each object and provides a separate mask per instance (person 1, person 2). Instance segmentation only covers "things," not "stuff."

7. What makes depthwise separable convolutions efficient?
   They factor a standard conv into a depthwise conv (one filter per channel, spatial) plus a pointwise 1x1 conv (channel mixing). This reduces FLOPs by roughly K^2 (about 9x for 3x3 kernels) with minimal accuracy loss.

8. How does FPN help detect small objects?
   FPN creates high-resolution feature maps (1/4 or 1/8 scale) enriched with high-level semantic information via top-down connections. Small objects are detected on these fine-grained features that still "know" what they are looking at.

9. What is anchor-free detection?
   Predicting objects without predefined anchor boxes: directly predict center points (CenterNet), per-pixel distances to box edges (FCOS), or object queries (DETR). It is simpler, with no anchor hyperparameters to tune, and is used in YOLOv8.

10. Name three ways to speed up inference for a CV model.
    1) Quantization (FP32 → INT8, ~4x speedup). 2) TensorRT/ONNX Runtime graph optimization (operator fusion, kernel tuning, ~2-5x). 3) Reduced input resolution (640 → 320, ~4x). Also: pruning, knowledge distillation, a smaller architecture.

11. What is RoI Align and why is it better than RoI Pooling?
    RoI Align uses bilinear interpolation to sample feature values at exact floating-point positions, avoiding the quantization artifacts of RoI Pooling. This gives roughly 2-3 mAP improvement for instance segmentation, where pixel-level alignment matters.

12. What is CIoU loss?
    Complete IoU loss adds a center-distance penalty and an aspect-ratio consistency term to the standard IoU loss: L = 1 - IoU + distance_penalty + aspect_ratio_penalty. It converges faster and produces better-localized boxes than L1 or vanilla IoU loss.

13. How does a Vision Transformer (ViT) process images?
    It splits the image into fixed-size patches (16x16), flattens and linearly projects each patch into a token, adds positional embeddings, prepends a [CLS] token, and processes the sequence through standard transformer encoder layers. Classification uses the [CLS] output.

14. What is Focal Loss?
    FL = -(1 - p_t)^gamma * log(p_t). It down-weights easy, well-classified examples (where p_t is high) and focuses training on hard examples. Gamma = 2 is standard. It was designed for one-stage detectors to handle extreme foreground-background class imbalance.

15. What is the difference between GANs and diffusion models for image generation?
    GANs: adversarial training (generator vs discriminator), single-step generation, fast inference but unstable training and mode collapse. Diffusion: iterative denoising, stable training, better diversity and quality, but slower inference (20-50 steps). Diffusion models have dominated since 2022.
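Several of these answers (NMS, IoU) double as live-coding prompts. A minimal NumPy sketch of greedy NMS, assuming boxes in [x1, y1, x2, y2] format and a list of kept indices as output:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) [x1, y1, x2, y2]; scores: (N,)."""
    order = scores.argsort()[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        # IoU of the top-scoring box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] — box 1 overlaps box 0 heavily and is suppressed
```

Box 1 has IoU ≈ 0.68 with box 0, so it is suppressed at the default 0.5 threshold; box 2 is disjoint and survives.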

Coding Challenges

These are actual coding tasks you might encounter in a CV interview. Practice implementing them without referring to documentation.

Challenge 1: Implement a Convolution Operation from Scratch

import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Implement 2D convolution from scratch.

    Note: like deep learning frameworks, this computes cross-correlation
    (the kernel is not flipped).

    Args:
        image: (H, W) numpy array
        kernel: (kH, kW) numpy array
        stride: step size
        padding: zero-padding added around the image

    Returns:
        (out_H, out_W) numpy array with the convolved output
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant', constant_values=0)

    H, W = image.shape
    kH, kW = kernel.shape
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1

    output = np.zeros((out_H, out_W))

    for i in range(out_H):
        for j in range(out_W):
            # Extract the receptive field
            region = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Element-wise multiply and sum
            output[i, j] = np.sum(region * kernel)

    return output


# Test with edge detection kernel
image = np.random.rand(8, 8)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
edges = conv2d(image, sobel_x, stride=1, padding=1)
print(f"Input: {image.shape}, Output: {edges.shape}")  # Both (8, 8)
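The loop version is usually what interviewers want first; a common follow-up is "how would you speed this up?". One possible answer is vectorizing with NumPy's `sliding_window_view` (a sketch, same cross-correlation semantics as above):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_vec(image, kernel, stride=1, padding=0):
    """Vectorized 2D cross-correlation via sliding windows."""
    if padding > 0:
        image = np.pad(image, padding)
    kH, kW = kernel.shape
    # (out_H, out_W, kH, kW) view of every receptive field — no data copy
    windows = sliding_window_view(image, (kH, kW))[::stride, ::stride]
    # Multiply each window by the kernel and sum over the window dims
    return np.einsum('ijkl,kl->ij', windows, kernel)

# Quick check: 2x2 summing kernel over a 3x3 image of ones -> 2x2 of 4s
out = conv2d_vec(np.ones((3, 3)), np.ones((2, 2)))
print(out.shape)  # (2, 2)
```

Mentioning that production frameworks lower convolution to matrix multiplies (im2col + GEMM) or specialized kernels is a good way to round out this answer.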

Challenge 2: Implement a Custom Dataset with Augmentation in PyTorch

import torch
from torch.utils.data import Dataset
from torchvision import transforms as T
from PIL import Image
import os

class CustomImageDataset(Dataset):
    def __init__(self, root_dir, split="train"):
        self.root_dir = root_dir
        self.split = split

        # Collect image paths and labels
        self.samples = []
        self.class_to_idx = {}
        for idx, class_name in enumerate(sorted(os.listdir(root_dir))):
            class_dir = os.path.join(root_dir, class_name)
            if not os.path.isdir(class_dir):
                continue
            self.class_to_idx[class_name] = idx
            for img_name in os.listdir(class_dir):
                if img_name.lower().endswith(('.jpg', '.jpeg', '.png')):
                    self.samples.append((
                        os.path.join(class_dir, img_name), idx
                    ))

        # Different transforms for train vs val
        if split == "train":
            self.transform = T.Compose([
                T.RandomResizedCrop(224, scale=(0.08, 1.0)),
                T.RandomHorizontalFlip(),
                T.ColorJitter(0.4, 0.4, 0.4, 0.1),
                T.RandomGrayscale(p=0.1),
                T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
                T.ToTensor(),
                T.Normalize([0.485, 0.456, 0.406],
                            [0.229, 0.224, 0.225]),
                T.RandomErasing(p=0.25),
            ])
        else:
            self.transform = T.Compose([
                T.Resize(256),
                T.CenterCrop(224),
                T.ToTensor(),
                T.Normalize([0.485, 0.456, 0.406],
                            [0.229, 0.224, 0.225]),
            ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        image = Image.open(path).convert("RGB")
        image = self.transform(image)
        return image, label


# Usage
train_ds = CustomImageDataset("/data/images", split="train")
val_ds = CustomImageDataset("/data/images", split="val")
print(f"Train: {len(train_ds)} images, {len(train_ds.class_to_idx)} classes")

Challenge 3: Implement mAP Calculation

import numpy as np

def compute_single_iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format (used by compute_map below)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def compute_ap(recalls, precisions):
    """Compute Average Precision using all-point interpolation."""
    # Add sentinel values
    recalls = np.concatenate(([0.0], recalls, [1.0]))
    precisions = np.concatenate(([1.0], precisions, [0.0]))

    # Make precision monotonically decreasing (right to left)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Find points where recall changes
    change_points = np.where(recalls[1:] != recalls[:-1])[0] + 1

    # Sum (delta_recall * precision)
    ap = np.sum((recalls[change_points] - recalls[change_points - 1]) *
                precisions[change_points])
    return ap

def compute_map(all_detections, all_ground_truths, iou_threshold=0.5):
    """Compute mAP across all classes.

    Args:
        all_detections: dict of {class: [(image_id, confidence, box), ...]}
        all_ground_truths: dict of {class: {image_id: [box, ...], ...}}

    Returns:
        mAP value
    """
    aps = []

    for cls in all_ground_truths:
        dets = sorted(all_detections.get(cls, []),
                       key=lambda x: x[1], reverse=True)

        # Count total ground truths for this class
        n_gt = sum(len(boxes) for boxes in all_ground_truths[cls].values())
        if n_gt == 0:
            continue

        # Track which GTs have been matched
        matched = {img_id: [False] * len(boxes)
                   for img_id, boxes in all_ground_truths[cls].items()}

        tp = np.zeros(len(dets))
        fp = np.zeros(len(dets))

        for i, (img_id, conf, det_box) in enumerate(dets):
            gt_boxes = all_ground_truths[cls].get(img_id, [])
            best_iou = 0
            best_gt_idx = -1

            for j, gt_box in enumerate(gt_boxes):
                iou = compute_single_iou(det_box, gt_box)
                if iou > best_iou:
                    best_iou = iou
                    best_gt_idx = j

            if best_iou >= iou_threshold and not matched[img_id][best_gt_idx]:
                tp[i] = 1
                matched[img_id][best_gt_idx] = True
            else:
                fp[i] = 1

        tp_cumsum = np.cumsum(tp)
        fp_cumsum = np.cumsum(fp)
        recalls = tp_cumsum / n_gt
        precisions = tp_cumsum / (tp_cumsum + fp_cumsum)

        ap = compute_ap(recalls, precisions)
        aps.append(ap)

    return np.mean(aps) if aps else 0.0
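A good habit is to sanity-check the interpolation logic on a tiny hand-computed case. The snippet below restates the AP computation so it runs standalone, then checks it against a 3-detection example (outcomes TP, FP, TP with 2 ground truths) worked out by hand:

```python
import numpy as np

def compute_ap(recalls, precisions):
    """All-point interpolated AP (same logic as above, restated standalone)."""
    recalls = np.concatenate(([0.0], recalls, [1.0]))
    precisions = np.concatenate(([1.0], precisions, [0.0]))
    # Make precision monotonically decreasing, right to left
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    change = np.where(recalls[1:] != recalls[:-1])[0] + 1
    return np.sum((recalls[change] - recalls[change - 1]) * precisions[change])

# Detections sorted by confidence: TP, FP, TP against 2 ground truths.
# Cumulative precision: 1/1, 1/2, 2/3; cumulative recall: 0.5, 0.5, 1.0.
recalls = np.array([0.5, 0.5, 1.0])
precisions = np.array([1.0, 0.5, 2 / 3])
ap = compute_ap(recalls, precisions)
print(round(ap, 4))  # 0.5 * 1.0 + 0.5 * (2/3) = 0.8333
```

Being able to walk through a table like this on a whiteboard (detections sorted by confidence, cumulative TP/FP, precision/recall, then AP) is a common interview ask.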

Interview Strategy Tips

Draw Architecture Diagrams

For any architecture question (ResNet, YOLO, U-Net, Faster R-CNN), draw the diagram first. Show the data flow, skip connections, feature map dimensions, and where each loss is applied. A clear diagram communicates more than 10 minutes of verbal explanation.

Know Your Numbers

Memorize key benchmarks: ResNet-50 ≈ 76% ImageNet top-1 with 25.6M params. YOLOv8-n ≈ 37.3 COCO mAP, with real-time frame rates that depend on hardware. ViT-B/16 ≈ 81.8% ImageNet top-1 with modern training recipes. Typical U-Net Dice scores for common medical tasks. Being able to cite numbers shows deep familiarity.

Discuss Trade-offs First

When asked "which model should I use?", never give one answer. Ask about latency constraints, accuracy requirements, hardware targets, and dataset size. Then present 2–3 options with trade-offs. This is the hallmark of senior-level thinking.

Show Production Awareness

Always connect model design to production reality. When discussing an architecture, mention: inference latency, model size, quantization compatibility, edge deployment feasibility, and monitoring strategy. This separates you from academic-only candidates.

Prepare Your Project Deep Dives

For every project you mention, be ready to go 3 levels deep. "I trained YOLOv8" leads to: "What input resolution? What augmentation? How did you handle small objects? What was the mAP? How did you deploy? What was P95 latency?" Have specific numbers and decisions ready.

Code Without Documentation

Practice writing PyTorch code from memory: a training loop, custom dataset, data augmentation pipeline, NMS, IoU computation, and basic architectures. Interviewers expect you to write these fluently without Googling. Time yourself: each should take under 10 minutes.

Frequently Asked Questions

How many hours should I prepare for a CV interview?

Plan for 40–60 hours spread over 3–4 weeks. Break it down: 15 hours on CNN architectures and classification, 15 hours on detection/segmentation, 10 hours on advanced topics (ViT, self-supervised, GANs), and 10 hours on coding practice and mock interviews. If you work with CV daily, focus on gaps: most practitioners are weak on recent advances (ViT, diffusion models, SAM) and production deployment (TensorRT, quantization).

Do I need to know both PyTorch and TensorFlow?

Focus on PyTorch. It is the dominant framework for CV research and increasingly for production (via TorchScript, TensorRT, and ONNX export). Most interview questions and code exercises assume PyTorch. Knowing torchvision (models, transforms, datasets) and the Ultralytics YOLOv8 API is expected. TensorFlow knowledge is a bonus but rarely required unless the job description specifically mentions it.

Should I focus on classical CV (OpenCV) or deep learning?

Spend 80% of your time on deep learning (CNN architectures, detection, segmentation, vision transformers) and 20% on classical concepts (convolution, edge detection, image filtering, morphological operations). Classical CV tests your fundamentals and is still used in preprocessing pipelines. But the majority of interview time will be on deep learning approaches, especially at companies building ML-first products.

What if I am asked about a model or paper I do not know?

Be honest: "I have not read that specific paper, but based on the name, it likely addresses [problem X]." Then pivot to what you know: "In related work, I am familiar with [model Y] which solves a similar problem by..." Interviewers respect honesty and first-principles reasoning. The worst thing you can do is fake knowledge — experienced interviewers will detect it immediately and it destroys trust.

How important is math for CV interviews?

You need to understand the math behind key operations: convolution (element-wise multiply and sum), backpropagation through conv layers, IoU computation, scaled dot-product attention (softmax(Q·K^T / sqrt(d_k))·V), loss functions (cross-entropy, Dice, focal), and evaluation metrics (mAP, mIoU). You do not need to derive everything from scratch, but you should be able to explain the intuition and compute simple examples by hand (e.g., the output of a 3x3 convolution on a 5x5 input).
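As practice for the hand computation, here is a worked 3x3 convolution on a 5x5 input (no padding, stride 1; the input values are arbitrary, chosen for illustration):

```python
import numpy as np

x = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input: 0..24
k = np.ones((3, 3))                           # 3x3 summing kernel

# Output size: (5 - 3) // 1 + 1 = 3, so the result is 3x3
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = (x[i:i+3, j:j+3] * k).sum()

# Top-left value by hand: (0+1+2) + (5+6+7) + (10+11+12) = 54
print(out.shape, out[0, 0])  # (3, 3) 54.0
```

Doing one of these on paper, including the output-size formula, is exactly the kind of check an interviewer may ask you to perform live.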

What is the most common mistake in CV system design interviews?

The most common mistake is jumping to a model architecture without discussing the full system. Interviewers want to see: (1) data collection and labeling strategy, (2) data pipeline and preprocessing, (3) model selection with justification, (4) training infrastructure and experiment tracking, (5) evaluation methodology (offline + online), (6) deployment architecture (edge vs cloud, latency budget), (7) monitoring and retraining strategy. Many candidates skip steps 1, 6, and 7, which are exactly what production engineers care most about.

How do I prepare for the behavioral round in a CV position?

Prepare 5–6 STAR stories from CV projects. Each story should include: (1) Situation: "Our defect detection model had 73% recall, causing 27% of defects to reach customers." (2) Task: "I needed to improve recall to 95%+ without significantly increasing false positives." (3) Action: "I redesigned the data pipeline to include copy-paste augmentation for rare defect types, switched from ResNet to EfficientNet with focal loss, and implemented active learning to prioritize labeling edge cases." (4) Result: "Achieved 97.2% recall with 94% precision. Reduced escaped defects by 89%, saving $2.3M annually." Always quantify impact.

Should I mention limitations and risks in my answers?

Absolutely. Discussing limitations shows maturity and real-world experience. When discussing object detection, mention failure modes (small objects, occlusion, adversarial examples). When discussing deployment, mention model drift, calibration degradation, and edge case handling. When discussing data augmentation, mention when certain augmentations are harmful. Interviewers specifically look for candidates who understand what can go wrong, not just the happy path. This is especially important for safety-critical applications (autonomous driving, medical imaging).

Final Checklist

Before your interview, make sure you can:
  • Explain convolution, pooling, batch normalization, and skip connections from first principles
  • Trace the architecture evolution from AlexNet to ResNet to EfficientNet to ConvNeXt
  • Compare YOLO vs Faster R-CNN vs DETR: architectures, speed-accuracy trade-offs, and use cases
  • Explain U-Net, Mask R-CNN, and the difference between semantic/instance/panoptic segmentation
  • Discuss ViT, DINO, CLIP, and SAM: how they work and when to use them
  • Implement NMS, IoU computation, a training loop, and a custom dataset in PyTorch from memory
  • Design a production CV pipeline: data labeling, training, evaluation, deployment, monitoring
  • Explain model quantization (PTQ vs QAT), TensorRT optimization, and edge deployment
  • Discuss 3 recent developments: vision transformers, diffusion models, foundation models (SAM, DINOv2)
  • Tell 3 project stories with specific metrics, decisions, and trade-offs
  • Handle data augmentation design for your specific domain (medical, autonomous driving, etc.)
  • Reason about trade-offs: accuracy vs latency, edge vs cloud, fine-tuning vs training from scratch
Good luck with your CV interview! Remember: the goal is not to memorize every answer in this course. It is to understand the concepts deeply enough that you can reason about novel questions from first principles. If you can explain why an architecture works (not just what it does), you will stand out from 90% of candidates.