CV Interview Overview
Computer vision remains one of the most in-demand ML specializations. Whether you are targeting autonomous driving, medical imaging, robotics, or content understanding roles, this lesson maps the interview landscape so you know exactly what to prepare for in 2024–2026.
How CV Interviews Have Evolved
Computer vision interviews have shifted significantly since the rise of foundation models and vision transformers. Here is how expectations have changed.
| Aspect | Classical CV (Pre-2020) | Modern CV (2022–2026) |
|---|---|---|
| Core Knowledge | SIFT, HOG, edge detection, image filtering, SVMs | CNNs, vision transformers (ViT), foundation models (SAM, DINO), diffusion models |
| Model Training | Train from scratch on small labeled datasets | Fine-tune pretrained backbones, self-supervised pretraining, few-shot learning |
| Coding Questions | Implement convolution, edge detector, HOG descriptor | Build data pipeline with augmentation, implement custom loss, use torchvision |
| System Design | Build image search, face recognition pipeline | Design real-time detection system, multi-camera tracking, edge deployment architecture |
| Evaluation | Accuracy, confusion matrix | mAP, IoU, FID, per-class metrics, calibration, robustness to distribution shift |
| Production Skills | OpenCV pipelines, batch processing | TensorRT, ONNX, model quantization, edge deployment, video streaming, MLOps |
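The "implement convolution" staple from the classical era still appears in screens, so it is worth having a from-scratch version ready. A minimal NumPy sketch (valid padding, single channel, no stride), using a Sobel kernel as the worked example:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation (what deep learning frameworks
    call 'convolution'): slide the kernel over the image, no padding."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 3x3 Sobel kernel for horizontal gradients, a classic edge detector
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

img = np.ones((5, 5))
img[:, 3:] = 0.0                # vertical edge between columns 2 and 3
edges = conv2d(img, sobel_x)    # strong response only at the edge
```

In an interview, mention the follow-ups before you are asked: padding modes, stride, multi-channel input, and why frameworks implement this as a matrix multiply (im2col) rather than nested loops.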
CV Role Types and What They Test
Different CV roles emphasize different skill sets. Identify your target role to focus your preparation effectively.
CV Research Scientist
Focus: Novel architectures, training methodology, loss functions, benchmark results. Expect deep questions on attention in vision, self-supervised learning, and paper reproduction.
Companies: Google DeepMind, Meta FAIR, NVIDIA Research, Microsoft Research, Apple MLR
CV/ML Engineer
Focus: Building production CV pipelines. Model training, data augmentation, evaluation, deployment, and monitoring. System design and coding rounds alongside ML theory.
Companies: Tesla, Waymo, Amazon, Apple, Meta, Google, Netflix
Perception Engineer
Focus: Autonomous systems — 3D perception, sensor fusion (camera + LiDAR + radar), tracking, SLAM. Heavy emphasis on real-time performance and safety-critical systems.
Companies: Waymo, Cruise, Aurora, Zoox, Tesla, Motional, Nuro
Applied CV Scientist
Focus: Applying CV to specific domains: medical imaging, satellite imagery, retail, manufacturing inspection. Domain knowledge matters as much as CV expertise.
Companies: Tempus, PathAI, Planet Labs, Amazon Go, Landing AI
Typical Interview Format
Most CV interviews at top companies follow this structure across 4–6 rounds:
| Round | Duration | What They Test | How to Prepare |
|---|---|---|---|
| Phone Screen | 45–60 min | CV fundamentals, basic coding, motivation | Review Lessons 1–2 of this course. Practice explaining CNN architectures in 2–3 minutes. |
| Coding Round | 45–60 min | Implement CV algorithms, data pipelines, use PyTorch/torchvision | Practice implementing data augmentation pipelines, custom datasets, and training loops. |
| ML/CV Deep Dive | 45–60 min | Architecture details, loss functions, training strategies, recent advances | Review Lessons 2–5. Be ready to whiteboard convolution math and detection architectures. |
| System Design | 45–60 min | Design CV systems at scale: real-time detection, video analytics, image search | Practice end-to-end: data pipeline, model serving, latency budgets, edge vs cloud trade-offs. |
| Behavioral | 30–45 min | Past projects, conflict resolution, leadership, handling ambiguity | Prepare 5–6 STAR stories from CV projects. Quantify impact (mAP +12%, latency reduced 60%). |
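The coding round above frequently asks you to write a training loop on the spot. In practice you would use `torch.nn`, but the structure being tested (forward pass, loss, gradient, update) is the same in any framework; here is a framework-free sketch on toy 2-D data standing in for image features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2-D points, label 1 if x0 + x1 > 0 (stand-in for image features)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.5

def forward(X, w, b):
    """Logistic-regression forward pass: linear layer + sigmoid."""
    z = X @ w + b
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for epoch in range(100):
    p = forward(X, w, b)
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    grad_z = (p - y) / len(y)      # dL/dz for sigmoid + binary cross-entropy
    w -= lr * X.T @ grad_z         # backward pass and SGD step
    b -= lr * grad_z.sum()
    losses.append(loss)
```

Being able to narrate each line (where the gradient comes from, why the loss needs the epsilon, what changes with mini-batches) is what distinguishes a practiced candidate from one who has only ever called `model.fit`.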
What Companies Actually Want
Based on interview feedback from top CV teams, here is what separates "hire" from "no hire" candidates:
- Depth on architectures: Can you explain why ResNet uses skip connections from first principles? Not just "it solves vanishing gradients" but the actual gradient flow analysis and how identity mappings help optimization.
- Production mindset: You do not just train models — you think about inference latency, model size, quantization trade-offs, edge deployment constraints, and data pipeline robustness.
- Trade-off reasoning: When asked "YOLO or Faster R-CNN?", you do not give one answer. You ask about latency requirements, accuracy targets, hardware constraints, and use case before recommending an approach.
- Data-centric thinking: You understand that data quality often matters more than model architecture. You can discuss data augmentation strategies, labeling pipelines, handling class imbalance, and active learning.
- Current awareness: You know about vision transformers, SAM, DINOv2, diffusion models, and can discuss when they outperform CNNs and when they do not.
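For the "depth on architectures" point, a toy numeric sketch makes the ResNet gradient-flow argument concrete. This is deliberately simplified (scalar linear layers, not an actual residual network): in a plain chain the input gradient is a product of per-layer derivatives, while an identity skip adds a "+1" to each factor, keeping the product from vanishing:

```python
import numpy as np

# Toy gradient-flow illustration: a 20-layer stack of scalar linear layers
# with small weight w. Plain chain: y_l = w * y_{l-1}, so the end-to-end
# gradient is w**depth. With an identity skip: y_l = y_{l-1} + w * y_{l-1},
# so each factor becomes (1 + w) instead of w.
depth, w = 20, 0.1

plain_grad = w ** depth          # gradient through the plain chain
resid_grad = (1 + w) ** depth    # gradient through the residual chain

# The plain chain's gradient has effectively vanished (1e-20);
# the identity path keeps the residual chain's gradient healthy.
```

In the real network the identity term is a full Jacobian, `d/dx (x + f(x)) = I + f'(x)`, but the intuition the interviewer wants is exactly this: the identity path gives gradients a direct route to early layers regardless of how small `f'(x)` is.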
Preparation Strategy
Here is a structured 3-week plan to prepare for CV interviews using this course:
Week 1: Foundations
Complete Lessons 1–2. Focus on CNN architectures (ResNet, EfficientNet), convolution math, pooling, batch normalization, and transfer learning. Write code for a custom image classifier from scratch using PyTorch.
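Batch normalization is one of the Week 1 topics most often probed at the whiteboard. A minimal sketch of the forward pass over a `(N, C)` batch (training mode only; the inference-time running statistics are a good follow-up to mention):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch-norm forward pass over a (N, C) batch: normalize each
    feature channel to zero mean / unit variance, then scale and shift
    with the learnable parameters gamma and beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))   # shifted, scaled activations
out = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
```

Expect follow-ups on why `eps` is needed, what changes for conv feature maps (normalize per channel over N, H, W), and why batch norm behaves differently at train and inference time.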
Week 2: Detection & Segmentation
Complete Lessons 3–4. Study object detection (YOLO, Faster R-CNN, anchor boxes, NMS, mAP) and segmentation (U-Net, Mask R-CNN, panoptic). Implement NMS from scratch and train a simple detector.
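"Implement NMS from scratch" is worth rehearsing until it is automatic; it also forces you to write IoU, which interviewers often ask for first. A NumPy sketch using `[x1, y1, x2, y2]` box format:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it above the threshold, repeat on the survivors."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the second box overlaps the first and is suppressed
```

Common follow-ups: class-aware vs class-agnostic NMS, soft-NMS, and why anchor-free detectors still need some form of duplicate suppression.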
Week 3: Advanced & Practice
Complete Lessons 5–7. Cover vision transformers, GANs, practical deployment, and rapid-fire questions. Do 2 full mock interviews. Review weak areas and refine your project stories.
Key Takeaways
- Modern CV interviews focus roughly 60% on deep learning (CNNs, ViTs) and 40% on practical deployment and classical foundations
- Know which role type you are targeting — research scientist, CV engineer, perception engineer, or applied scientist
- Companies want architecture depth, production mindset, trade-off reasoning, data-centric thinking, and current awareness
- Follow the 3-week preparation plan: foundations, detection/segmentation, then advanced topics and practice
- Practice whiteboarding architectures and explaining concepts out loud — reading is not enough
Lilly Tech Systems