Introduction to Computer Vision
Computer Vision enables machines to interpret and understand visual information from the world — images, videos, and real-time camera feeds.
What is Computer Vision?
Computer Vision (CV) is a field of artificial intelligence that trains computers to interpret and understand visual information. While humans effortlessly recognize objects, read text, and navigate environments using sight, enabling machines to do the same is an extraordinarily complex challenge.
CV systems take images or videos as input and produce meaningful outputs — classifications, bounding boxes, pixel-level labels, 3D models, or textual descriptions of visual content.
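To make these output types more concrete, here is a minimal sketch of how three of them might be represented in Python. The class names `Classification` and `BoundingBox` and the helper `mask_from_box` are hypothetical, chosen only for illustration; real libraries each define their own formats, and a real segmentation mask would follow the object's outline rather than a rectangle.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    """One label for the whole image, with a confidence score in [0, 1]."""
    label: str
    score: float


@dataclass
class BoundingBox:
    """A labeled rectangle in pixel coordinates (x, y is the top-left corner)."""
    label: str
    x: int
    y: int
    width: int
    height: int


def mask_from_box(box: BoundingBox, image_width: int, image_height: int) -> list[list[int]]:
    """Build a pixel-level label grid: 1 inside the box, 0 elsewhere."""
    return [
        [
            1 if box.x <= col < box.x + box.width and box.y <= row < box.y + box.height else 0
            for col in range(image_width)
        ]
        for row in range(image_height)
    ]


# Hypothetical outputs for a 6x4 image (6 columns, 4 rows) containing one cat
whole_image = Classification(label="cat", score=0.97)
detection = BoundingBox(label="cat", x=1, y=1, width=3, height=2)
mask = mask_from_box(detection, image_width=6, image_height=4)

for row in mask:
    print(row)
```

The three values describe the same image at increasing levels of detail: one label for the whole image, one rectangle per object, and one label per pixel.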
How Computers "See"
To a computer, an image is simply a grid of numbers:
- Pixels: The fundamental unit of an image. In a typical 8-bit image, each pixel stores one intensity value per channel, ranging from 0 (no intensity; black in grayscale) to 255 (full intensity; white in grayscale).
- Channels: Color images have multiple channels. An RGB image has 3 channels (Red, Green, Blue). A grayscale image has 1 channel.
- Resolution: The dimensions of the image in pixels (e.g., 1920x1080 means 1920 columns and 1080 rows).
```python
import numpy as np
from PIL import Image

# Load an image and force 3 channels so the shape is always (H, W, 3)
img = Image.open("photo.jpg").convert("RGB")
pixels = np.array(img)

print(f"Shape: {pixels.shape}")      # (height, width, channels)
print(f"Data type: {pixels.dtype}")  # uint8 (0-255)
print(f"Min: {pixels.min()}, Max: {pixels.max()}")

# A single pixel's RGB values, indexed as [row, column]
# (assumes the image is at least 101 pixels tall and 51 pixels wide)
print(f"Pixel at row 100, column 50: {pixels[100, 50]}")
# e.g., [142 178 210] -- R=142, G=178, B=210
```
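The snippet above needs a `photo.jpg` on disk. As a file-free sketch of the same ideas, the following builds a tiny synthetic image with NumPy; the 4x6 size and the colors are arbitrary choices made for illustration. Note that a resolution written as width x height corresponds to an array shape of (height, width, channels).

```python
import numpy as np

# A synthetic RGB image: 4 rows (height) x 6 columns (width) x 3 channels
height, width = 4, 6
image = np.zeros((height, width, 3), dtype=np.uint8)  # starts all black

# Paint the left half pure red and the right half pure white
image[:, : width // 2] = [255, 0, 0]
image[:, width // 2 :] = [255, 255, 255]

# Indexing is [row, column], i.e. [y, x]
top_left = image[0, 0]      # a red pixel
bottom_right = image[3, 5]  # a white pixel

# Each channel is its own 2D grid of intensities
red_channel = image[:, :, 0]
green_channel = image[:, :, 1]

print(image.shape)            # (4, 6, 3)
print(top_left.tolist())      # [255, 0, 0]
print(bottom_right.tolist())  # [255, 255, 255]
print(red_channel.shape)      # (4, 6)
```

The red channel is 255 everywhere because both red and white contain full red intensity; only the green and blue channels distinguish the two halves.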
Color Spaces
| Color Space | Channels | Use Case |
|---|---|---|
| RGB | Red, Green, Blue | Standard display format, most common |
| HSV | Hue, Saturation, Value | Color-based filtering and segmentation |
| Grayscale | Single intensity channel | Edge detection, feature extraction |
| LAB | Lightness, A (green-red), B (blue-yellow) | Color correction, perceptual uniformity |
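
As a rough illustration of how one pixel looks in different color spaces, the sketch below converts a single RGB value to HSV with Python's standard `colorsys` module and to grayscale with the commonly used ITU-R BT.601 luma weights. The helper names `rgb_to_hsv_degrees` and `rgb_to_gray` are made up for this example, and image libraries use their own value ranges for HSV, so treat the scaling here as one convention among several.

```python
import colorsys


def rgb_to_hsv_degrees(r: int, g: int, b: int) -> tuple[float, float, float]:
    """Convert 8-bit RGB to (hue in degrees, saturation 0-1, value 0-1)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360, s, v


def rgb_to_gray(r: int, g: int, b: int) -> int:
    """Weighted sum of the channels (BT.601 luma weights), rounded to 0-255."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


# Pure red: hue 0 degrees, fully saturated, full brightness
print(rgb_to_hsv_degrees(255, 0, 0))  # (0.0, 1.0, 1.0)

# Pure green comes out brighter than pure blue in grayscale
print(rgb_to_gray(0, 255, 0), rgb_to_gray(0, 0, 255))  # 150 29
```

Separating hue from brightness is what makes HSV convenient for color-based filtering: a range of hues can be selected without being thrown off as much by lighting changes.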
CV vs Image Processing vs Computer Graphics
| Field | Input | Output | Goal |
|---|---|---|---|
| Image Processing | Image | Image | Enhance, filter, transform images |
| Computer Vision | Image/Video | Understanding | Extract meaning from visual data |
| Computer Graphics | Data/Models | Image/Video | Create visual content from data |
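
The difference in outputs can be shown with a toy example. In the sketch below, `invert` is an image processing step (image in, image out), while `describe_brightness` is a deliberately simplistic stand-in for a vision step (image in, label out). Both function names, the `GrayImage` alias, and the threshold of 128 are assumptions made for illustration; real CV systems rely on far richer models than a single average.

```python
GrayImage = list[list[int]]  # grayscale image as rows of 0-255 intensities


def invert(image: GrayImage) -> GrayImage:
    """Image processing: transform every pixel and return a new image."""
    return [[255 - value for value in row] for row in image]


def describe_brightness(image: GrayImage, threshold: int = 128) -> str:
    """Toy 'understanding': reduce the whole image to a single label."""
    pixels = [value for row in image for value in row]
    mean = sum(pixels) / len(pixels)
    return "bright" if mean >= threshold else "dark"


night_scene = [
    [10, 20, 30],
    [40, 50, 60],
]

print(invert(night_scene))                       # [[245, 235, 225], [215, 205, 195]]
print(describe_brightness(night_scene))          # dark
print(describe_brightness(invert(night_scene)))  # bright
```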
Applications of Computer Vision
- Face Recognition: Unlocking phones, identity verification, photo organization. Used by Apple Face ID, social media tagging, and security systems.
- Autonomous Driving: Self-driving cars use cameras, lidar, and radar with CV to detect lanes, vehicles, pedestrians, and traffic signs.
- Medical Imaging: AI analyzes X-rays, MRIs, CT scans, and pathology slides to detect diseases, tumors, and anomalies.
- Manufacturing Inspection: CV systems inspect products on assembly lines for defects, measuring quality at speeds impossible for humans.
- Augmented and Virtual Reality (AR/VR): CV powers AR and VR experiences by understanding the 3D environment, tracking objects, and overlaying digital content.
- Document Analysis: OCR, document classification, form extraction, and receipt scanning.
- Agriculture: Drone-based crop monitoring, disease detection, and yield estimation.
History and Evolution
1960s — Early Research
MIT's "Summer Vision Project" (1966) aimed to build a visual system in one summer. The problem proved far harder than expected.
1980s–1990s — Feature-Based Methods
Hand-crafted features like edges, corners (Harris), and SIFT descriptors enabled basic recognition.
2000s — Machine Learning Era
Learned classifiers on top of engineered features became standard: HOG + SVM for pedestrian detection, Viola-Jones for face detection, and bag-of-visual-words approaches for image classification.
2012 — Deep Learning Revolution
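AlexNet won the 2012 ImageNet challenge (ILSVRC) with a top-5 error of about 15%, roughly ten percentage points ahead of the runner-up. The result showed that CNNs trained on GPUs could clearly outperform hand-crafted feature pipelines, and it shifted CV research toward deep learning.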
2020s — Foundation Models
Large pretrained models now lead the field: Vision Transformers (ViT), CLIP, SAM, and multimodal models that combine vision with language understanding.
Lilly Tech Systems