Pretrained Vision Models
A comprehensive directory of pretrained computer vision models for image classification, object detection, segmentation, and image generation — with practical code examples for each.
Image Classification
ResNet (Residual Network)
The pioneering deep CNN architecture that introduced skip connections. Available in ResNet-18, 34, 50, 101, and 152 variants. Trained on ImageNet (1.2M images, 1000 classes).
```python
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# Load pretrained ResNet-50
model = models.resnet50(weights="IMAGENET1K_V2")
model.eval()

# Preprocess image
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("photo.jpg")).unsqueeze(0)
output = model(img)
```
EfficientNet
Optimized for accuracy-to-compute ratio using neural architecture search. EfficientNet-B0 through B7 offer increasing accuracy with increasing compute.
Vision Transformer (ViT)
Applies the Transformer architecture to images by splitting them into patches. Matches or exceeds CNN accuracy when pretrained on sufficiently large datasets.
```python
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("photo.jpg")
print(result)  # [{'label': 'golden retriever', 'score': 0.95}, ...]
```
ConvNeXt
A modernized CNN that matches ViT performance by applying Transformer design principles to convolutional architectures. Simpler and faster to train.
Object Detection
YOLO (You Only Look Once)
A family of fast, single-stage detectors built for real-time use. YOLOv5, YOLOv8, and YOLO11 (all by Ultralytics) are the most widely used versions.
```python
from ultralytics import YOLO

# Load pretrained YOLOv8 (n = nano; s, m, l, x variants also available)
model = YOLO("yolov8n.pt")

# Run detection
results = model("image.jpg")
results[0].show()  # Display annotated image
```
DETR (Detection Transformer)
Facebook's end-to-end object detection with Transformers. No need for anchor boxes or non-maximum suppression.
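DETR checkpoints are hosted on the Hugging Face Hub, so the same `pipeline` API shown for ViT also works for detection. A minimal sketch, using a blank PIL image as a stand-in for a real photo:

```python
from transformers import pipeline
from PIL import Image

# facebook/detr-resnet-50 is the standard pretrained DETR checkpoint
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Any PIL image or file path works; a blank image stands in for a real photo
results = detector(Image.new("RGB", (640, 480)))
for r in results:
    # Each result carries a label, a confidence score, and a pixel bounding box
    print(r["label"], round(r["score"], 2), r["box"])
```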
Faster R-CNN
Two-stage detector that first proposes regions, then classifies them. Higher accuracy than single-stage detectors but slower.
Segmentation
SAM (Segment Anything Model)
Meta's foundation model for image segmentation. Can segment any object with a point, box, or text prompt.
```python
from segment_anything import SamPredictor, sam_model_registry

# "image" is an HxWx3 NumPy array; "points" and "labels" are the prompt
# coordinates and their foreground/background labels (placeholders here)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, scores, logits = predictor.predict(
    point_coords=points, point_labels=labels
)
```
Mask R-CNN
Extends Faster R-CNN with instance segmentation. Detects objects and generates pixel-level masks for each instance.
DeepLab
Google's semantic segmentation model using atrous convolutions. Classifies every pixel in the image into predefined categories.
Image Generation
Stable Diffusion
Open-source text-to-image model by Stability AI. Generates high-quality images from text descriptions.
```python
from diffusers import StableDiffusionXLPipeline
import torch

# SDXL checkpoints need the XL pipeline class
# (StableDiffusionPipeline is for SD 1.x / 2.x checkpoints)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
image = pipe("A serene mountain lake at sunset, photorealistic").images[0]
image.save("output.png")
```
Vision Models Comparison
| Model | Task | Size | Speed | Best For |
|---|---|---|---|---|
| ResNet-50 | Classification | 25M params | Fast | Baseline, transfer learning |
| ViT-Base | Classification | 86M params | Medium | High accuracy |
| YOLOv8-n | Detection | 3M params | Very fast | Real-time detection |
| SAM-ViT-H | Segmentation | 636M params | Medium | Universal segmentation |
| SDXL | Generation | 3.5B params | Slow | Text-to-image |
Next Up
Explore pretrained language models for text generation, classification, translation, and embeddings.
Next: Language Models →
Lilly Tech Systems