Pretrained Vision Models

A comprehensive directory of pretrained computer vision models for image classification, object detection, segmentation, and image generation — with practical code examples for each.

Image Classification

ResNet (Residual Network)

The pioneering deep CNN architecture that introduced residual (skip) connections, which make very deep networks trainable. Available in ResNet-18, 34, 50, 101, and 152 variants, trained on ImageNet (1.2M images, 1,000 classes).

Python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# Load pretrained ResNet-50
model = models.resnet50(weights="IMAGENET1K_V2")
model.eval()

# Preprocess with the standard ImageNet statistics
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("photo.jpg")).unsqueeze(0)

# Run inference without tracking gradients
with torch.no_grad():
    output = model(img)
probs = output.softmax(dim=1)  # class probabilities

EfficientNet

Optimized for accuracy-to-compute ratio using neural architecture search. EfficientNet-B0 through B7 offer increasing accuracy with increasing compute.

Vision Transformer (ViT)

Applies the Transformer architecture to images by splitting them into patches. Achieves state-of-the-art accuracy when pretrained on sufficiently large datasets.

Python
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("photo.jpg")
print(result)
# [{'label': 'golden retriever', 'score': 0.95}, ...]

ConvNeXt

A modernized CNN that matches ViT performance by applying Transformer design principles to convolutional architectures. Simpler and faster to train.

Object Detection

YOLO (You Only Look Once)

The fastest real-time object detection family. YOLOv5, YOLOv8, and YOLO11 (all from Ultralytics) are the most widely used releases.

Python
from ultralytics import YOLO

# Load pretrained YOLOv8
model = YOLO("yolov8n.pt")  # n = nano; s, m, l, and x variants also available

# Run detection
results = model("image.jpg")
results[0].show()  # Display annotated image

DETR (Detection Transformer)

Facebook's end-to-end object detection with Transformers. No need for anchor boxes or non-maximum suppression.
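DETR checkpoints are published on the Hugging Face Hub, so it can be used through the same pipeline API shown for ViT above (the image path here is illustrative):

```python
from transformers import pipeline

# DETR with a ResNet-50 backbone, pretrained on COCO
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
results = detector("image.jpg")

# Each result has a label, a confidence score, and a bounding box
for r in results:
    print(r["label"], round(r["score"], 2), r["box"])
```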

Faster R-CNN

Two-stage detector that first proposes regions, then classifies them. Higher accuracy than single-stage detectors but slower.

Segmentation

SAM (Segment Anything Model)

Meta's foundation model for image segmentation. Can segment any object with a point, box, or text prompt.

Python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint file downloaded from the SAM repository
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# SAM expects an HxWx3 uint8 RGB array
predictor.set_image(np.array(Image.open("photo.jpg").convert("RGB")))

# Prompt with one foreground point at pixel (x, y); label 1 = foreground
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]), point_labels=np.array([1])
)

Mask R-CNN

Extends Faster R-CNN with instance segmentation. Detects objects and generates pixel-level masks for each instance.
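Like Faster R-CNN, a COCO-pretrained Mask R-CNN is available directly in torchvision; the output adds a per-instance mask tensor:

```python
import torch
import torchvision

# Mask R-CNN with a ResNet-50 FPN backbone, pretrained on COCO
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

with torch.rand(3, 480, 640) as _ if False else torch.no_grad():
    preds = model([torch.rand(3, 480, 640)])

# In addition to boxes, labels, and scores, each instance gets a soft mask
print(preds[0]["masks"].shape)  # (num_instances, 1, H, W)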

DeepLab

Google's semantic segmentation model using atrous convolutions. Classifies every pixel in the image into predefined categories.

Image Generation

Stable Diffusion

Open-source text-to-image model by Stability AI. Generates high-quality images from text descriptions.

Python
from diffusers import StableDiffusionXLPipeline
import torch

# SDXL uses its own pipeline class in diffusers
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe("A serene mountain lake at sunset, photorealistic").images[0]
image.save("output.png")

Vision Models Comparison

Model        Task            Size         Speed      Best For
ResNet-50    Classification  25M params   Fast       Baseline, transfer learning
ViT-Base     Classification  86M params   Medium     High accuracy
YOLOv8-n     Detection       3M params    Very fast  Real-time detection
SAM-ViT-H    Segmentation    636M params  Medium     Universal segmentation
SDXL         Generation      3.5B params  Slow       Text-to-image

Next Up

Explore pretrained language models for text generation, classification, translation, and embeddings.

Next: Language Models →