# Multi-Modal Pretrained Models
Multi-modal models process and generate content across multiple data types — text, images, audio, and video. These models can understand images and answer questions, generate images from text, analyze documents, and more.
## Vision-Language Models

### CLIP (OpenAI)
Contrastive Language-Image Pre-training. Learns to match images with text descriptions. Powers zero-shot image classification, image search, and similarity scoring.
```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
inputs = processor(text=["a cat", "a dog", "a car"], images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # Probability for each label
```
### LLaVA
Large Language and Vision Assistant. Combines a vision encoder with an LLM to understand images and answer questions about them in natural language.
```python
from transformers import pipeline

vlm = pipeline("image-text-to-text", model="llava-hf/llava-1.5-7b-hf")
# LLaVA-1.5 expects its chat format, with an <image> placeholder in the prompt
result = vlm(images="photo.jpg",
             text="USER: <image>\nDescribe this image in detail. ASSISTANT:")
print(result)
```
### InternVL & Qwen-VL
InternVL (Shanghai AI Lab) and Qwen-VL (Alibaba) are powerful open-source vision-language models with strong OCR, chart understanding, and visual reasoning capabilities.
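Both families publish checkpoints on the Hugging Face Hub. As a sketch, a recent Qwen-VL release (Qwen2-VL) can be run through the same generic `image-text-to-text` pipeline, which in recent Transformers versions accepts chat-style messages; the image filename and question here are illustrative:

```python
from transformers import pipeline

# Qwen2-VL, a recent open-source Qwen-VL release (assumed checkpoint name)
vlm = pipeline("image-text-to-text", model="Qwen/Qwen2-VL-2B-Instruct")

# Chat-style input: each message mixes image and text content parts
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "chart.png"},
        {"type": "text", "text": "What is the highest value in this chart?"},
    ]}
]
result = vlm(text=messages, max_new_tokens=128)
print(result)
```

Chart and OCR-heavy questions like this are exactly where these models are claimed to be strong.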
## Image + Text Generation

### Stable Diffusion (Text-to-Image)
Generates images from text prompts using a latent diffusion model. SDXL and Stable Diffusion 3 are more recent releases in the family.
```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("A futuristic city at night, cyberpunk style, detailed").images[0]
image.save("cityscape.png")
```
### DALL-E
OpenAI's text-to-image model. DALL-E 3 generates highly detailed, creative images. Available through the OpenAI API.
## Video Understanding

### VideoLLaMA
Extends LLaMA with video understanding capabilities. Can answer questions about video content, describe actions, and summarize scenes.
### InternVideo
Video foundation model for action recognition, video-text retrieval, and video captioning. Trained on large-scale video-text data.
## Document AI

### LayoutLM (Microsoft)
Pre-trained model for document understanding. Combines text, layout (position), and image features to understand forms, invoices, receipts, and other documents.
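For example, a LayoutLM checkpoint fine-tuned for document question answering can be run through the Transformers `document-question-answering` pipeline. This sketch uses a community checkpoint and an illustrative filename, and the pipeline needs Tesseract plus `pytesseract` installed for the OCR step:

```python
from transformers import pipeline

# Community LayoutLM checkpoint fine-tuned for document QA
doc_qa = pipeline("document-question-answering",
                  model="impira/layoutlm-document-qa")

# Ask a question directly against a document image (filename is illustrative)
result = doc_qa(image="invoice.png", question="What is the invoice number?")
print(result)  # answers with text span and confidence score
```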
### Donut
Document understanding transformer that processes document images directly without OCR. End-to-end document parsing.
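A sketch of OCR-free parsing with the receipt-parsing (CORD) Donut checkpoint from the Hugging Face Hub; the input filename is illustrative:

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt tells the decoder which parsing task to perform
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(pixel_values,
                         decoder_input_ids=decoder_input_ids,
                         max_length=512)
sequence = processor.batch_decode(outputs)[0]
print(processor.token2json(sequence))  # structured JSON of the receipt contents
```

Note there is no OCR call anywhere: the model reads the pixels and emits structured tokens directly.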
## OCR Models

### TrOCR (Microsoft)
Transformer-based OCR that combines an image Transformer encoder with a text Transformer decoder for text recognition.
```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("text_image.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```
### EasyOCR & PaddleOCR
EasyOCR supports 80+ languages and is easy to set up. PaddleOCR (by Baidu) offers state-of-the-art accuracy for many languages, especially Chinese.
## Multi-Modal Models Summary
| Model | Modalities | Task | Best For |
|---|---|---|---|
| CLIP | Image + Text | Similarity, zero-shot classification | Image search, labeling |
| LLaVA | Image + Text | Visual Q&A, description | Image understanding |
| Stable Diffusion | Text → Image | Image generation | Creative content |
| LayoutLM | Document + Text | Document understanding | Form extraction |
| TrOCR | Image → Text | OCR | Text recognition |
## Next Up
Learn the practical steps for loading, running, and serving any pretrained model.
Next: Using Models →