Intermediate

Large Language Models (LLMs)

The most versatile AI model type — capable of text generation, reasoning, coding, translation, and much more. Learn what LLMs are, how they work, and when they are (and are not) the right choice.

What Are Large Language Models?

Large Language Models (LLMs) are neural networks with billions of parameters, trained on massive text datasets to understand and generate human language. At their core, LLMs are next-token prediction machines: given a sequence of text, they predict the most likely next token (word or subword). Despite this seemingly simple objective, scaling this approach to hundreds of billions of parameters and trillions of training tokens produces models with remarkable emergent abilities — including reasoning, coding, mathematical problem-solving, and creative writing.
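The next-token objective can be made concrete with a tiny sketch. This is a toy example, not a real model: the vocabulary and logits below are made up to illustrate how a probability distribution over the next token is produced and sampled greedily.

```python
import math

# Toy vocabulary and made-up logits a model might assign for the
# context "The cat sat on the" -- illustrative numbers only.
vocab = ["mat", "dog", "moon", "chair"]
logits = [4.0, 1.5, 0.5, 2.0]

# Softmax turns raw logits into a probability distribution over tokens.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Greedy decoding picks the highest-probability next token.
next_token = vocab[probs.index(max(probs))]
print(next_token)  # "mat"
```

A real LLM does the same thing, but over a vocabulary of ~100K tokens, with logits computed by a Transformer with billions of parameters, repeated once per generated token.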

LLMs are the most general-purpose model type in AI today. A single LLM can perform hundreds of distinct tasks through prompting alone, without any task-specific training. This versatility is what makes them the default starting point for many AI applications — but it is also why they are sometimes used where a more specialized model would be a better fit.

💡
Definition: An LLM is a neural network model (typically based on the Transformer architecture) with at least several billion parameters, pre-trained on a large corpus of text data using self-supervised learning (next-token prediction), and capable of understanding and generating natural language across a wide range of tasks.

Key Architectures

Not all LLMs are built the same way. There are two dominant architectures:

Decoder-Only Transformers

The vast majority of modern LLMs use a decoder-only architecture. These models generate text left-to-right, one token at a time. Each token can only attend to tokens that came before it (causal attention). This architecture is optimized for text generation tasks.

  • Examples: GPT-4, GPT-4o, Claude 4, LLaMA 3, Mistral, Qwen 2.5, Phi-4, Command R+
  • Strengths: Excellent at open-ended generation, reasoning, coding, conversation
  • Training objective: Predict the next token given all previous tokens
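Causal attention can be sketched in a few lines of NumPy. This is a minimal illustration of the mask alone (random scores, no learned projections): entries above the diagonal are set to −∞ before the softmax, so each position receives zero weight from future tokens.

```python
import numpy as np

# Causal (lower-triangular) mask for a 4-token sequence: position i may
# attend only to positions <= i.
seq_len = 4
scores = np.random.randn(seq_len, seq_len)  # stand-in for attention scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf  # exp(-inf) = 0, so masked positions get no weight

# Row-wise softmax over the masked scores.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row i now has nonzero weight only on columns 0..i.
print(np.round(weights, 2))
```

The first row attends only to itself (weight 1.0 on position 0), which is why the first generated token depends on nothing but the prompt's first position in this toy setup.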

Encoder-Decoder Transformers

Some models use a two-part architecture: an encoder processes the full input bidirectionally, and a decoder generates the output. This architecture excels at tasks where understanding the complete input is critical before generating output.

  • Examples: T5, FLAN-T5, UL2, mT5
  • Strengths: Translation, summarization, structured extraction
  • Training objective: Reconstruct corrupted text spans (span corruption)
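Span corruption is easiest to see with a worked example. The sketch below constructs a T5-style input/target pair by hand; the spans chosen are illustrative (the real procedure masks random spans), and `<extra_id_N>` is T5's sentinel-token naming convention.

```python
# T5-style span corruption: contiguous spans of the input are replaced
# with sentinel tokens, and the model learns to reconstruct exactly the
# missing spans, delimited by the same sentinels.
text = "The quick brown fox jumps over the lazy dog".split()

# Corrupt tokens 1-2 ("quick brown") and token 6 ("the") by hand.
corrupted = ["The", "<extra_id_0>", "fox", "jumps", "over",
             "<extra_id_1>", "lazy", "dog"]
target = ["<extra_id_0>", "quick", "brown",
          "<extra_id_1>", "the", "<extra_id_2>"]

print(" ".join(corrupted))
print(" ".join(target))
```

The encoder sees the corrupted sequence bidirectionally; the decoder generates only the short target, which is what makes this objective efficient for encoder-decoder training.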

💡
Practical note: In 2025, decoder-only models dominate the LLM landscape. When someone says "LLM," they almost always mean a decoder-only Transformer. Encoder-decoder models are still used in specific production systems (especially translation), but new frontier models are almost exclusively decoder-only.

Major LLMs in 2025

The LLM landscape is evolving rapidly. Here are the most important models to know:

Closed-Source (API-Only) Models

| Model | Provider | Parameters | Context Window | Key Strengths |
| --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | ~1.8T (estimated) | 128K tokens | Multimodal, fast, strong reasoning, function calling |
| GPT-4 Turbo | OpenAI | ~1.8T (estimated) | 128K tokens | Strong coding, vision, JSON mode |
| Claude 4 | Anthropic | Not disclosed | 1M tokens | Extended thinking, agentic coding, longest context window, safety |
| Claude 4 Sonnet | Anthropic | Not disclosed | 200K tokens | Best balance of speed and intelligence, strong coding |
| Gemini 2.5 Pro | Google | Not disclosed | 1M tokens | Long context, multimodal, grounding with Google Search |
| Command R+ | Cohere | 104B | 128K tokens | RAG-optimized, multilingual, enterprise focus |

Open-Weight Models

| Model | Provider | Parameters | Context Window | Key Strengths |
| --- | --- | --- | --- | --- |
| LLaMA 3 405B | Meta | 405B | 128K tokens | Most capable open model, competitive with GPT-4 |
| LLaMA 3 70B | Meta | 70B | 128K tokens | Strong all-around, can run on multi-GPU setups |
| LLaMA 3 8B | Meta | 8B | 128K tokens | Runs on consumer GPUs, great for fine-tuning |
| Mistral Large 2 | Mistral AI | 123B | 128K tokens | Strong multilingual, coding, function calling |
| Mixtral 8x22B | Mistral AI | 176B (39B active) | 64K tokens | Mixture of Experts, efficient inference |
| Qwen 2.5 72B | Alibaba | 72B | 128K tokens | Strong math, coding, multilingual (especially Chinese) |
| Phi-4 | Microsoft | 14B | 16K tokens | Exceptionally capable for its size, strong reasoning |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | 128K tokens | MoE architecture, strong coding and math, cost-efficient |

Parameters and Scale

The number of parameters in an LLM is one of the most discussed metrics, but it is not the only factor that determines capability. Training data quality, training methodology (RLHF, DPO), and architecture innovations all play critical roles.

📝
Understanding parameter counts:
  • 1B parameters = 1 billion learnable weights in the neural network (~2 GB in FP16)
  • 7B–13B: Can run on a single consumer GPU (24 GB VRAM) with quantization
  • 70B: Requires 2–4 high-end GPUs or cloud deployment
  • 400B+: Requires large GPU clusters, typically accessed via API
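The memory figures above follow directly from parameters × bytes per parameter. A quick helper makes the arithmetic explicit; note it covers weights only, so real deployments need headroom for activations and the KV cache.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight-memory footprint in GB.

    FP16/BF16 uses 2 bytes per parameter; 8-bit quantization uses 1;
    4-bit uses ~0.5. Excludes activations and KV cache, so actual VRAM
    requirements are higher.
    """
    # params_billions * 1e9 params * bytes / 1e9 bytes-per-GB simplifies to:
    return params_billions * bytes_per_param

print(weight_memory_gb(7))        # 14.0 GB -- FP16, tight on a 24 GB GPU
print(weight_memory_gb(70))       # 140.0 GB -- why 70B needs multiple GPUs
print(weight_memory_gb(70, 0.5))  # 35.0 GB -- 4-bit quantized 70B
```

This is why a quantized 70B model fits on two 24 GB consumer GPUs while the same model in FP16 does not.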

A Mixture of Experts (MoE) model like Mixtral 8x22B has 176B total parameters but only activates 39B per token, making inference much faster and cheaper than a dense 176B model.
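The MoE mechanism can be sketched in miniature. The toy layer below uses top-2 routing over 8 tiny random "experts"; all sizes and weights are illustrative, but the structure — a router scores experts, only the top-k run, and their outputs are gate-weighted — is the core idea behind models like Mixtral.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 8, 16
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy expert weights
router = rng.standard_normal((d, n_experts))                       # toy router weights

def moe_layer(x: np.ndarray, k: int = 2) -> np.ndarray:
    scores = x @ router                    # one routing logit per expert
    top = np.argsort(scores)[-k:]          # indices of the k best experts
    gates = np.exp(scores[top])
    gates /= gates.sum()                   # normalized gate weights
    # Only k of n_experts matrix multiplies actually run -- that is the
    # source of MoE's inference savings.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.standard_normal(d))
print(y.shape)  # (16,)
```

With k=2 of 8 experts, only a quarter of the expert parameters touch each token, which mirrors how Mixtral 8x22B activates 39B of its 176B parameters.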

Core Capabilities

Modern LLMs are remarkably versatile. Here are their primary capabilities:

Text Generation

The foundational capability of all LLMs. They can write articles, emails, marketing copy, fiction, documentation, and virtually any other text format. Quality varies by model — frontier models like Claude 4 and GPT-4o produce text that is often indistinguishable from expert human writing.

Code Generation and Understanding

LLMs have become powerful coding assistants. They can write code in dozens of languages, debug existing code, explain complex functions, convert between languages, and even build complete applications. Specialized coding models (like Codestral and DeepSeek-Coder) push this capability further.

Example: Using an LLM for code generation
# Prompt: "Write a Python function that calculates the Fibonacci
# sequence using memoization"

# LLM Output:
def fibonacci(n: int, memo: dict | None = None) -> int:
    """Calculate the nth Fibonacci number using memoization."""
    if memo is None:  # avoid Python's shared-mutable-default pitfall
        memo = {}
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fibonacci(n - 1, memo) + fibonacci(n - 2, memo)
    return memo[n]

# Usage
print(fibonacci(50))  # 12586269025 (instant, no redundant computation)

Reasoning and Analysis

Frontier LLMs can perform multi-step reasoning, solve mathematical problems, analyze complex scenarios, and draw logical conclusions. "Extended thinking" or "chain-of-thought" capabilities in models like Claude 4 and o1 allow them to work through problems step by step before providing an answer.
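Chain-of-thought prompting is, at its simplest, a change to the prompt text. The sketch below contrasts a direct prompt with a step-by-step variant as plain strings; the example question and wording are illustrative, not taken from any particular model's documentation.

```python
# Direct prompt: asks only for the answer.
direct_prompt = (
    "A store sells pens at $1.20 each. How much do 7 pens cost? "
    "Answer with just the number."
)

# Chain-of-thought prompt: asks the model to reason before answering.
cot_prompt = (
    "A store sells pens at $1.20 each. How much do 7 pens cost? "
    "Think through the problem step by step, then give the final answer."
)

# With the second prompt, models typically emit intermediate work
# (7 x $1.20 = $8.40) before the answer, which tends to improve
# accuracy on multi-step problems.
print(cot_prompt)
```

Models with built-in extended thinking perform this intermediate reasoning automatically, without the explicit instruction.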

Translation

LLMs provide high-quality translation between languages, often rivaling dedicated translation systems. They excel at nuanced, context-aware translation that captures tone and intent, not just literal meaning.

Summarization

Given long documents, articles, or conversation histories, LLMs can produce concise, accurate summaries. Models with large context windows (Claude 4 with 1M tokens, Gemini 2.5 Pro with 1M tokens) can summarize entire books.

Limitations

Despite their impressive capabilities, LLMs have significant limitations that every practitioner must understand:

Critical LLM Limitations:
  • Hallucinations: LLMs can generate confident, plausible-sounding text that is factually incorrect. They do not "know" facts — they predict statistically likely text. Always verify factual claims from LLM output, especially for medical, legal, and financial content.
  • Bias: Models reflect biases present in their training data, including cultural, gender, racial, and political biases. This can lead to unfair or harmful outputs in sensitive applications.
  • Cost: Frontier LLM API calls cost $5–$75 per million tokens. For high-volume applications (millions of requests/day), this adds up to thousands of dollars daily. Smaller, specialized models can be 100x cheaper.
  • Knowledge cutoff: LLMs only know information from their training data. They cannot access real-time information unless connected to external tools (search, databases, APIs).
  • Latency: Generating long responses takes seconds to tens of seconds. For real-time applications requiring sub-100ms response times, LLMs are often too slow.
  • Context window limits: Even models with 1M token windows can struggle with very long inputs. Performance often degrades in the middle of very long contexts ("lost in the middle" problem).

Use Cases by Industry

| Industry | Use Case | Recommended Models |
| --- | --- | --- |
| Software Development | Code generation, code review, debugging, documentation | Claude 4, GPT-4o, Codestral, DeepSeek-Coder |
| Customer Support | Chatbots, email drafting, ticket classification, FAQ answers | Claude 4 Sonnet, GPT-4o mini, Command R+ |
| Healthcare | Medical documentation, patient communication, research summaries | GPT-4 (with guardrails), Med-PaLM 2, fine-tuned LLaMA |
| Legal | Contract analysis, case research, document drafting | Claude 4 (long context), GPT-4 Turbo |
| Finance | Report generation, market analysis, compliance checks | GPT-4o, Claude 4 Sonnet, fine-tuned models |
| Education | Tutoring, content creation, grading assistance, personalized learning | Claude 4, GPT-4o, LLaMA 3 70B |
| Marketing | Copywriting, SEO content, social media, email campaigns | Claude 4 Sonnet, GPT-4o, Mistral Large |
| Research | Literature review, hypothesis generation, data analysis | Claude 4 (1M context), Gemini 2.5 Pro |

When to Use LLMs vs Other Model Types

LLMs are versatile, but they are not always the right choice. Here is a practical decision guide:

✅ Use an LLM When

  • You need open-ended text generation
  • The task requires reasoning or multi-step logic
  • You need a conversational interface
  • The task is complex and hard to define with rules
  • You need flexibility across many task types
  • Latency of 1–10 seconds is acceptable

❌ Use a Different Model When

  • Simple classification → Use BERT/DistilBERT (faster, cheaper)
  • Semantic search → Use embedding models (purpose-built)
  • Image analysis → Use vision models like YOLO or SAM
  • Speech transcription → Use Whisper or Deepgram
  • Tabular data prediction → Use XGBoost or LightGBM
  • Sub-100ms latency required → Use a specialized small model
  • Millions of daily requests on a budget → Use fine-tuned small models
💡
Best practice: Start with an LLM to prototype and validate your use case. Once you understand the task well, evaluate whether a smaller, specialized model could handle it at lower cost and higher speed. Many production systems start with GPT-4 during development and migrate to a fine-tuned LLaMA 3 8B or a BERT classifier for deployment.

The Cost-Performance Tradeoff

Understanding pricing is essential for production LLM applications:

| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Speed (tokens/sec) |
| --- | --- | --- | --- |
| GPT-4o | $2.50 | $10.00 | ~80–100 |
| GPT-4o mini | $0.15 | $0.60 | ~120–150 |
| Claude 4 Sonnet | $3.00 | $15.00 | ~70–90 |
| Claude 4 Haiku | $0.25 | $1.25 | ~150–200 |
| Gemini 2.5 Pro | $1.25 | $5.00 | ~80–100 |
| LLaMA 3 70B (self-hosted) | ~$0.50* | ~$0.50* | ~30–60 |
| LLaMA 3 8B (self-hosted) | ~$0.05* | ~$0.05* | ~80–120 |

* Self-hosted costs are approximate and depend on GPU hardware, utilization rate, and hosting provider. Costs shown are compute-only, excluding engineering and maintenance overhead.
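The tradeoff becomes concrete with a back-of-the-envelope calculator. The prices below come from the table above; the request volume and token counts are assumed workload figures for illustration.

```python
def monthly_cost(requests_per_day: int, in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Estimated monthly API spend in USD. Prices are per 1M tokens."""
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return requests_per_day * per_request * 30

# Assumed workload: 100K requests/day, 1K input + 500 output tokens each.
gpt4o = monthly_cost(100_000, 1000, 500, 2.50, 10.00)
mini = monthly_cost(100_000, 1000, 500, 0.15, 0.60)

print(f"GPT-4o:      ${gpt4o:,.0f}/month")   # $22,500/month
print(f"GPT-4o mini: ${mini:,.0f}/month")    # $1,350/month
```

At this volume the smaller model is roughly 17x cheaper, which is why the prototype-with-frontier, deploy-with-small pattern is so common.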

Key Takeaways

💡
  • LLMs are the most versatile AI model type, capable of text generation, coding, reasoning, translation, and analysis.
  • Modern LLMs use decoder-only Transformer architectures and range from 1B to 1T+ parameters.
  • Both closed-source (GPT-4, Claude 4, Gemini) and open-weight (LLaMA 3, Mistral, Qwen) models are available, each with different tradeoffs.
  • LLMs have significant limitations including hallucinations, bias, high cost, and latency.
  • For many specific tasks (classification, search, tabular prediction), specialized models outperform LLMs at a fraction of the cost.
  • The best practice is to prototype with a frontier LLM, then optimize with smaller or specialized models for production.