Intermediate

Large Language Models (LLMs)

The most versatile AI model type — capable of text generation, reasoning, coding, translation, and much more. Learn what LLMs are, how they work, and when they are (and are not) the right choice.

What Are Large Language Models?

Large Language Models (LLMs) are neural networks with billions of parameters, trained on massive text datasets to understand and generate human language. At their core, LLMs are next-token prediction machines: given a sequence of text, they predict the most likely next token (word or subword). Despite this seemingly simple objective, scaling this approach to hundreds of billions of parameters and trillions of training tokens produces models with remarkable emergent abilities — including reasoning, coding, mathematical problem-solving, and creative writing.
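The next-token objective can be made concrete with a tiny sketch. This is a toy example, not a real model: the vocabulary and logits below are made up to illustrate how a probability distribution over the next token is produced and sampled greedily.

```python
import math

# Toy vocabulary and made-up logits a model might assign for the
# context "The cat sat on the" -- illustrative numbers only.
vocab = ["mat", "dog", "moon", "chair"]
logits = [4.0, 1.5, 0.5, 2.0]

# Softmax turns raw logits into a probability distribution over tokens.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Greedy decoding picks the highest-probability next token.
next_token = vocab[probs.index(max(probs))]
print(next_token)  # "mat"
```

A real LLM does the same thing, but over a vocabulary of ~100K tokens, with logits computed by a Transformer with billions of parameters, repeated once per generated token.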

LLMs are the most general-purpose model type in AI today. A single LLM can perform hundreds of distinct tasks through prompting alone, without any task-specific training. This versatility is what makes them the default starting point for many AI applications — but it is also why they are sometimes used where a more specialized model would be a better fit.

💡
Definition: An LLM is a neural network model (typically based on the Transformer architecture) with at least several billion parameters, pre-trained on a large corpus of text data using self-supervised learning (next-token prediction), and capable of understanding and generating natural language across a wide range of tasks.

Key Architectures

Not all LLMs are built the same way. There are two dominant architectures:

Decoder-Only Transformers

The vast majority of modern LLMs use a decoder-only architecture. These models generate text left-to-right, one token at a time. Each token can only attend to tokens that came before it (causal attention). This architecture is optimized for text generation tasks.

  • Examples: GPT-4, GPT-4o, Claude 4, LLaMA 3, Mistral, Qwen 2.5, Phi-4, Command R+
  • Strengths: Excellent at open-ended generation, reasoning, coding, conversation
  • Training objective: Predict the next token given all previous tokens
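Causal attention can be sketched in a few lines of NumPy. This is a minimal illustration of the mask alone (random scores, no learned projections): entries above the diagonal are set to −∞ before the softmax, so each position receives zero weight from future tokens.

```python
import numpy as np

# Causal (lower-triangular) mask for a 4-token sequence: position i may
# attend only to positions <= i.
seq_len = 4
scores = np.random.randn(seq_len, seq_len)  # stand-in for attention scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf  # exp(-inf) = 0, so masked positions get no weight

# Row-wise softmax over the masked scores.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row i now has nonzero weight only on columns 0..i.
print(np.round(weights, 2))
```

The first row attends only to itself (weight 1.0 on position 0), which is why the first generated token depends on nothing but the prompt's first position in this toy setup.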

Encoder-Decoder Transformers

Some models use a two-part architecture: an encoder processes the full input bidirectionally, and a decoder generates the output. This architecture excels at tasks where understanding the complete input is critical before generating output.

  • Examples: T5, FLAN-T5, UL2, mT5
  • Strengths: Translation, summarization, structured extraction
  • Training objective: Reconstruct corrupted text spans (span corruption)
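Span corruption is easiest to see with a worked example. The sketch below constructs a T5-style input/target pair by hand; the spans chosen are illustrative (the real procedure masks random spans), and `<extra_id_N>` is T5's sentinel-token naming convention.

```python
# T5-style span corruption: contiguous spans of the input are replaced
# with sentinel tokens, and the model learns to reconstruct exactly the
# missing spans, delimited by the same sentinels.
text = "The quick brown fox jumps over the lazy dog".split()

# Corrupt tokens 1-2 ("quick brown") and token 6 ("the") by hand.
corrupted = ["The", "<extra_id_0>", "fox", "jumps", "over",
             "<extra_id_1>", "lazy", "dog"]
target = ["<extra_id_0>", "quick", "brown",
          "<extra_id_1>", "the", "<extra_id_2>"]

print(" ".join(corrupted))
print(" ".join(target))
```

The encoder sees the corrupted sequence bidirectionally; the decoder generates only the short target, which is what makes this objective efficient for encoder-decoder training.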

💡
Practical note: In 2025, decoder-only models dominate the LLM landscape. When someone says "LLM," they almost always mean a decoder-only Transformer. Encoder-decoder models are still used in specific production systems (especially translation), but new frontier models are almost exclusively decoder-only.

Major LLMs in 2025

The LLM landscape is evolving rapidly. Here are the most important models to know:

Closed-Source (API-Only) Models

| Model | Provider | Parameters | Context Window | Key Strengths |
| --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | ~1.8T (estimated) | 128K tokens | Multimodal, fast, strong reasoning, function calling |
| GPT-4 Turbo | OpenAI | ~1.8T (estimated) | 128K tokens | Strong coding, vision, JSON mode |
| Claude 4 | Anthropic | Not disclosed | 1M tokens | Extended thinking, agentic coding, longest context window, safety |
| Claude 4 Sonnet | Anthropic | Not disclosed | 200K tokens | Best balance of speed and intelligence, strong coding |
| Gemini 2.5 Pro | Google | Not disclosed | 1M tokens | Long context, multimodal, grounding with Google Search |
| Command R+ | Cohere | 104B | 128K tokens | RAG-optimized, multilingual, enterprise focus |

Open-Weight Models

| Model | Provider | Parameters | Context Window | Key Strengths |
| --- | --- | --- | --- | --- |
| LLaMA 3 405B | Meta | 405B | 128K tokens | Most capable open model, competitive with GPT-4 |
| LLaMA 3 70B | Meta | 70B | 128K tokens | Strong all-around, can run on multi-GPU setups |
| LLaMA 3 8B | Meta | 8B | 128K tokens | Runs on consumer GPUs, great for fine-tuning |
| Mistral Large 2 | Mistral AI | 123B | 128K tokens | Strong multilingual, coding, function calling |
| Mixtral 8x22B | Mistral AI | 176B (39B active) | 64K tokens | Mixture of Experts, efficient inference |
| Qwen 2.5 72B | Alibaba | 72B | 128K tokens | Strong math, coding, multilingual (especially Chinese) |
| Phi-4 | Microsoft | 14B | 16K tokens | Exceptionally capable for its size, strong reasoning |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | 128K tokens | MoE architecture, strong coding and math, cost-efficient |

Parameters and Scale

The number of parameters in an LLM is one of the most discussed metrics, but it is not the only factor that determines capability. Training data quality, training methodology (RLHF, DPO), and architecture innovations all play critical roles.

📝
Understanding parameter counts:
  • 1B parameters = 1 billion learnable weights in the neural network (~2 GB in FP16)
  • 7B–13B: Can run on a single consumer GPU (24 GB VRAM) with quantization
  • 70B: Requires 2–4 high-end GPUs or cloud deployment
  • 400B+: Requires large GPU clusters, typically accessed via API
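The memory figures above follow directly from parameters × bytes per parameter. A quick helper makes the arithmetic explicit; note it covers weights only, so real deployments need headroom for activations and the KV cache.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight-memory footprint in GB.

    FP16/BF16 uses 2 bytes per parameter; 8-bit quantization uses 1;
    4-bit uses ~0.5. Excludes activations and KV cache, so actual VRAM
    requirements are higher.
    """
    # params_billions * 1e9 params * bytes / 1e9 bytes-per-GB simplifies to:
    return params_billions * bytes_per_param

print(weight_memory_gb(7))        # 14.0 GB -- FP16, tight on a 24 GB GPU
print(weight_memory_gb(70))       # 140.0 GB -- why 70B needs multiple GPUs
print(weight_memory_gb(70, 0.5))  # 35.0 GB -- 4-bit quantized 70B
```

This is why a quantized 70B model fits on two 24 GB consumer GPUs while the same model in FP16 does not.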

A Mixture of Experts (MoE) model like Mixtral 8x22B has 176B total parameters but only activates 39B per token, making inference much faster and cheaper than a dense 176B model.
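The MoE mechanism can be sketched in miniature. The toy layer below uses top-2 routing over 8 tiny random "experts"; all sizes and weights are illustrative, but the structure — a router scores experts, only the top-k run, and their outputs are gate-weighted — is the core idea behind models like Mixtral.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 8, 16
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy expert weights
router = rng.standard_normal((d, n_experts))                       # toy router weights

def moe_layer(x: np.ndarray, k: int = 2) -> np.ndarray:
    scores = x @ router                    # one routing logit per expert
    top = np.argsort(scores)[-k:]          # indices of the k best experts
    gates = np.exp(scores[top])
    gates /= gates.sum()                   # normalized gate weights
    # Only k of n_experts matrix multiplies actually run -- that is the
    # source of MoE's inference savings.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.standard_normal(d))
print(y.shape)  # (16,)
```

With k=2 of 8 experts, only a quarter of the expert parameters touch each token, which mirrors how Mixtral 8x22B activates 39B of its 176B parameters.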

Core Capabilities

Modern LLMs are remarkably versatile. Here are their primary capabilities:

Text Generation

The foundational capability of all LLMs. They can write articles, emails, marketing copy, fiction, documentation, and virtually any other text format. Quality varies by model — frontier models like Claude 4 and GPT-4o produce text that is often indistinguishable from expert human writing.

Code Generation and Understanding

LLMs have become powerful coding assistants. They can write code in dozens of languages, debug existing code, explain complex functions, convert between languages, and even build complete applications. Specialized coding models (like Codestral and DeepSeek-Coder) push this capability further.

Example: Using an LLM for code generation
# Prompt: "Write a Python function that calculates the Fibonacci
# sequence using memoization"

# LLM Output:
def fibonacci(n: int, memo: dict | None = None) -> int:
    """Calculate the nth Fibonacci number using memoization."""
    if memo is None:  # avoid Python's shared-mutable-default pitfall
        memo = {}
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fibonacci(n - 1, memo) + fibonacci(n - 2, memo)
    return memo[n]

# Usage
print(fibonacci(50))  # 12586269025 (instant, no redundant computation)

Reasoning and Analysis

Frontier LLMs can perform multi-step reasoning, solve mathematical problems, analyze complex scenarios, and draw logical conclusions. "Extended thinking" or "chain-of-thought" capabilities in models like Claude 4 and o1 allow them to work through problems step by step before providing an answer.
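Chain-of-thought prompting is, at its simplest, a change to the prompt text. The sketch below contrasts a direct prompt with a step-by-step variant as plain strings; the example question and wording are illustrative, not taken from any particular model's documentation.

```python
# Direct prompt: asks only for the answer.
direct_prompt = (
    "A store sells pens at $1.20 each. How much do 7 pens cost? "
    "Answer with just the number."
)

# Chain-of-thought prompt: asks the model to reason before answering.
cot_prompt = (
    "A store sells pens at $1.20 each. How much do 7 pens cost? "
    "Think through the problem step by step, then give the final answer."
)

# With the second prompt, models typically emit intermediate work
# (7 x $1.20 = $8.40) before the answer, which tends to improve
# accuracy on multi-step problems.
print(cot_prompt)
```

Models with built-in extended thinking perform this intermediate reasoning automatically, without the explicit instruction.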

Translation

LLMs provide high-quality translation between languages, often rivaling dedicated translation systems. They excel at nuanced, context-aware translation that captures tone and intent, not just literal meaning.

Summarization

Given long documents, articles, or conversation histories, LLMs can produce concise, accurate summaries. Models with large context windows (Claude 4 with 1M tokens, Gemini 2.5 Pro with 1M tokens) can summarize entire books.

Limitations

Despite their impressive capabilities, LLMs have significant limitations that every practitioner must understand:

Critical LLM Limitations:
  • Hallucinations: LLMs can generate confident, plausible-sounding text that is factually incorrect. They do not "know" facts — they predict statistically likely text. Always verify factual claims from LLM output, especially for medical, legal, and financial content.
  • Bias: Models reflect biases present in their training data, including cultural, gender, racial, and political biases. This can lead to unfair or harmful outputs in sensitive applications.
  • Cost: Frontier LLM API calls cost $5–$75 per million tokens. For high-volume applications (millions of requests/day), this adds up to thousands of dollars daily. Smaller, specialized models can be 100x cheaper.
  • Knowledge cutoff: LLMs only know information from their training data. They cannot access real-time information unless connected to external tools (search, databases, APIs).
  • Latency: Generating long responses takes seconds to tens of seconds. For real-time applications requiring sub-100ms response times, LLMs are often too slow.
  • Context window limits: Even models with 1M token windows can struggle with very long inputs. Performance often degrades in the middle of very long contexts ("lost in the middle" problem).

Use Cases by Industry

| Industry | Use Case | Recommended Models |
| --- | --- | --- |
| Software Development | Code generation, code review, debugging, documentation | Claude 4, GPT-4o, Codestral, DeepSeek-Coder |
| Customer Support | Chatbots, email drafting, ticket classification, FAQ answers | Claude 4 Sonnet, GPT-4o mini, Command R+ |
| Healthcare | Medical documentation, patient communication, research summaries | GPT-4 (with guardrails), Med-PaLM 2, fine-tuned LLaMA |
| Legal | Contract analysis, case research, document drafting | Claude 4 (long context), GPT-4 Turbo |
| Finance | Report generation, market analysis, compliance checks | GPT-4o, Claude 4 Sonnet, fine-tuned models |
| Education | Tutoring, content creation, grading assistance, personalized learning | Claude 4, GPT-4o, LLaMA 3 70B |
| Marketing | Copywriting, SEO content, social media, email campaigns | Claude 4 Sonnet, GPT-4o, Mistral Large |
| Research | Literature review, hypothesis generation, data analysis | Claude 4 (1M context), Gemini 2.5 Pro |

When to Use LLMs vs Other Model Types

LLMs are versatile, but they are not always the right choice. Here is a practical decision guide:

✅ Use an LLM When

  • You need open-ended text generation
  • The task requires reasoning or multi-step logic
  • You need a conversational interface
  • The task is complex and hard to define with rules
  • You need flexibility across many task types
  • Latency of 1–10 seconds is acceptable

❌ Use a Different Model When

  • Simple classification → Use BERT/DistilBERT (faster, cheaper)
  • Semantic search → Use embedding models (purpose-built)
  • Image analysis → Use vision models like YOLO or SAM
  • Speech transcription → Use Whisper or Deepgram
  • Tabular data prediction → Use XGBoost or LightGBM
  • Sub-100ms latency required → Use a specialized small model
  • Millions of daily requests on a budget → Use fine-tuned small models
💡
Best practice: Start with an LLM to prototype and validate your use case. Once you understand the task well, evaluate whether a smaller, specialized model could handle it at lower cost and higher speed. Many production systems start with GPT-4 during development and migrate to a fine-tuned LLaMA 3 8B or a BERT classifier for deployment.

The Cost-Performance Tradeoff

Understanding pricing is essential for production LLM applications:

| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Speed (tokens/sec) |
| --- | --- | --- | --- |
| GPT-4o | $2.50 | $10.00 | ~80–100 |
| GPT-4o mini | $0.15 | $0.60 | ~120–150 |
| Claude 4 Sonnet | $3.00 | $15.00 | ~70–90 |
| Claude 4 Haiku | $0.25 | $1.25 | ~150–200 |
| Gemini 2.5 Pro | $1.25 | $5.00 | ~80–100 |
| LLaMA 3 70B (self-hosted) | ~$0.50* | ~$0.50* | ~30–60 |
| LLaMA 3 8B (self-hosted) | ~$0.05* | ~$0.05* | ~80–120 |

* Self-hosted costs are approximate and depend on GPU hardware, utilization rate, and hosting provider. Costs shown are compute-only, excluding engineering and maintenance overhead.
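The tradeoff becomes concrete with a back-of-the-envelope calculator. The prices below come from the table above; the request volume and token counts are assumed workload figures for illustration.

```python
def monthly_cost(requests_per_day: int, in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Estimated monthly API spend in USD. Prices are per 1M tokens."""
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return requests_per_day * per_request * 30

# Assumed workload: 100K requests/day, 1K input + 500 output tokens each.
gpt4o = monthly_cost(100_000, 1000, 500, 2.50, 10.00)
mini = monthly_cost(100_000, 1000, 500, 0.15, 0.60)

print(f"GPT-4o:      ${gpt4o:,.0f}/month")   # $22,500/month
print(f"GPT-4o mini: ${mini:,.0f}/month")    # $1,350/month
```

At this volume the smaller model is roughly 17x cheaper, which is why the prototype-with-frontier, deploy-with-small pattern is so common.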

Key Takeaways

💡
  • LLMs are the most versatile AI model type, capable of text generation, coding, reasoning, translation, and analysis.
  • Modern LLMs use decoder-only Transformer architectures and range from 1B to 1T+ parameters.
  • Both closed-source (GPT-4, Claude 4, Gemini) and open-weight (LLaMA 3, Mistral, Qwen) models are available, each with different tradeoffs.
  • LLMs have significant limitations including hallucinations, bias, high cost, and latency.
  • For many specific tasks (classification, search, tabular prediction), specialized models outperform LLMs at a fraction of the cost.
  • The best practice is to prototype with a frontier LLM, then optimize with smaller or specialized models for production.