Large Language Models (LLMs)
The most versatile AI model type — capable of text generation, reasoning, coding, translation, and much more. Learn what LLMs are, how they work, and when they are (and are not) the right choice.
What Are Large Language Models?
Large Language Models (LLMs) are neural networks with billions of parameters, trained on massive text datasets to understand and generate human language. At their core, LLMs are next-token prediction machines: given a sequence of text, they predict the most likely next token (word or subword). Despite this seemingly simple objective, scaling this approach to hundreds of billions of parameters and trillions of training tokens produces models with remarkable emergent abilities — including reasoning, coding, mathematical problem-solving, and creative writing.
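To make the objective concrete, here is a toy sketch of next-token generation, with a hand-made bigram table standing in for a real Transformer (the tokens and probabilities are invented purely for illustration):

```python
# Toy illustration of the next-token objective: a bigram "model" that
# assigns probabilities to the next token given only the previous one.
# Real LLMs condition on the full context with a Transformer, but the
# generation loop below is the same in spirit.

TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
    "dog": {"ran": 1.0},
}

def predict_next(token: str) -> str:
    """Greedy decoding: pick the highest-probability next token."""
    candidates = TOY_MODEL.get(token, {})
    return max(candidates, key=candidates.get) if candidates else "<eos>"

def generate(prompt: str, max_tokens: int = 5) -> str:
    """Repeatedly append the predicted next token until done."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        nxt = predict_next(tokens[-1])
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("the"))  # the cat sat down
```

Real models replace the lookup table with a network that outputs a probability for every token in a vocabulary of ~100K entries, and usually sample from that distribution rather than always taking the maximum.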
LLMs are the most general-purpose model type in AI today. A single LLM can perform hundreds of distinct tasks through prompting alone, without any task-specific training. This versatility is what makes them the default starting point for many AI applications — but it is also why they are sometimes used where a more specialized model would be a better fit.
Key Architectures
Not all LLMs are built the same way. There are two dominant architectures:
Decoder-Only Transformers
The vast majority of modern LLMs use a decoder-only architecture. These models generate text left-to-right, one token at a time. Each token can only attend to tokens that came before it (causal attention). This architecture is optimized for text generation tasks.
- Examples: GPT-4, GPT-4o, Claude 4, LLaMA 3, Mistral, Qwen 2.5, Phi-4, Command R+
- Strengths: Excellent at open-ended generation, reasoning, coding, conversation
- Training objective: Predict the next token given all previous tokens
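The causal-attention constraint can be sketched as a simple mask matrix (a toy illustration, not a real attention implementation): position i may attend only to positions at or before i.

```python
# Sketch of the causal (lower-triangular) attention mask used by
# decoder-only models: 1 = may attend, 0 = masked out.

def causal_mask(seq_len: int) -> list[list[int]]:
    """Return a seq_len x seq_len mask where row i marks which
    positions token i is allowed to attend to."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In practice the mask is applied inside attention by setting masked positions to negative infinity before the softmax, so they receive zero weight.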
Encoder-Decoder Transformers
Some models use a two-part architecture: an encoder processes the full input bidirectionally, and a decoder generates the output. This architecture excels at tasks where understanding the complete input is critical before generating output.
- Examples: T5, FLAN-T5, UL2, mT5
- Strengths: Translation, summarization, structured extraction
- Training objective: Reconstruct corrupted text spans (span corruption)
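A simplified sketch of the span-corruption objective: masked spans in the input are replaced with sentinel tokens, and the target reconstructs each span. The sentinel naming follows the T5 convention, but the helper function below is illustrative, not the actual training pipeline.

```python
# Simplified T5-style span corruption. Masked spans become sentinel
# tokens in the input; the target lists each sentinel followed by the
# original tokens it replaced.

def span_corrupt(tokens: list[str], spans: list[tuple[int, int]]):
    """spans are sorted, non-overlapping (start, end) pairs with end
    exclusive. Returns (input_tokens, target_tokens)."""
    inp, tgt, prev = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[prev:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        prev = end
    inp += tokens[prev:]
    return inp, tgt

tokens = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(tokens, [(1, 3), (6, 7)])
print(" ".join(inp))  # the <extra_id_0> fox jumps over <extra_id_1> lazy dog
print(" ".join(tgt))  # <extra_id_0> quick brown <extra_id_1> the
```

The encoder sees the corrupted input bidirectionally; the decoder learns to emit the target, which forces the model to use context on both sides of each gap.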
Major LLMs in 2025
The LLM landscape is evolving rapidly. Here are the most important models to know:
Closed-Source (API-Only) Models
| Model | Provider | Parameters | Context Window | Key Strengths |
|---|---|---|---|---|
| GPT-4o | OpenAI | ~1.8T (estimated) | 128K tokens | Multimodal, fast, strong reasoning, function calling |
| GPT-4 Turbo | OpenAI | ~1.8T (estimated) | 128K tokens | Strong coding, vision, JSON mode |
| Claude 4 | Anthropic | Not disclosed | 1M tokens | Extended thinking, agentic coding, long context, safety |
| Claude 4 Sonnet | Anthropic | Not disclosed | 200K tokens | Best balance of speed and intelligence, strong coding |
| Gemini 2.5 Pro | Google | Not disclosed | 1M tokens | Long context, multimodal, grounding with Google Search |
| Command R+ | Cohere | 104B | 128K tokens | RAG-optimized, multilingual, enterprise focus |
Open-Weight Models
| Model | Provider | Parameters | Context Window | Key Strengths |
|---|---|---|---|---|
| LLaMA 3 405B | Meta | 405B | 128K tokens | Most capable open model, competitive with GPT-4 |
| LLaMA 3 70B | Meta | 70B | 128K tokens | Strong all-around, can run on multi-GPU setups |
| LLaMA 3 8B | Meta | 8B | 128K tokens | Runs on consumer GPUs, great for fine-tuning |
| Mistral Large 2 | Mistral AI | 123B | 128K tokens | Strong multilingual, coding, function calling |
| Mixtral 8x22B | Mistral AI | 141B (39B active) | 64K tokens | Mixture of Experts, efficient inference |
| Qwen 2.5 72B | Alibaba | 72B | 128K tokens | Strong math, coding, multilingual (especially Chinese) |
| Phi-4 | Microsoft | 14B | 16K tokens | Exceptionally capable for its size, strong reasoning |
| DeepSeek-V3 | DeepSeek | 671B (37B active) | 128K tokens | MoE architecture, strong coding and math, cost-efficient |
Parameters and Scale
The number of parameters in an LLM is one of the most discussed metrics, but it is not the only factor that determines capability. Training data quality, training methodology (RLHF, DPO), and architecture innovations all play critical roles.
- 1B parameters = 1 billion learnable weights in the neural network (~2 GB in FP16)
- 7B–13B: Can run on a single consumer GPU (24 GB VRAM) with quantization
- 70B: Requires 2–4 high-end GPUs or cloud deployment
- 400B+: Requires large GPU clusters, typically accessed via API
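The sizing bullets above follow from simple arithmetic: parameter count times bytes per parameter, for the weights only (activations, KV cache, and serving overhead add more in practice).

```python
# Rough memory-footprint arithmetic for model weights only.
# 1B params in FP16 (2 bytes each) is ~2 GB of weight storage.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params_billions: float, dtype: str = "fp16") -> float:
    """Approximate weight storage in GB (treating 1 GB as 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]

print(weight_memory_gb(7))           # 14.0 GB  -> fits a 24 GB GPU
print(weight_memory_gb(70))          # 140.0 GB -> needs multiple GPUs
print(weight_memory_gb(70, "int4"))  # 35.0 GB  -> why quantization helps
```

This is why a 7B model fits comfortably on a single 24 GB consumer GPU while a 70B model does not, and why int4 quantization brings larger models within reach of smaller hardware.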
A Mixture of Experts (MoE) model like Mixtral 8x22B has 141B total parameters but activates only 39B per token, making inference much faster and cheaper than a dense model of the same total size.
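The routing idea can be sketched as a toy top-k selection. Expert names and scores below are invented; in a real MoE, a learned router network scores the experts for each token, and only the selected experts' parameters are used.

```python
# Toy top-k expert routing: only k experts' parameters run per token,
# which is why an MoE model's active parameter count is far smaller
# than its total parameter count.

def top_k_experts(scores: dict[str, float], k: int = 2) -> list[str]:
    """Pick the k highest-scoring experts for one token."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical router scores for a single token:
router_scores = {"expert_0": 0.1, "expert_1": 2.3,
                 "expert_2": 0.7, "expert_3": 1.9}
print(top_k_experts(router_scores))  # ['expert_1', 'expert_3']
```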
Core Capabilities
Modern LLMs are remarkably versatile. Here are their primary capabilities:
Text Generation
The foundational capability of all LLMs. They can write articles, emails, marketing copy, fiction, documentation, and virtually any other text format. Quality varies by model — frontier models like Claude 4 and GPT-4o produce text that is often indistinguishable from expert human writing.
Code Generation and Understanding
LLMs have become powerful coding assistants. They can write code in dozens of languages, debug existing code, explain complex functions, convert between languages, and even build complete applications. Specialized coding models (like Codestral and DeepSeek-Coder) push this capability further.
```python
# Prompt: "Write a Python function that calculates the Fibonacci
# sequence using memoization"
# LLM Output:
def fibonacci(n: int, memo: dict | None = None) -> int:
    """Calculate the nth Fibonacci number using memoization."""
    if memo is None:  # avoid the mutable-default-argument pitfall
        memo = {}
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fibonacci(n - 1, memo) + fibonacci(n - 2, memo)
    return memo[n]

# Usage
print(fibonacci(50))  # 12586269025 (instant, no redundant computation)
```
Reasoning and Analysis
Frontier LLMs can perform multi-step reasoning, solve mathematical problems, analyze complex scenarios, and draw logical conclusions. "Extended thinking" or "chain-of-thought" capabilities in models like Claude 4 and o1 allow them to work through problems step by step before providing an answer.
Translation
LLMs provide high-quality translation between languages, often rivaling dedicated translation systems. They excel at nuanced, context-aware translation that captures tone and intent, not just literal meaning.
Summarization
Given long documents, articles, or conversation histories, LLMs can produce concise, accurate summaries. Models with very large context windows (Claude 4 and Gemini 2.5 Pro, both at 1M tokens) can summarize entire books.
Limitations
Despite their impressive capabilities, LLMs have significant limitations that every practitioner must understand:
- Hallucinations: LLMs can generate confident, plausible-sounding text that is factually incorrect. They do not "know" facts — they predict statistically likely text. Always verify factual claims from LLM output, especially for medical, legal, and financial content.
- Bias: Models reflect biases present in their training data, including cultural, gender, racial, and political biases. This can lead to unfair or harmful outputs in sensitive applications.
- Cost: Frontier LLM API calls cost roughly $1–$75 per million tokens, depending on the model and whether the tokens are input or output. For high-volume applications (millions of requests/day), this adds up to thousands of dollars daily. Smaller, specialized models can be 100x cheaper.
- Knowledge cutoff: LLMs only know information from their training data. They cannot access real-time information unless connected to external tools (search, databases, APIs).
- Latency: Generating long responses takes seconds to tens of seconds. For real-time applications requiring sub-100ms response times, LLMs are often too slow.
- Context window limits: Even models with 1M token windows can struggle with very long inputs. Performance often degrades in the middle of very long contexts ("lost in the middle" problem).
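The cost concern above is easy to sanity-check with back-of-the-envelope arithmetic. The request volume, token counts, and prices below are illustrative; check your provider's current pricing.

```python
# Back-of-the-envelope API cost for a high-volume workload.

def daily_cost(requests_per_day: int, in_tokens: int, out_tokens: int,
               in_price: float, out_price: float) -> float:
    """in_price / out_price are USD per 1M input / output tokens."""
    cost_per_request = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return requests_per_day * cost_per_request

# 1M requests/day, 500 input + 200 output tokens each,
# at an illustrative $2.50 / $10.00 per 1M tokens:
print(daily_cost(1_000_000, 500, 200, 2.50, 10.00))  # ~3250.0 USD/day
```

At a million requests per day, even modest per-request costs compound into thousands of dollars daily, which is exactly where routing traffic to a smaller model pays off.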
Use Cases by Industry
| Industry | Use Case | Recommended Models |
|---|---|---|
| Software Development | Code generation, code review, debugging, documentation | Claude 4, GPT-4o, Codestral, DeepSeek-Coder |
| Customer Support | Chatbots, email drafting, ticket classification, FAQ answers | Claude 4 Sonnet, GPT-4o mini, Command R+ |
| Healthcare | Medical documentation, patient communication, research summaries | GPT-4 (with guardrails), Med-PaLM 2, fine-tuned LLaMA |
| Legal | Contract analysis, case research, document drafting | Claude 4 (long context), GPT-4 Turbo |
| Finance | Report generation, market analysis, compliance checks | GPT-4o, Claude 4 Sonnet, fine-tuned models |
| Education | Tutoring, content creation, grading assistance, personalized learning | Claude 4, GPT-4o, LLaMA 3 70B |
| Marketing | Copy writing, SEO content, social media, email campaigns | Claude 4 Sonnet, GPT-4o, Mistral Large |
| Research | Literature review, hypothesis generation, data analysis | Claude 4 (1M context), Gemini 2.5 Pro |
When to Use LLMs vs Other Model Types
LLMs are versatile, but they are not always the right choice. Here is a practical decision guide:
✅ Use an LLM When
- You need open-ended text generation
- The task requires reasoning or multi-step logic
- You need a conversational interface
- The task is complex and hard to define with rules
- You need flexibility across many task types
- Latency of 1–10 seconds is acceptable
❌ Use a Different Model When
- Simple classification → Use BERT/DistilBERT (faster, cheaper)
- Semantic search → Use embedding models (purpose-built)
- Image analysis → Use vision models like YOLO or SAM
- Speech transcription → Use Whisper or Deepgram
- Tabular data prediction → Use XGBoost or LightGBM
- Sub-100ms latency required → Use a specialized small model
- Millions of daily requests on a budget → Use fine-tuned small models
The Cost-Performance Tradeoff
Understanding pricing is essential for production LLM applications:
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Speed (tokens/sec) |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | ~80–100 |
| GPT-4o mini | $0.15 | $0.60 | ~120–150 |
| Claude 4 Sonnet | $3.00 | $15.00 | ~70–90 |
| Claude 4 Haiku | $0.25 | $1.25 | ~150–200 |
| Gemini 2.5 Pro | $1.25 | $5.00 | ~80–100 |
| LLaMA 3 70B (self-hosted) | ~$0.50* | ~$0.50* | ~30–60 |
| LLaMA 3 8B (self-hosted) | ~$0.05* | ~$0.05* | ~80–120 |
* Self-hosted costs are approximate and depend on GPU hardware, utilization rate, and hosting provider. Costs shown are compute-only, excluding engineering and maintenance overhead.
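Using the table's published API prices, a quick comparison for a hypothetical workload (100K requests/day, 1,000 input and 300 output tokens per request) shows how widely monthly costs diverge:

```python
# Monthly cost comparison for one workload across the API models above
# (prices copied from the table; self-hosted rows excluded).

PRICES = {  # USD per 1M tokens: (input, output)
    "GPT-4o":          (2.50, 10.00),
    "GPT-4o mini":     (0.15, 0.60),
    "Claude 4 Sonnet": (3.00, 15.00),
    "Gemini 2.5 Pro":  (1.25, 5.00),
}

def monthly_cost(model: str, requests: int, in_tok: int, out_tok: int,
                 days: int = 30) -> float:
    """Approximate monthly USD cost for a fixed daily workload."""
    inp, outp = PRICES[model]
    return days * requests * (in_tok * inp + out_tok * outp) / 1e6

for model in PRICES:
    cost = monthly_cost(model, requests=100_000, in_tok=1_000, out_tok=300)
    print(f"{model:<16} ${cost:>10,.2f}/month")
```

For this workload the spread is large: GPT-4o mini lands near $1K/month while Claude 4 Sonnet exceeds $20K/month, which is why prototyping on a frontier model and then downshifting for production is such a common pattern.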
Key Takeaways
- LLMs are the most versatile AI model type, capable of text generation, coding, reasoning, translation, and analysis.
- Most modern LLMs use decoder-only Transformer architectures, with sizes ranging from 1B to 1T+ parameters.
- Both closed-source (GPT-4, Claude 4, Gemini) and open-weight (LLaMA 3, Mistral, Qwen) models are available, each with different tradeoffs.
- LLMs have significant limitations including hallucinations, bias, high cost, and latency.
- For many specific tasks (classification, search, tabular prediction), specialized models outperform LLMs at a fraction of the cost.
- The best practice is to prototype with a frontier LLM, then optimize with smaller or specialized models for production.
Lilly Tech Systems