# RAG Architecture
Understand the two core pipelines, key components, and architecture patterns that power modern RAG systems.
## Two Pipelines
Every RAG system has two distinct pipelines that work together:
### Offline Pipeline (Indexing)
Runs ahead of time to prepare your knowledge base. This happens once (and is re-run when data changes):
```
Documents (PDFs, web pages, databases, ...)
        ↓
1. INGEST  → Load and extract text from sources
        ↓
2. CHUNK   → Split text into manageable pieces
        ↓
3. EMBED   → Convert chunks to vectors (numbers)
        ↓
4. INDEX   → Store vectors in a vector database

// Result: A searchable index of your knowledge base
```
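The chunk → embed → index steps can be sketched in a few lines of Python. The sentence-packing chunker and the bag-of-words "embedding" below are toy stand-ins for a real text splitter and embedding model, and a plain list of `(chunk, vector)` pairs stands in for the vector database:

```python
import re
from collections import Counter

def chunk(text: str, size: int = 60) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `size` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > size:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

# INGEST (here: a plain string) → CHUNK → EMBED → INDEX
document = ("To reset your password, go to Settings. "
            "Two-factor auth is under Security. "
            "Invoices live on the Billing page.")
index = [(c, embed(c)) for c in chunk(document)]
```

A real pipeline swaps `embed` for a model such as OpenAI `text-embedding-3` and `index` for a vector database, but the data flow is the same.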
### Online Pipeline (Query)
Runs in real-time when a user asks a question:
```
User Query: "How do I reset my password?"
        ↓
1. EMBED QUERY → Convert question to a vector
        ↓
2. RETRIEVE    → Find most similar chunks in vector DB
        ↓
3. AUGMENT     → Add retrieved chunks to the LLM prompt
        ↓
4. GENERATE    → LLM produces answer from context
        ↓
Response: "To reset your password, go to Settings > ..."
```
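The first three online steps can be sketched the same way. Bag-of-words counts stand in for real embeddings, cosine similarity over them stands in for the vector search, and the resulting `prompt` is what would be sent to the LLM in step 4:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# A tiny pre-built index (normally produced by the offline pipeline).
chunks = [
    "To reset your password, go to Settings > Security.",
    "Invoices are available on the Billing page.",
]
index = [(c, embed(c)) for c in chunks]

# 1. EMBED QUERY  2. RETRIEVE the most similar chunk  3. AUGMENT the prompt
query = "How do I reset my password?"
q_vec = embed(query)
top = max(index, key=lambda item: cosine(q_vec, item[1]))
prompt = f"Answer using only this context:\n{top[0]}\n\nQuestion: {query}"
# 4. GENERATE: `prompt` would now go to the LLM.
```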
## Key Components
| Component | Purpose | Examples |
|---|---|---|
| Document Loaders | Load data from various sources | PyPDF, BeautifulSoup, Unstructured |
| Text Splitters | Break documents into chunks | RecursiveCharacterTextSplitter, SentenceSplitter |
| Embedding Models | Convert text to vectors | OpenAI text-embedding-3, Cohere embed, BGE |
| Vector Stores | Store and search vectors | Pinecone, ChromaDB, Weaviate, Qdrant, pgvector |
| Retrievers | Find relevant documents | Similarity search, MMR, self-query, ensemble |
| LLMs | Generate answers from context | Claude, GPT-4, Gemini, Llama, Mistral |
## Architecture Patterns
### 1. Naive RAG
The simplest pattern: retrieve, then generate. Good starting point but has limitations.
```
query → embed → vector_search → top_k_chunks → LLM → answer

// Pros: Simple, fast to implement
// Cons: No query optimization, no reranking,
//       retrieved chunks may not be relevant
```
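The whole naive pattern fits in one function. Here `retrieve` and `llm` are hypothetical stand-in callables; a real system would call a vector store and a model API at those points:

```python
def naive_rag(query, retrieve, llm, k=3):
    """Naive RAG: retrieve(query, k) -> list[str]; llm(prompt) -> str."""
    context = "\n\n".join(retrieve(query, k))
    prompt = ("Answer from the context below. If the answer is not there, say so.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return llm(prompt)

# Usage with trivial stand-ins (keyword-overlap retriever, echo 'LLM'):
docs = ["Reset passwords under Settings > Security.",
        "Invoices are on the Billing page."]
fake_retrieve = lambda q, k: [d for d in docs
                              if any(w in d.lower() for w in q.lower().split())][:k]
fake_llm = lambda prompt: prompt.splitlines()[-1]  # echoes the question line
answer = naive_rag("password reset?", fake_retrieve, fake_llm)
```

Because retrieval happens exactly once with the raw query, any mismatch between the user's wording and the documents goes straight into the prompt — which is what the patterns below try to fix.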
### 2. Advanced RAG
Adds pre-retrieval and post-retrieval optimization steps:
```
// Pre-retrieval: Optimize the query
query → query_rewrite → multi_query
        ↓
// Retrieval: Multiple strategies
vector_search + keyword_search (hybrid)
        ↓
// Post-retrieval: Rerank and filter
rerank → compress → filter
        ↓
// Generate with optimized context
LLM → answer with citations
```
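One concrete way to merge the hybrid (vector + keyword) result lists is Reciprocal Rank Fusion (RRF), a common technique that combines rankings without having to normalize the two searches' score scales. A minimal sketch (the `doc_*` names are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# doc_b ranks well in both lists, so it wins even though neither list put
# it first with the other's scoring in mind.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf([vector_hits, keyword_hits])
```

The fused list would then go through reranking, compression, and filtering before reaching the LLM.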
### 3. Modular RAG
A flexible architecture where each component is swappable. The system can route queries to different retrieval strategies based on the question type:
```
                query → router
                    ↓
          ┌─────────┼─────────┐
          ↓         ↓         ↓
       vector    keyword     SQL
       search    search     query
          ↓         ↓         ↓
          └─────────┼─────────┘
                    ↓
              merge & rerank
                    ↓
               LLM → answer
```
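A toy router in the spirit of the diagram picks a strategy from simple query features. This keyword heuristic is purely illustrative; production routers more often use an LLM or a trained classifier to make this decision:

```python
def route(query: str) -> str:
    """Crude heuristic router: map a query to a retrieval strategy name."""
    q = query.lower()
    if any(w in q for w in ("average", "count", "total", "sum")):
        return "sql_query"       # aggregate questions → structured data
    if q.startswith('"') or " exact " in f" {q} ":
        return "keyword_search"  # quoted/exact phrases → lexical search
    return "vector_search"       # default: semantic similarity

route("What is the total count of orders?")   # "sql_query"
route('"connection refused" in the logs')     # "keyword_search"
route("How do I reset my password?")          # "vector_search"
```

Each strategy's results would then feed the shared merge-and-rerank stage, so strategies can be added or swapped without touching the rest of the system.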
## Frameworks
Two popular frameworks simplify building RAG systems:
### LangChain
The most widely used framework for LLM applications, including RAG. It provides document loaders, text splitters, embeddings, vector stores, retrievers, and chains, and is available in Python and JavaScript.
### LlamaIndex
Purpose-built for RAG. Excels at data ingestion, indexing, and query engines. Great for complex document structures and multi-step retrieval.
## What's Next?
The next lesson covers data ingestion — how to load and preprocess documents from various sources.
Lilly Tech Systems