# RAG Architecture
Understand the two core pipelines, key components, and architecture patterns that power modern RAG systems.
## Two Pipelines
Every RAG system has two distinct pipelines that work together:
### Offline Pipeline (Indexing)
Runs ahead of time to prepare your knowledge base. This happens once (and is re-run when data changes):
```
Documents (PDFs, web pages, databases, ...)
        ↓
1. INGEST  → Load and extract text from sources
        ↓
2. CHUNK   → Split text into manageable pieces
        ↓
3. EMBED   → Convert chunks to vectors (numbers)
        ↓
4. INDEX   → Store vectors in a vector database

// Result: A searchable index of your knowledge base
```
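The chunk → embed → index steps can be sketched in a few lines of Python. The sentence-packing chunker and the bag-of-words "embedding" below are toy stand-ins for a real text splitter and embedding model, and a plain list of `(chunk, vector)` pairs stands in for the vector database:

```python
import re
from collections import Counter

def chunk(text: str, size: int = 60) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `size` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > size:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

# INGEST (here: a plain string) → CHUNK → EMBED → INDEX
document = ("To reset your password, go to Settings. "
            "Two-factor auth is under Security. "
            "Invoices live on the Billing page.")
index = [(c, embed(c)) for c in chunk(document)]
```

A real pipeline swaps `embed` for a model such as OpenAI `text-embedding-3` and `index` for a vector database, but the data flow is the same.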
### Online Pipeline (Query)
Runs in real-time when a user asks a question:
```
User Query: "How do I reset my password?"
        ↓
1. EMBED QUERY → Convert question to a vector
        ↓
2. RETRIEVE    → Find most similar chunks in vector DB
        ↓
3. AUGMENT     → Add retrieved chunks to the LLM prompt
        ↓
4. GENERATE    → LLM produces answer from context
        ↓
Response: "To reset your password, go to Settings > ..."
```
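The first three online steps can be sketched the same way. Bag-of-words counts stand in for real embeddings, cosine similarity over them stands in for the vector search, and the resulting `prompt` is what would be sent to the LLM in step 4:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# A tiny pre-built index (normally produced by the offline pipeline).
chunks = [
    "To reset your password, go to Settings > Security.",
    "Invoices are available on the Billing page.",
]
index = [(c, embed(c)) for c in chunks]

# 1. EMBED QUERY  2. RETRIEVE the most similar chunk  3. AUGMENT the prompt
query = "How do I reset my password?"
q_vec = embed(query)
top = max(index, key=lambda item: cosine(q_vec, item[1]))
prompt = f"Answer using only this context:\n{top[0]}\n\nQuestion: {query}"
# 4. GENERATE: `prompt` would now go to the LLM.
```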
## Key Components
| Component | Purpose | Examples |
|---|---|---|
| Document Loaders | Load data from various sources | PyPDF, BeautifulSoup, Unstructured |
| Text Splitters | Break documents into chunks | RecursiveCharacterTextSplitter, SentenceSplitter |
| Embedding Models | Convert text to vectors | OpenAI text-embedding-3, Cohere embed, BGE |
| Vector Stores | Store and search vectors | Pinecone, ChromaDB, Weaviate, Qdrant, pgvector |
| Retrievers | Find relevant documents | Similarity search, MMR, self-query, ensemble |
| LLMs | Generate answers from context | Claude, GPT-4, Gemini, Llama, Mistral |
## Architecture Patterns
### 1. Naive RAG
The simplest pattern: retrieve, then generate. Good starting point but has limitations.
```
query → embed → vector_search → top_k_chunks → LLM → answer

// Pros: Simple, fast to implement
// Cons: No query optimization, no reranking,
//       retrieved chunks may not be relevant
```
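The whole naive pattern fits in one function. Here `retrieve` and `llm` are hypothetical stand-in callables; a real system would call a vector store and a model API at those points:

```python
def naive_rag(query, retrieve, llm, k=3):
    """Naive RAG: retrieve(query, k) -> list[str]; llm(prompt) -> str."""
    context = "\n\n".join(retrieve(query, k))
    prompt = ("Answer from the context below. If the answer is not there, say so.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return llm(prompt)

# Usage with trivial stand-ins (keyword-overlap retriever, echo 'LLM'):
docs = ["Reset passwords under Settings > Security.",
        "Invoices are on the Billing page."]
fake_retrieve = lambda q, k: [d for d in docs
                              if any(w in d.lower() for w in q.lower().split())][:k]
fake_llm = lambda prompt: prompt.splitlines()[-1]  # echoes the question line
answer = naive_rag("password reset?", fake_retrieve, fake_llm)
```

Because retrieval happens exactly once with the raw query, any mismatch between the user's wording and the documents goes straight into the prompt — which is what the patterns below try to fix.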
### 2. Advanced RAG
Adds pre-retrieval and post-retrieval optimization steps:
```
// Pre-retrieval: Optimize the query
query → query_rewrite → multi_query
        ↓
// Retrieval: Multiple strategies
vector_search + keyword_search (hybrid)
        ↓
// Post-retrieval: Rerank and filter
rerank → compress → filter
        ↓
// Generate with optimized context
LLM → answer with citations
```
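One concrete way to merge the hybrid (vector + keyword) result lists is Reciprocal Rank Fusion (RRF), a common technique that combines rankings without having to normalize the two searches' score scales. A minimal sketch (the `doc_*` names are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# doc_b ranks well in both lists, so it wins even though neither list put
# it first with the other's scoring in mind.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf([vector_hits, keyword_hits])
```

The fused list would then go through reranking, compression, and filtering before reaching the LLM.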
### 3. Modular RAG
A flexible architecture where each component is swappable. The system can route queries to different retrieval strategies based on the question type:
```
                query → router
                    ↓
          ┌─────────┼─────────┐
          ↓         ↓         ↓
       vector    keyword     SQL
       search    search     query
          ↓         ↓         ↓
          └─────────┼─────────┘
                    ↓
              merge & rerank
                    ↓
               LLM → answer
```
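A toy router in the spirit of the diagram picks a strategy from simple query features. This keyword heuristic is purely illustrative; production routers more often use an LLM or a trained classifier to make this decision:

```python
def route(query: str) -> str:
    """Crude heuristic router: map a query to a retrieval strategy name."""
    q = query.lower()
    if any(w in q for w in ("average", "count", "total", "sum")):
        return "sql_query"       # aggregate questions → structured data
    if q.startswith('"') or " exact " in f" {q} ":
        return "keyword_search"  # quoted/exact phrases → lexical search
    return "vector_search"       # default: semantic similarity

route("What is the total count of orders?")   # "sql_query"
route('"connection refused" in the logs')     # "keyword_search"
route("How do I reset my password?")          # "vector_search"
```

Each strategy's results would then feed the shared merge-and-rerank stage, so strategies can be added or swapped without touching the rest of the system.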
## Frameworks
Two popular frameworks simplify building RAG systems:
### LangChain
The most widely used framework for LLM applications, including RAG. It provides document loaders, text splitters, embeddings, vector stores, retrievers, and chains, and is available in Python and JavaScript.
### LlamaIndex
Purpose-built for RAG. Excels at data ingestion, indexing, and query engines. Great for complex document structures and multi-step retrieval.
## What's Next?
The next lesson covers data ingestion — how to load and preprocess documents from various sources.
Lilly Tech Systems