Beginner

RAG Architecture

Understand the two core pipelines, key components, and architecture patterns that power modern RAG systems.

Two Pipelines

Every RAG system has two distinct pipelines that work together:

Offline Pipeline (Indexing)

Runs ahead of time to prepare your knowledge base. It runs once up front and is re-run whenever the source data changes:

Offline Pipeline
Documents (PDFs, web pages, databases, ...)
     ↓
1. INGEST    → Load and extract text from sources
     ↓
2. CHUNK     → Split text into manageable pieces
     ↓
3. EMBED     → Convert chunks to vectors (numbers)
     ↓
4. INDEX     → Store vectors in a vector database

// Result: A searchable index of your knowledge base
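The four offline steps can be sketched in a few lines of plain Python. The character-window chunker and the hash-based "embedding" below are illustrative stand-ins (a real system would use a library text splitter and a hosted embedding model), and the "vector database" is just an in-memory list:

```python
import hashlib
import math

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """2. CHUNK: split text into overlapping character windows."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

def embed(text: str, dims: int = 8) -> list[float]:
    """3. EMBED: toy stand-in that hashes words into a unit vector."""
    vec = [0.0] * dims
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# 4. INDEX: store (vector, chunk) pairs; a vector DB does this at scale
index = [(embed(c), c) for c in chunk("Password resets live under Settings. " * 20)]
```

The overlap between adjacent chunks is deliberate: it keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.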

Online Pipeline (Query)

Runs in real-time when a user asks a question:

Online Pipeline
User Query: "How do I reset my password?"
     ↓
1. EMBED QUERY  → Convert question to a vector
     ↓
2. RETRIEVE     → Find most similar chunks in vector DB
     ↓
3. AUGMENT      → Add retrieved chunks to the LLM prompt
     ↓
4. GENERATE     → LLM produces answer from context
     ↓
Response: "To reset your password, go to Settings > ..."
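Steps 1–2 of the query pipeline reduce to a nearest-neighbour search over the stored vectors. A minimal sketch, assuming unit-normalised vectors (so the dot product equals cosine similarity) and a toy two-document index:

```python
def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are assumed unit-normalised, so the dot product is cosine similarity
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec: list[float], index, top_k: int = 3) -> list[str]:
    """2. RETRIEVE: rank stored chunks by similarity to the query vector."""
    scored = sorted(index, key=lambda pair: cosine(query_vec, pair[0]), reverse=True)
    return [text for _, text in scored[:top_k]]

# Toy index of (vector, chunk) pairs; a vector DB plays this role in production
index = [
    ([1.0, 0.0], "To reset your password, go to Settings."),
    ([0.0, 1.0], "Our refund policy lasts 30 days."),
]
hits = retrieve([0.9, 0.1], index, top_k=1)
```

Here `[0.9, 0.1]` stands in for the embedded user question; it is closest to the password chunk, so that chunk is what gets passed to the AUGMENT step.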

Key Components

Component        | Purpose                        | Examples
Document Loaders | Load data from various sources | PyPDF, BeautifulSoup, Unstructured
Text Splitters   | Break documents into chunks    | RecursiveCharacterTextSplitter, SentenceSplitter
Embedding Models | Convert text to vectors        | OpenAI text-embedding-3, Cohere embed, BGE
Vector Stores    | Store and search vectors       | Pinecone, ChromaDB, Weaviate, Qdrant, pgvector
Retrievers       | Find relevant documents        | Similarity search, MMR, self-query, ensemble
LLMs             | Generate answers from context  | Claude, GPT-4, Gemini, Llama, Mistral
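One way to read this table: each row is a swappable interface, and the named products are interchangeable implementations of it. A sketch of what those contracts might look like (the names and signatures here are illustrative, not taken from any particular framework):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class EmbeddingModel(Protocol):
    """Converts text to vectors (e.g. OpenAI text-embedding-3, BGE)."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...

@runtime_checkable
class VectorStore(Protocol):
    """Stores and searches vectors (e.g. Pinecone, ChromaDB, pgvector)."""
    def add(self, vectors: list[list[float]], chunks: list[str]) -> None: ...
    def search(self, vector: list[float], top_k: int) -> list[str]: ...

@runtime_checkable
class LLM(Protocol):
    """Generates an answer from a context-augmented prompt."""
    def generate(self, prompt: str) -> str: ...
```

Coding against interfaces like these is what makes it cheap to swap, say, ChromaDB for pgvector later without touching the rest of the pipeline.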

Architecture Patterns

1. Naive RAG

The simplest pattern: retrieve, then generate. Good starting point but has limitations.

Naive RAG Pattern
query → embed → vector_search → top_k_chunks → LLM → answer

// Pros: Simple, fast to implement
// Cons: No query optimization, no reranking,
//       retrieved chunks may not be relevant
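Glued together, Naive RAG fits in about a dozen lines. The prompt template below and the `embed`, `search`, and `llm` callables are placeholders for whatever model and store you use:

```python
def naive_rag(question: str, embed, search, llm, top_k: int = 3) -> str:
    """Retrieve, then generate: no query rewriting, no reranking."""
    chunks = search(embed(question), top_k)          # RETRIEVE
    context = "\n\n".join(chunks)                    # AUGMENT
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)                               # GENERATE

# Wiring it up with trivial stand-ins for the three components:
answer = naive_rag(
    "How do I reset my password?",
    embed=lambda q: [1.0, 0.0],
    search=lambda v, k: ["Go to Settings > Security > Reset password."],
    llm=lambda p: "Go to Settings > Security > Reset password.",
)
```

Note the "only the context below" instruction: constraining the LLM to the retrieved chunks is what keeps answers grounded in your knowledge base.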

2. Advanced RAG

Adds pre-retrieval and post-retrieval optimization steps:

Advanced RAG Pattern
// Pre-retrieval: Optimize the query
query → query_rewrite → multi_query
     ↓
// Retrieval: Multiple strategies
              vector_search + keyword_search (hybrid)
                                    ↓
// Post-retrieval: Rerank and filter
              rerank → compress → filter
     ↓
// Generate with optimized context
                       LLM → answer with citations
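One concrete way to merge the hybrid result lists before reranking is Reciprocal Rank Fusion (RRF), which scores each document by summing `1 / (k + rank)` across the lists it appears in; `k = 60` is the conventional constant. A minimal sketch:

```python
def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: fuse ranked lists without comparing raw scores."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # from vector_search
keyword_hits = ["doc_b", "doc_d"]           # from keyword_search
merged = rrf_merge([vector_hits, keyword_hits])
# doc_b wins: it ranks well in both lists
```

RRF is popular for hybrid retrieval precisely because it only uses ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.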

3. Modular RAG

A flexible architecture where each component is swappable. The system can route queries to different retrieval strategies based on the question type:

Modular RAG Pattern
query → router
              ↓
   ┌─────────┼─────────┐
   ↓         ↓         ↓
vector   keyword   SQL
search   search    query
   ↓         ↓         ↓
   └─────────┼─────────┘
              ↓
        merge & rerank → LLM → answer

Recommendation: Start with Naive RAG to get a working system quickly. Then iterate toward Advanced or Modular RAG as you identify specific quality issues with retrieval or generation.
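The router step above can be sketched as a simple dispatch on surface features of the query. Production routers typically use an LLM call or a trained classifier instead; this keyword version is only illustrative, and the strategy names are the ones from the diagram:

```python
def route(query: str) -> str:
    """Pick a retrieval strategy based on what the question looks like."""
    q = query.lower()
    if any(w in q for w in ("how many", "average", "total", "count")):
        return "sql_query"       # aggregate questions -> structured data
    if '"' in query or q.startswith("find exact"):
        return "keyword_search"  # quoted/exact phrases -> lexical search
    return "vector_search"       # default: semantic similarity
```

Whatever each branch returns then flows into the shared merge-and-rerank step, so the rest of the pipeline stays identical across strategies.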

Frameworks

Two popular frameworks simplify building RAG systems:

🔗

LangChain

The most popular RAG framework. Provides document loaders, text splitters, embeddings, vector stores, retrievers, and chains. Available in Python and JavaScript.

🐰

LlamaIndex

Purpose-built for RAG. Excels at data ingestion, indexing, and query engines. Great for complex document structures and multi-step retrieval.

What's Next?

The next lesson covers data ingestion — how to load and preprocess documents from various sources.