RAG Best Practices
Production-ready RAG: optimization checklist, common failures, deployment, cost management, scaling, and FAQ.
RAG Optimization Checklist
- Data is cleaned and preprocessed (no garbage text)
- Chunk size tested and optimized (start with 1000 chars)
- Chunk overlap set to 10-20% of chunk size
- Metadata attached to every chunk (source, date, category)
- Embedding model selected and benchmarked
- Retrieval tested with real user queries
- Reranking added (often the single biggest quality improvement)
- Prompt instructs LLM to only use provided context
- Citations included in responses
- Evaluation dataset built with 50+ examples
- RAGAS metrics above target thresholds
- Fallback message for "no relevant context found"
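Two of the checklist items above, chunk size and overlap, can be sketched as a simple character-based splitter. This is a minimal illustration, not a production chunker; the 1000-character size and 15% overlap follow the starting points recommended above:

```python
def chunk_text(text, chunk_size=1000, overlap=150):
    """Split text into fixed-size chunks, each overlapping the previous
    one by `overlap` characters (here 15% of chunk_size)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

In practice you would split on sentence or paragraph boundaries rather than raw character offsets, but the size/overlap trade-off works the same way.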
Common Failure Modes and Fixes
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Hallucination | Answer includes facts not in context | Weak prompt, model too creative | Strengthen prompt, lower temperature, add reranking |
| Missing context | "I don't know" when answer exists | Poor chunking or embedding | Adjust chunk size, try different embeddings, multi-query |
| Wrong context | Answer from wrong document | Retriever returning irrelevant results | Add metadata filtering, hybrid search, reranking |
| Stale data | Answer uses outdated information | Knowledge base not updated | Implement incremental indexing, schedule re-ingestion |
| Incomplete answer | Answer is partially correct | Relevant info split across chunks | Use parent-child chunking, increase top-k, increase overlap |
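The most common fix in the table, "strengthen prompt", usually means constraining the model to the retrieved context and demanding citations. A hedged sketch of such a prompt template (the wording and the `source`/`text` chunk fields are illustrative, not a fixed API):

```python
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly:
"I don't have enough information to answer that."
Cite the source of each fact as [source_id].

Context:
{context}

Question: {question}
"""

def build_prompt(chunks, question):
    """Assemble a grounded prompt from retrieved chunks.
    Each chunk is assumed to be a dict with 'source' and 'text' keys."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return GROUNDED_PROMPT.format(context=context, question=question)
```

Pair this with a low temperature (0 to 0.3) to further reduce hallucination.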
Production Deployment Guide
- Separate Indexing from Serving: Run the indexing pipeline (ingest, chunk, embed) as a batch job. Run the query pipeline (retrieve, generate) as a web service. This lets you update the index without downtime.
- Cache Frequently Asked Questions: Cache answers for common queries. Use a semantic cache (embed the query, check if a similar query was recently answered) to avoid redundant LLM calls.
- Implement Monitoring: Track latency (retrieval time, generation time), quality metrics (faithfulness, relevancy), and usage patterns. Set up alerts for quality degradation.
- Add a Feedback Loop: Let users rate answers (thumbs up/down). Use this feedback to identify weak areas and expand your evaluation dataset.
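The semantic cache mentioned above can be sketched in a few lines: embed each answered query, and on a new query return the cached answer if its embedding is close enough to a previous one. The `embed` function here is assumed to be any callable returning a vector (e.g. a wrapper around your embedding API); the 0.92 threshold is an illustrative starting point to tune:

```python
import math

class SemanticCache:
    """Toy semantic cache: serves a cached answer when a new query's
    embedding is within `threshold` cosine similarity of a stored one."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        qv = self.embed(query)
        for ev, answer in self.entries:
            if self._cosine(qv, ev) >= self.threshold:
                return answer
        return None  # cache miss: call the full RAG pipeline

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

A production version would use an approximate-nearest-neighbor index and a TTL instead of a linear scan over a list.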
Cost Optimization
Use Tiered Models
Route simple questions to cheap models (Claude Haiku, GPT-4o mini). Only use expensive models for complex queries.
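Routing can start as a simple heuristic before you invest in a learned router. A sketch, where the model names and complexity signals (query length, context size, keyword markers) are illustrative assumptions:

```python
def pick_model(query, context_chunks):
    """Heuristic router: send short, simple queries to a cheap model
    and everything else to a stronger one. Names are placeholders."""
    complex_markers = ("compare", "analyze", "why", "explain", "trade-off")
    is_complex = (
        len(query.split()) > 30
        or len(context_chunks) > 8
        or any(m in query.lower() for m in complex_markers)
    )
    return "large-model" if is_complex else "small-model"
```

Log which model handled each query so you can check that the cheap tier is not degrading answer quality.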
Optimize Context Size
Send only the most relevant chunks to the LLM. Fewer tokens = lower cost. Reranking helps select only the best chunks.
Cache Aggressively
Cache embedding results, retrieval results, and LLM responses. A good cache can reduce LLM API calls by 30-50%.
Batch Embeddings
Embed documents in batches rather than one at a time. Most embedding APIs offer batch endpoints at lower per-token costs.
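Batching is a thin loop around whatever embedding call you use. In this sketch, `embed_batch` stands in for a provider's batch endpoint (an assumption, not a specific API), and `batch_size` should respect that provider's limits:

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` in groups of `batch_size` instead of one call per
    document. `embed_batch` takes a list of strings and returns a list
    of vectors in the same order."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))
    return vectors
```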
Scaling Strategies
- Vector database scaling: Use managed services (Pinecone, Weaviate Cloud) that handle sharding and replication automatically.
- Horizontal scaling: Run multiple instances of your query service behind a load balancer.
- Async processing: For bulk queries, use a job queue (Redis, SQS) to process requests asynchronously.
- CDN for static content: If your RAG serves documentation, cache generated pages at the CDN level.
Security: Data Access Control
In production RAG systems, different users should only see documents they have access to:
```python
# Store user permissions as metadata on each chunk
chunk.metadata["allowed_roles"] = ["engineering", "admin"]

# At query time, filter by the user's roles
def search_with_access_control(query, user_roles):
    return vectorstore.similarity_search(
        query,
        k=5,
        filter={"allowed_roles": {"$in": user_roles}},
    )
```
Keeping Data Fresh
- Incremental indexing: Only re-embed documents that changed since the last indexing run. Track changes with timestamps or hashes.
- Scheduled re-ingestion: Run nightly jobs to re-ingest from data sources (Notion, Confluence, databases).
- Webhook triggers: Set up webhooks so data sources notify your pipeline when content changes.
- TTL (Time-To-Live): Set expiration on cached embeddings and answers so stale data is automatically refreshed.
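The hash-based change tracking described above can be sketched with the standard library. Here `docs` maps a document ID to its current text, and `index_state` is the persisted map of last-indexed hashes (both names are assumptions for illustration):

```python
import hashlib

def find_changed(docs, index_state):
    """Return the IDs of docs whose content hash differs from the
    stored hash, plus the updated state to persist for the next run.
    Only these docs need to be re-chunked and re-embedded."""
    changed = []
    new_state = dict(index_state)
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) != digest:
            changed.append(doc_id)
            new_state[doc_id] = digest
    return changed, new_state
```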
Multi-Modal RAG
Modern RAG systems can handle more than text:
- Images: Use multi-modal embedding models such as CLIP to embed and retrieve images alongside text.
- Tables: Extract and structure tables from PDFs. Embed table descriptions and content separately.
- Code: Use code-specific embedding models for code search. Include file paths and function signatures as metadata.
- Audio/Video: Transcribe audio/video to text, then process through the standard RAG pipeline.
Frequently Asked Questions
How much data do I need for RAG?
RAG works with any amount of data. Even a single document can benefit from RAG. The technique scales from a handful of documents to millions. Start small and grow your knowledge base over time.
RAG vs long context: which is better?
For small document sets (under 50 pages), long context may be simpler. For large knowledge bases (100+ documents), RAG is more cost-effective and accurate. RAG also provides citations and supports incremental updates.
Which embedding model should I use?
Start with OpenAI's text-embedding-3-small (good quality, low cost). For higher quality, try text-embedding-3-large or Cohere's embed-v3. For self-hosted, BGE and E5 models are excellent open-source options.
How do I handle multiple languages?
Use multilingual embedding models (Cohere multilingual, BGE-M3). Some models embed text in any language into a shared vector space, enabling cross-language retrieval.
How often should I re-index?
Depends on how often your data changes. For static documentation, weekly or monthly is fine. For dynamic data (support tickets, news), daily or real-time indexing is better.
What if the answer is not in my knowledge base?
Your prompt should instruct the LLM to say "I don't have enough information" when the retrieved context does not contain the answer. You can also set a similarity score threshold and return a fallback message if no chunks meet it.
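The score-threshold fallback described above can be sketched as a pre-generation gate. The 0.75 threshold and the fallback wording are illustrative; tune the threshold against your evaluation set:

```python
FALLBACK = "I don't have enough information to answer that."

def answer_or_fallback(results, threshold=0.75):
    """`results` is a list of (chunk_text, similarity_score) pairs from
    the retriever. If no chunk clears the threshold, skip the LLM call
    and return the fallback message instead of weak context."""
    relevant = [text for text, score in results if score >= threshold]
    if not relevant:
        return None, FALLBACK
    return relevant, None
```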
Can I combine RAG with fine-tuning?
Yes, and it is often the best approach. Fine-tune the model to follow your desired output format and style, then use RAG to provide factual content. This gives you the best of both worlds.
Lilly Tech Systems