RAG Best Practices
Production-ready RAG: optimization checklist, common failures, deployment, cost management, scaling, and FAQ.
RAG Optimization Checklist
- Data is cleaned and preprocessed (no garbage text)
- Chunk size tested and optimized (start with 1000 chars)
- Chunk overlap set to 10-20% of chunk size
- Metadata attached to every chunk (source, date, category)
- Embedding model selected and benchmarked
- Retrieval tested with real user queries
- Reranking added (often the single biggest quality improvement)
- Prompt instructs LLM to only use provided context
- Citations included in responses
- Evaluation dataset built with 50+ examples
- RAGAS metrics above target thresholds
- Fallback message for "no relevant context found"
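Two of the checklist items above, chunk size and overlap, can be sketched as a simple character-based splitter. This is a minimal illustration, not a production chunker; the 1000-character size and 15% overlap follow the starting points recommended above:

```python
def chunk_text(text, chunk_size=1000, overlap=150):
    """Split text into fixed-size chunks, each overlapping the previous
    one by `overlap` characters (here 15% of chunk_size)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

In practice you would split on sentence or paragraph boundaries rather than raw character offsets, but the size/overlap trade-off works the same way.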
Common Failure Modes and Fixes
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Hallucination | Answer includes facts not in context | Weak prompt, model too creative | Strengthen prompt, lower temperature, add reranking |
| Missing context | "I don't know" when answer exists | Poor chunking or embedding | Adjust chunk size, try different embeddings, multi-query |
| Wrong context | Answer from wrong document | Retriever returning irrelevant results | Add metadata filtering, hybrid search, reranking |
| Stale data | Answer uses outdated information | Knowledge base not updated | Implement incremental indexing, schedule re-ingestion |
| Incomplete answer | Answer is partially correct | Relevant info split across chunks | Use parent-child chunking, increase top-k, increase overlap |
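The most common fix in the table, "strengthen prompt", usually means constraining the model to the retrieved context and demanding citations. A hedged sketch of such a prompt template (the wording and the `source`/`text` chunk fields are illustrative, not a fixed API):

```python
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly:
"I don't have enough information to answer that."
Cite the source of each fact as [source_id].

Context:
{context}

Question: {question}
"""

def build_prompt(chunks, question):
    """Assemble a grounded prompt from retrieved chunks.
    Each chunk is assumed to be a dict with 'source' and 'text' keys."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return GROUNDED_PROMPT.format(context=context, question=question)
```

Pair this with a low temperature (0 to 0.3) to further reduce hallucination.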
Production Deployment Guide
- Separate Indexing from Serving: Run the indexing pipeline (ingest, chunk, embed) as a batch job. Run the query pipeline (retrieve, generate) as a web service. This lets you update the index without downtime.
- Cache Frequently Asked Questions: Cache answers for common queries. Use a semantic cache (embed the query, check if a similar query was recently answered) to avoid redundant LLM calls.
- Implement Monitoring: Track latency (retrieval time, generation time), quality metrics (faithfulness, relevancy), and usage patterns. Set up alerts for quality degradation.
- Add a Feedback Loop: Let users rate answers (thumbs up/down). Use this feedback to identify weak areas and expand your evaluation dataset.
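The semantic cache mentioned above can be sketched in a few lines: embed each answered query, and on a new query return the cached answer if its embedding is close enough to a previous one. The `embed` function here is assumed to be any callable returning a vector (e.g. a wrapper around your embedding API); the 0.92 threshold is an illustrative starting point to tune:

```python
import math

class SemanticCache:
    """Toy semantic cache: serves a cached answer when a new query's
    embedding is within `threshold` cosine similarity of a stored one."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        qv = self.embed(query)
        for ev, answer in self.entries:
            if self._cosine(qv, ev) >= self.threshold:
                return answer
        return None  # cache miss: call the full RAG pipeline

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

A production version would use an approximate-nearest-neighbor index and a TTL instead of a linear scan over a list.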
Cost Optimization
Use Tiered Models
Route simple questions to cheap models (Claude Haiku, GPT-4o mini). Only use expensive models for complex queries.
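Routing can start as a simple heuristic before you invest in a learned router. A sketch, where the model names and complexity signals (query length, context size, keyword markers) are illustrative assumptions:

```python
def pick_model(query, context_chunks):
    """Heuristic router: send short, simple queries to a cheap model
    and everything else to a stronger one. Names are placeholders."""
    complex_markers = ("compare", "analyze", "why", "explain", "trade-off")
    is_complex = (
        len(query.split()) > 30
        or len(context_chunks) > 8
        or any(m in query.lower() for m in complex_markers)
    )
    return "large-model" if is_complex else "small-model"
```

Log which model handled each query so you can check that the cheap tier is not degrading answer quality.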
Optimize Context Size
Send only the most relevant chunks to the LLM. Fewer tokens = lower cost. Reranking helps select only the best chunks.
Cache Aggressively
Cache embedding results, retrieval results, and LLM responses. A good cache can reduce LLM API calls by 30-50%.
Batch Embeddings
Embed documents in batches rather than one at a time. Most embedding APIs offer batch endpoints at lower per-token costs.
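Batching is a thin loop around whatever embedding call you use. In this sketch, `embed_batch` stands in for a provider's batch endpoint (an assumption, not a specific API), and `batch_size` should respect that provider's limits:

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` in groups of `batch_size` instead of one call per
    document. `embed_batch` takes a list of strings and returns a list
    of vectors in the same order."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))
    return vectors
```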
Scaling Strategies
- Vector database scaling: Use managed services (Pinecone, Weaviate Cloud) that handle sharding and replication automatically.
- Horizontal scaling: Run multiple instances of your query service behind a load balancer.
- Async processing: For bulk queries, use a job queue (Redis, SQS) to process requests asynchronously.
- CDN for static content: If your RAG serves documentation, cache generated pages at the CDN level.
Security: Data Access Control
In production RAG systems, different users should only see documents they have access to:
```python
# Store user permissions as metadata on each chunk
chunk.metadata["allowed_roles"] = ["engineering", "admin"]

# At query time, filter by the user's roles
def search_with_access_control(query, user_roles):
    return vectorstore.similarity_search(
        query,
        k=5,
        filter={"allowed_roles": {"$in": user_roles}},
    )
```
Keeping Data Fresh
- Incremental indexing: Only re-embed documents that changed since the last indexing run. Track changes with timestamps or hashes.
- Scheduled re-ingestion: Run nightly jobs to re-ingest from data sources (Notion, Confluence, databases).
- Webhook triggers: Set up webhooks so data sources notify your pipeline when content changes.
- TTL (Time-To-Live): Set expiration on cached embeddings and answers so stale data is automatically refreshed.
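The hash-based change tracking described above can be sketched with the standard library. Here `docs` maps a document ID to its current text, and `index_state` is the persisted map of last-indexed hashes (both names are assumptions for illustration):

```python
import hashlib

def find_changed(docs, index_state):
    """Return the IDs of docs whose content hash differs from the
    stored hash, plus the updated state to persist for the next run.
    Only these docs need to be re-chunked and re-embedded."""
    changed = []
    new_state = dict(index_state)
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) != digest:
            changed.append(doc_id)
            new_state[doc_id] = digest
    return changed, new_state
```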
Multi-Modal RAG
Modern RAG systems can handle more than text:
- Images: Use multi-modal embedding models such as CLIP to embed and retrieve images alongside text.
- Tables: Extract and structure tables from PDFs. Embed table descriptions and content separately.
- Code: Use code-specific embedding models for code search. Include file paths and function signatures as metadata.
- Audio/Video: Transcribe audio/video to text, then process through the standard RAG pipeline.
Frequently Asked Questions
How much data do I need for RAG?
RAG works with any amount of data. Even a single document can benefit from RAG. The technique scales from a handful of documents to millions. Start small and grow your knowledge base over time.
RAG vs long context: which is better?
For small document sets (under 50 pages), long context may be simpler. For large knowledge bases (100+ documents), RAG is more cost-effective and accurate. RAG also provides citations and supports incremental updates.
Which embedding model should I use?
Start with OpenAI's text-embedding-3-small (good quality, low cost). For higher quality, try text-embedding-3-large or Cohere's embed-v3. For self-hosted, BGE and E5 models are excellent open-source options.
How do I handle multiple languages?
Use multilingual embedding models (Cohere multilingual, BGE-M3). Some models embed text in any language into a shared vector space, enabling cross-language retrieval.
How often should I re-index?
Depends on how often your data changes. For static documentation, weekly or monthly is fine. For dynamic data (support tickets, news), daily or real-time indexing is better.
What if the answer is not in my knowledge base?
Your prompt should instruct the LLM to say "I don't have enough information" when the retrieved context does not contain the answer. You can also set a similarity score threshold and return a fallback message if no chunks meet it.
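The score-threshold fallback described above can be sketched as a pre-generation gate. The 0.75 threshold and the fallback wording are illustrative; tune the threshold against your evaluation set:

```python
FALLBACK = "I don't have enough information to answer that."

def answer_or_fallback(results, threshold=0.75):
    """`results` is a list of (chunk_text, similarity_score) pairs from
    the retriever. If no chunk clears the threshold, skip the LLM call
    and return the fallback message instead of weak context."""
    relevant = [text for text, score in results if score >= threshold]
    if not relevant:
        return None, FALLBACK
    return relevant, None
```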
Can I combine RAG with fine-tuning?
Yes, and it is often the best approach. Fine-tune the model to follow your desired output format and style, then use RAG to provide factual content. This gives you the best of both worlds.
Lilly Tech Systems