Chunking Strategies
How you split documents into chunks dramatically affects retrieval quality. Learn the strategies and choose the right one for your data.
Why Chunking Matters
Chunking determines the granularity of your retrieval. Too large and you retrieve irrelevant content. Too small and you lose context. The right chunk size depends on your data and use case.
Chunking Strategies Compared
| Strategy | How It Works | Best For | Pros | Cons |
|---|---|---|---|---|
| Fixed-size | Split every N characters | Simple, uniform docs | Simple, predictable | Breaks mid-sentence |
| Recursive | Split by separators hierarchy | General-purpose text | Respects structure | Needs separator tuning |
| Sentence | Split at sentence boundaries | Narrative text, articles | Natural boundaries | Variable chunk sizes |
| Semantic | Split by meaning shifts | Mixed-topic documents | Topic-coherent chunks | Requires embeddings |
| Document | Split by headings/sections | Structured documents | Preserves sections | Needs structured input |
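Fixed-size splitting from the first row of the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API, and it shows the main drawback directly: chunks can break mid-word.

```python
def fixed_size_chunks(text: str, chunk_size: int = 20) -> list[str]:
    # Split every `chunk_size` characters, ignoring sentence and
    # word boundaries (the "breaks mid-sentence" drawback above).
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = fixed_size_chunks("The quick brown fox jumps over the lazy dog.", chunk_size=20)
print(chunks)
# → ['The quick brown fox ', 'jumps over the lazy ', 'dog.']
```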
1. Recursive Character Splitting
The most popular and recommended default strategy. It recursively splits text using a hierarchy of separators:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Target chunk size in characters
    chunk_overlap=200,  # Overlap between chunks
    separators=[
        "\n\n",  # First try: paragraph breaks
        "\n",    # Then: line breaks
        ". ",    # Then: sentence boundaries
        " ",     # Then: word boundaries
        "",      # Finally: character level
    ],
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
```
2. Semantic Chunking
Uses embeddings to detect topic shifts and create semantically coherent chunks:
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

chunks = splitter.split_documents(documents)
# Each chunk contains semantically related content
```
3. Sentence-Based Splitting
```python
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=50,
    tokens_per_chunk=256,
)
chunks = splitter.split_documents(documents)
```
Chunk Size Selection
The optimal chunk size depends on your use case:
| Chunk Size (chars) | ≈ Tokens | Best For | Trade-offs |
|---|---|---|---|
| Small (256-512) | ~100-200 | Precise Q&A, specific facts | May lose context; more chunks to search |
| Medium (512-1024) | ~200-400 | General purpose (recommended start) | Good balance of context and precision |
| Large (1024-2048) | ~400-800 | Summarization, complex topics | More context but may include irrelevant info |
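A rough characters-per-token heuristic helps translate between the two columns. The exact ratio depends on the tokenizer, language, and content; the helper below is an illustrative sketch with a configurable ratio, not a substitute for a real tokenizer such as `tiktoken`.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Crude heuristic: English prose averages a few characters per token.
    # Real counts depend on the model's tokenizer; use the actual
    # tokenizer when precision matters.
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("a" * 1000))  # → 250
```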
Chunk Overlap
Overlap ensures that information at chunk boundaries is not lost. A good default is 10-20% of chunk size:
```text
# chunk_size=1000, chunk_overlap=200
Chunk 1: [characters 0-999]
Chunk 2: [characters 800-1799]   ← 200 char overlap
Chunk 3: [characters 1600-2599]  ← 200 char overlap

# Without overlap, a sentence split across chunks would be
# incomplete in both chunks. With overlap, the full sentence
# appears in at least one chunk.
```
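The boundary arithmetic can be reproduced with a small helper: each new chunk starts `chunk_size - overlap` characters after the previous one. This sketch uses half-open `(start, end)` spans, and a short tail chunk may remain at the end of the text.

```python
def chunk_spans(total_len: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    # Each chunk starts (chunk_size - overlap) characters after the
    # previous one, so consecutive chunks share `overlap` characters.
    step = chunk_size - overlap
    return [(start, min(start + chunk_size, total_len))
            for start in range(0, total_len, step)]

print(chunk_spans(2600, chunk_size=1000, overlap=200))
# → [(0, 1000), (800, 1800), (1600, 2600), (2400, 2600)]
```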
Hierarchical Chunking (Parent-Child)
Create two levels of chunks: large parent chunks for context and small child chunks for precise retrieval:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Parent chunks: large, for context
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)

# Child chunks: small, for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
)

# Strategy: retrieve by child chunks, but pass the parent
# chunk to the LLM for more context.
parent_chunks = parent_splitter.split_documents(docs)
for parent in parent_chunks:
    children = child_splitter.split_documents([parent])
    for child in children:
        # Assumes each parent document carries an "id" in its metadata
        child.metadata["parent_id"] = parent.metadata["id"]
```
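The retrieval side of this strategy can be sketched with an in-memory dict standing in for the document store. Here `parent_store` and the shape of `retrieved_child` are illustrative assumptions, not a LangChain API: the point is only that the small chunk is matched, but its parent's text is what reaches the prompt.

```python
# Hypothetical stand-in for a document store: parent_id -> parent text.
parent_store = {
    "p1": "Full parent chunk with surrounding context ...",
}

# A retrieved child chunk carries its parent's id in metadata.
retrieved_child = {
    "text": "precise matching passage",
    "metadata": {"parent_id": "p1"},
}

# Swap the child for its parent before building the LLM prompt.
context = parent_store[retrieved_child["metadata"]["parent_id"]]
print(context)
```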
Recommended Default
Start with RecursiveCharacterTextSplitter at chunk_size=1000 and chunk_overlap=200. This works well for most use cases; optimize from there based on evaluation results.
Metadata-Enriched Chunks
Add context to each chunk to improve retrieval:
```python
# Prepend document title and section to each chunk
for chunk in chunks:
    title = chunk.metadata.get("title", "")
    section = chunk.metadata.get("section", "")
    chunk.page_content = f"Document: {title}\nSection: {section}\n\n{chunk.page_content}"
```
What's Next?
The next lesson covers vector databases and search — how to store and efficiently search your embedded chunks.