Intermediate

Chunking Strategies

How you split documents into chunks dramatically affects retrieval quality. Learn the strategies and choose the right one for your data.

Why Chunking Matters

Chunking determines the granularity of your retrieval. Too large and you retrieve irrelevant content. Too small and you lose context. The right chunk size depends on your data and use case.

Chunking Strategies Compared

| Strategy | How It Works | Best For | Pros | Cons |
|---|---|---|---|---|
| Fixed-size | Split every N characters | Simple, uniform docs | Simple, predictable | Breaks mid-sentence |
| Recursive | Split by a hierarchy of separators | General-purpose text | Respects structure | Needs separator tuning |
| Sentence | Split at sentence boundaries | Narrative text, articles | Natural boundaries | Variable chunk sizes |
| Semantic | Split at meaning shifts | Mixed-topic documents | Topic-coherent chunks | Requires embeddings |
| Document | Split by headings/sections | Structured documents | Preserves sections | Needs structured input |

1. Recursive Character Splitting

This is the most popular strategy and the recommended default. It splits text using a hierarchy of separators, trying each in order and recursing to the next only for pieces that are still too large:

Python - Recursive Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Target chunk size in characters
    chunk_overlap=200,      # Overlap between chunks
    separators=[
        "\n\n",   # First try: paragraph breaks
        "\n",     # Then: line breaks
        ". ",     # Then: sentence boundaries
        " ",      # Then: word boundaries
        ""        # Finally: character level
    ]
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
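The recursion itself is easy to sketch without LangChain. This is a simplified illustration of the idea, not LangChain's actual implementation: it drops the separators and omits the merge step that packs small pieces back up toward chunk_size.

```python
def recursive_split(text, separators, chunk_size=1000):
    """Split on the first separator; recurse with the next
    separator only for pieces that are still too large."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    # "" means fall back to splitting into individual characters
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) > chunk_size:
            chunks.extend(recursive_split(piece, rest, chunk_size))
        else:
            chunks.append(piece)
    return chunks

# A short paragraph survives intact; the long one falls
# through "\n", ". ", and finally splits on spaces
text = "para one.\n\n" + "word " * 300
chunks = recursive_split(text, ["\n\n", "\n", ". ", " ", ""], chunk_size=100)
```

Because each separator is only tried when the coarser one fails, well-structured text breaks at paragraph boundaries and only messy text degrades toward word-level splits.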

2. Semantic Chunking

Uses embeddings to detect topic shifts and create semantically coherent chunks:

Python - Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

chunks = splitter.split_documents(documents)
# Each chunk contains semantically related content
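The percentile breakpoint logic can be illustrated with toy 2-D embeddings standing in for real OpenAI embeddings. This is a simplified sketch of what SemanticChunker does, not its actual code: measure the distance between consecutive sentence embeddings and break where it exceeds a high percentile of all distances.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def semantic_split(sentences, embeddings, percentile=95):
    """Break where the distance between consecutive sentence
    embeddings is at or above the given percentile of all distances."""
    dists = [cosine_distance(embeddings[i], embeddings[i + 1])
             for i in range(len(embeddings) - 1)]
    idx = min(int(len(dists) * percentile / 100), len(dists) - 1)
    threshold = sorted(dists)[idx]
    chunks, current = [], [sentences[0]]
    for i, d in enumerate(dists):
        if d >= threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

sentences = ["Cats purr.", "Cats meow.", "Stocks rose.", "Bonds fell."]
embeddings = [[1, 0], [1, 0], [0, 1], [0, 1]]  # two topics, hand-made
chunks = semantic_split(sentences, embeddings)
# → ["Cats purr. Cats meow.", "Stocks rose. Bonds fell."]
```

The only split lands exactly at the topic shift, which is the behavior the percentile threshold is tuned for.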

3. Sentence-Based Splitting

This splitter measures chunk length in tokens from a sentence-transformers model rather than in characters, so chunks stay within what the embedding model can actually encode:

Python - Sentence Splitting
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=50,
    tokens_per_chunk=256
)

chunks = splitter.split_documents(documents)
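The core idea behind sentence-aware chunking can be sketched in plain Python: split at sentence boundaries with a naive regex (a rough stand-in for a real sentence tokenizer), then pack whole sentences into chunks up to a size limit.

```python
import re

def sentence_chunks(text, max_chars=200):
    """Split at sentence-ending punctuation, then pack whole
    sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)  # close the chunk, never mid-sentence
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = sentence_chunks("First sentence. Second sentence! Third?", max_chars=20)
# → ["First sentence.", "Second sentence!", "Third?"]
```

Note how chunk sizes vary with sentence length: that is the "variable chunk sizes" trade-off from the comparison table above.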

Chunk Size Selection

The optimal chunk size depends on your use case:

| Chunk Size | Tokens | Best For | Trade-offs |
|---|---|---|---|
| Small (256-512) | ~100-200 | Precise Q&A, specific facts | May lose context; more chunks to search |
| Medium (512-1024) | ~200-400 | General purpose (recommended start) | Good balance of context and precision |
| Large (1024-2048) | ~400-800 | Summarization, complex topics | More context but may include irrelevant info |

Chunk Overlap

Overlap ensures that information at chunk boundaries is not lost. A good default is 10-20% of chunk size:

Overlap Example
# chunk_size=1000, chunk_overlap=200

Chunk 1: [characters 0-999]
Chunk 2: [characters 800-1799]  ← 200 char overlap
Chunk 3: [characters 1600-2599] ← 200 char overlap

# Without overlap, a sentence split across chunks
# would be incomplete in both chunks. With overlap,
# the full sentence appears in at least one chunk.
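The boundaries above follow directly from the step size chunk_size - overlap. A minimal sketch that computes them (end positions are exclusive, so span (0, 1000) covers characters 0-999):

```python
def chunk_spans(text_len, chunk_size=1000, overlap=200):
    """Return (start, end) character spans; each chunk starts
    chunk_size - overlap characters after the previous one."""
    step = chunk_size - overlap
    return [(start, min(start + chunk_size, text_len))
            for start in range(0, text_len, step)]

spans = chunk_spans(2600, chunk_size=1000, overlap=200)
# → [(0, 1000), (800, 1800), (1600, 2600), (2400, 2600)]
```

The last span is a short tail; some splitters merge such tails into the previous chunk instead of emitting them separately.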

Hierarchical Chunking (Parent-Child)

Create two levels of chunks: large parent chunks for context and small child chunks for precise retrieval:

Python - Parent-Child Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Parent chunks: large, for context
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200
)

# Child chunks: small, for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)

# Strategy: Retrieve by child chunks, but pass
# the parent chunk to the LLM for more context
parent_chunks = parent_splitter.split_documents(docs)
for i, parent in enumerate(parent_chunks):
    parent.metadata["id"] = f"parent-{i}"  # assign an id for lookup
    children = child_splitter.split_documents([parent])
    for child in children:
        child.metadata["parent_id"] = parent.metadata["id"]

Start simple: Begin with RecursiveCharacterTextSplitter at chunk_size=1000 and chunk_overlap=200. This works well for most use cases. Optimize from there based on evaluation results.
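At query time, retrieved child chunks are swapped for their parents before prompting the LLM. A minimal sketch using plain dicts in place of the Document objects above (LangChain's ParentDocumentRetriever automates this same pattern):

```python
def expand_to_parents(retrieved_children, parents):
    """Replace retrieved child chunks with their (deduplicated)
    parent chunks to give the LLM fuller context."""
    seen, context = set(), []
    for child in retrieved_children:
        pid = child["parent_id"]
        if pid not in seen:  # two children may share one parent
            seen.add(pid)
            context.append(parents[pid])
    return context

# Hypothetical store: parent_id -> full parent text
parents = {
    "parent-0": "Full parent section about chunking strategies ...",
    "parent-1": "Full parent section about vector search ...",
}
children = [
    {"text": "small child chunk", "parent_id": "parent-0"},
    {"text": "another child chunk", "parent_id": "parent-1"},
]

context = expand_to_parents(children[:1], parents)
# → ["Full parent section about chunking strategies ..."]
```

Retrieval stays precise because similarity search runs over the small child chunks, while the LLM still sees the surrounding section.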

Metadata-Enriched Chunks

Add context to each chunk to improve retrieval:

Python - Enriched Chunks
# Prepend document title and section to each chunk
for chunk in chunks:
    title = chunk.metadata.get("title", "")
    section = chunk.metadata.get("section", "")
    chunk.page_content = f"Document: {title}\nSection: {section}\n\n{chunk.page_content}"

What's Next?

The next lesson covers vector databases and search — how to store and efficiently search your embedded chunks.