Document Chunking Strategies
Why Chunking Matters
You can't convert an entire document into a single vector. Two reasons:
- Embedding model length limits: Most models only support 512–8192 tokens of input. A long document gets truncated, losing information from the latter half.
- Retrieval precision: A vector represents the overall semantics of text. A 10-page document covering many topics produces an "averaged" vector — searching for any specific question won't match well. Splitting into focused chunks, each about one topic, makes search more precise.
Chunking is splitting long documents into pieces suited for embedding and retrieval.
Fixed-Size Chunking
The simplest approach: split at fixed character or token counts.
```python
def fixed_size_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
```
Overlap is crucial. Without it, sentences can be split mid-thought:
```
No overlap:
  Chunk 1: "...the refund review period is"
  Chunk 2: "3-5 business days..."

With overlap:
  Chunk 1: "...the refund review period is 3-5 business days."
  Chunk 2: "the refund review period is 3-5 business days. Refunds..."
```
Overlap prevents information loss at chunk boundaries.
Pros: Simple, predictable, fast. Cons: Ignores semantic boundaries, may split mid-paragraph or mid-sentence.
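A quick way to see the overlap behavior is to run the splitter on a toy string (the function is repeated here so the snippet is self-contained):

```python
def fixed_size_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

text = "abcdefghij" * 10  # 100 characters
chunks = fixed_size_chunk(text, chunk_size=30, overlap=5)
# Each chunk starts 25 characters after the previous one, so the last
# 5 characters of one chunk reappear at the start of the next:
assert chunks[0][-5:] == chunks[1][:5]
```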
Separator-Based Chunking
Smarter: split at natural semantic boundaries.
```python
def split_by_separators(text, separators=["\n\n", "\n", ". ", " "], chunk_size=500):
    """Recursive split: try large separators first, then smaller ones."""
    for i, sep in enumerate(separators):
        if sep in text:
            chunks = []
            for part in text.split(sep):
                if len(part) <= chunk_size:
                    chunks.append(part)
                else:
                    # Still too long: retry with the remaining, smaller separators
                    chunks.extend(split_by_separators(part, separators[i + 1:], chunk_size))
            return chunks
    # No separator matched: fall back to fixed-size splitting
    return fixed_size_chunk(text, chunk_size)
```
Priority: paragraph breaks (\n\n) → newlines (\n) → periods → spaces. Keep complete paragraphs and sentences whenever possible.
This is the core idea behind LangChain's RecursiveCharacterTextSplitter.
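The recursion can be condensed into a runnable sketch. `recursive_split` below is a simplified stand-in for the function above (it skips the merging of small adjacent parts that production splitters also do):

```python
def recursive_split(text, seps=("\n\n", "\n", ". ", " "), chunk_size=60):
    """Try the largest separator first; recurse with smaller ones as needed."""
    if len(text) <= chunk_size or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces

doc = "Intro paragraph about refunds.\n\nSecond paragraph. It has two sentences."
chunks = recursive_split(doc, chunk_size=40)
# The paragraph break is tried first, so both paragraphs survive intact.
```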
Semantic Chunking
The smartest approach: use embeddings to determine where to split.
Principle:
- Split document into sentences
- Compute embedding similarity between adjacent sentences
- When similarity drops sharply (topic change), split there
```python
def semantic_chunk(sentences, threshold=0.5):
    # Assumes embed() and cosine_similarity() come from your embedding
    # model and vector library.
    embeddings = [embed(s) for s in sentences]  # embed each sentence once
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim < threshold:
            # Sharp similarity drop: topic change, start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```
Pros: Semantically coherent chunks. Cons: Extra embedding computation, slower; threshold needs tuning.
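A self-contained toy run of the same loop. The 3-dimensional vectors are hand-made stand-ins for real sentence embeddings, and `cosine_similarity` is implemented inline:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

sentences = ["Refunds take 3-5 days.", "Refunds go to the original card.",
             "Shipping is free over $50.", "Shipping takes one week."]
embeddings = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1],   # refund topic
              [0.1, 1.0, 0.2], [0.0, 0.9, 0.3]]   # shipping topic

chunks, current = [], [sentences[0]]
for i in range(1, len(sentences)):
    if cosine_similarity(embeddings[i - 1], embeddings[i]) < 0.5:
        chunks.append(" ".join(current))  # similarity drop: new chunk
        current = [sentences[i]]
    else:
        current.append(sentences[i])
chunks.append(" ".join(current))
# chunks now holds one chunk per topic: refunds, then shipping
```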
Format-Specific Chunking
Markdown
Markdown has natural structure — headings are ideal split points:
```python
import re

def markdown_chunk(text, chunk_size=500):
    # Split just before each level 1-3 heading
    sections = re.split(r'\n(?=#{1,3} )', text)
    chunks = []
    for section in sections:
        if len(section) > chunk_size:
            # Oversized section: fall back to paragraph-level splitting
            # (split_by_paragraphs is a placeholder for your own splitter)
            chunks.extend(split_by_paragraphs(section))
        else:
            chunks.append(section)
    return chunks
```
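The heading regex can be checked in isolation; the sample document below is made up:

```python
import re

doc = "# Refunds\nDetails...\n## Review\nTakes 3-5 days.\n# Shipping\nFree over $50."
sections = re.split(r'\n(?=#{1,3} )', doc)
# The lookahead splits at each newline that precedes a heading,
# keeping the heading attached to its own section.
```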
Code
Split code by functions or classes, not character count:
```python
# Good: complete function
def calculate_tax(income):
    if income <= 36000:
        return income * 0.03
    elif income <= 144000:
        return income * 0.1 - 2520
```

```python
# Bad: function cut in half
def calculate_tax(income):
    if income <= 36000:
        return income * 0.03

# --- cut here ---

    elif income <= 144000:
        return income * 0.1 - 2520
```
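For Python source specifically, the standard library's `ast` module can extract complete top-level functions and classes, so no function is ever cut in half. A sketch:

```python
import ast

source = '''
def calculate_tax(income):
    return income * 0.03

def calculate_fee(amount):
    return amount * 0.01
'''

tree = ast.parse(source)
# One chunk per top-level function or class definition
chunks = [ast.get_source_segment(source, node) for node in tree.body
          if isinstance(node, (ast.FunctionDef, ast.ClassDef))]
```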
Choosing Chunk Size
| Size | Pros | Cons | Best For |
|---|---|---|---|
| Small (100–200 tokens) | Precise retrieval | Lacks context | Precise Q&A |
| Medium (300–500 tokens) | Balanced | — | General use |
| Large (500–1000 tokens) | Rich context | Less precise retrieval | Summarization, analysis |
Start with 300–500 tokens and adjust based on results.
Tips for Better Chunking
1. Preserve Metadata
```python
chunk = {
    "text": "Refund review takes 3-5 business days...",
    "metadata": {
        "source": "refund-policy.md",
        "section": "Review Process",
        "page": 3
    }
}
```
Metadata doesn't go into the embedding but is invaluable in search results (filtering, showing sources).
2. Add Context Prefixes
Prepend the section heading to each chunk:
Original: "Review period is 3-5 business days."
Enhanced: "Refund Policy > Review Process: Review period is 3-5 business days."
This helps the embedding model understand the chunk's context, improving retrieval precision.
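A minimal sketch of such a prefix helper (the function name and title strings are illustrative):

```python
def add_context_prefix(text, doc_title, section):
    """Prepend document and section titles so the embedding captures context."""
    return f"{doc_title} > {section}: {text}"

enhanced = add_context_prefix(
    "Review period is 3-5 business days.",
    "Refund Policy",
    "Review Process",
)
# → "Refund Policy > Review Process: Review period is 3-5 business days."
```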
3. Small-to-Big Retrieval
An advanced strategy: use small chunks for retrieval (more precise), but return the larger parent chunk to the LLM (more context).
Index: small chunks (200 tokens) → for vector search
Return: large chunks (800 tokens) → containing the small chunk + surrounding context
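A minimal sketch of the small-to-parent mapping, using a naive substring match as a stand-in for vector search (all names and data are made up):

```python
parents = {
    "p1": "Refund policy: full review section. Review takes 3-5 business "
          "days, refunds go to the original payment method, appeals ...",
    "p2": "Shipping policy: full delivery section. Shipping is free over "
          "$50, standard delivery takes one week, express options ...",
}

# Small chunks are what get embedded and searched; each one remembers
# which parent it was cut from.
small_chunks = [
    {"text": "Review takes 3-5 business days", "parent_id": "p1"},
    {"text": "Shipping is free over $50", "parent_id": "p2"},
]

def retrieve(query):
    # Stand-in for vector search: naive substring match over small chunks
    hit = next(c for c in small_chunks if query in c["text"])
    # Return the larger parent chunk so the LLM gets more context
    return parents[hit["parent_id"]]

context = retrieve("3-5 business days")
```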
Key Takeaways
- Chunking is a critical factor in RAG quality. Too large = imprecise retrieval, too small = lacking context.
- Start with recursive character splitting (paragraphs → sentences → characters) — the most practical general-purpose approach.
- Overlap matters — 10–20% overlap prevents information loss at boundaries.
- Start with 300–500 tokens as chunk size, adjust based on your task.
- Preserving metadata and adding context prefixes significantly improve retrieval quality.