Optimize document chunking for RAG performance and retrieval quality. Use this skill when splitting documents, choosing chunk sizes, implementing semantic chunking, or improving RAG retrieval accuracy. Activate when: chunking, split documents, chunk size, text splitting, document processing, RAG performance, semantic chunking, overlap.
Install
npx skillscat add latestaiagents/agent-skills/chunking-strategies Install via the SkillsCat registry.
SKILL.md
Chunking Strategies for RAG
Optimal chunking is the difference between good and great RAG performance.
Why Chunking Matters
Poor chunking causes:
- Context fragmentation (answers split across chunks)
- Irrelevant retrieval (too much noise in chunks)
- Lost relationships (parent-child content separated)
- Wasted tokens (chunks too large or too small)
Chunking Methods Comparison
| Method | Best For | Chunk Quality | Implementation |
|---|---|---|---|
| Fixed-size | Simple docs, uniform content | Medium | Easy |
| Recursive | Structured docs, markdown | High | Medium |
| Semantic | Complex docs, varied content | Highest | Complex |
| Parent-child | Hierarchical docs | High | Medium |
| Late chunking | Preserving context | Highest | Complex |
Pattern 1: Fixed-Size with Overlap
The baseline approach - simple but effective:
from langchain.text_splitter import RecursiveCharacterTextSplitter
def create_fixed_chunks(
text: str,
chunk_size: int = 512,
chunk_overlap: int = 50
) -> list[str]:
"""
Split text into fixed-size chunks with overlap.
Guidelines:
- chunk_size: 256-1024 tokens (512 is solid default)
- overlap: 10-20% of chunk_size
"""
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
return splitter.split_text(text)Pattern 2: Semantic Chunking
Group by meaning, not arbitrary boundaries:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
def create_semantic_chunks(text: str) -> list[str]:
"""
Split text based on semantic similarity between sentences.
Keeps related content together.
"""
embeddings = OpenAIEmbeddings()
splitter = SemanticChunker(
embeddings=embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95 # Higher = fewer, larger chunks
)
return splitter.split_text(text)Custom Semantic Chunking
import numpy as np
from sentence_transformers import SentenceTransformer
def semantic_chunk(
sentences: list[str],
model_name: str = "all-MiniLM-L6-v2",
threshold: float = 0.5
) -> list[list[str]]:
"""
Group sentences by semantic similarity.
"""
model = SentenceTransformer(model_name)
embeddings = model.encode(sentences)
chunks = []
current_chunk = [sentences[0]]
for i in range(1, len(sentences)):
# Cosine similarity between consecutive sentences
sim = np.dot(embeddings[i-1], embeddings[i]) / (
np.linalg.norm(embeddings[i-1]) * np.linalg.norm(embeddings[i])
)
if sim >= threshold:
current_chunk.append(sentences[i])
else:
chunks.append(current_chunk)
current_chunk = [sentences[i]]
chunks.append(current_chunk)
return chunksPattern 3: Parent-Child Chunking
Retrieve small, return with context:
from llama_index.core.node_parser import (
HierarchicalNodeParser,
SentenceSplitter,
get_leaf_nodes
)
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.retrievers import AutoMergingRetriever
def create_hierarchical_index(documents):
"""
Create parent-child chunk hierarchy.
Small chunks for retrieval, auto-merge to parents for context.
"""
# Define chunk sizes for each level
node_parser = HierarchicalNodeParser.from_defaults(
chunk_sizes=[2048, 512, 128] # Parent → Child → Leaf
)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
# Store all nodes
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
# Index only leaf nodes
index = VectorStoreIndex(
leaf_nodes,
storage_context=storage_context
)
# Retriever auto-merges to parents when siblings retrieved
retriever = AutoMergingRetriever(
index.as_retriever(similarity_top_k=12),
storage_context=storage_context,
simple_ratio_thresh=0.3 # Merge if 30%+ siblings retrieved
)
return retrieverPattern 4: Late Chunking (2026 Technique)
Embed full document first, then chunk - preserves global context:
def late_chunking(
document: str,
model,
chunk_size: int = 512
) -> list[dict]:
"""
Late chunking: embed document, then split embeddings.
Preserves document-level context in chunk embeddings.
Reference: Jina AI Late Chunking (2024)
"""
# 1. Get token-level embeddings for full document
tokens = model.tokenize(document)
token_embeddings = model.encode_tokens(tokens)
# 2. Split into chunks
chunks = []
for i in range(0, len(tokens), chunk_size):
chunk_tokens = tokens[i:i + chunk_size]
chunk_embeddings = token_embeddings[i:i + chunk_size]
# 3. Pool chunk embeddings (mean pooling)
chunk_vector = np.mean(chunk_embeddings, axis=0)
chunks.append({
"text": model.decode(chunk_tokens),
"embedding": chunk_vector
})
return chunksPattern 5: Markdown/Code-Aware Chunking
from langchain.text_splitter import (
MarkdownHeaderTextSplitter,
Language,
RecursiveCharacterTextSplitter
)
def chunk_markdown(text: str) -> list[dict]:
"""Split markdown by headers, preserving structure."""
headers_to_split_on = [
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=headers_to_split_on
)
return splitter.split_text(text)
def chunk_code(code: str, language: str = "python") -> list[str]:
"""Split code respecting language syntax."""
lang_map = {
"python": Language.PYTHON,
"javascript": Language.JS,
"typescript": Language.TS,
}
splitter = RecursiveCharacterTextSplitter.from_language(
language=lang_map.get(language, Language.PYTHON),
chunk_size=1000,
chunk_overlap=100
)
return splitter.split_text(code)Chunk Size Guidelines
| Content Type | Recommended Size | Overlap |
|---|---|---|
| Q&A / FAQ | 256-512 | 25-50 |
| Technical docs | 512-1024 | 50-100 |
| Legal documents | 1024-2048 | 100-200 |
| Code | 500-1000 | 50-100 |
| Conversations | 256-512 | 50-100 |
Evaluation: How to Know If Chunking Is Good
def evaluate_chunking(chunks: list[str], test_queries: list[dict]):
"""
Evaluate chunk quality with test queries.
test_queries format:
[{"query": "What is X?", "expected_chunk_contains": "X is..."}]
"""
results = {
"avg_chunk_size": np.mean([len(c) for c in chunks]),
"chunk_size_std": np.std([len(c) for c in chunks]),
"total_chunks": len(chunks),
"retrieval_hits": 0
}
for tq in test_queries:
# Check if expected content is in a single chunk
for chunk in chunks:
if tq["expected_chunk_contains"] in chunk:
results["retrieval_hits"] += 1
break
results["hit_rate"] = results["retrieval_hits"] / len(test_queries)
return resultsBest Practices
- Match chunk size to query length - Chunks should be similar size to expected queries
- Preserve meaning boundaries - Never split mid-sentence or mid-paragraph
- Include metadata - Add source, page, section info to each chunk
- Test with real queries - Evaluate on your actual use cases
- Consider retrieval model - Some embedding models prefer specific chunk sizes
Quick Decision Tree
What type of content?
├─ Structured (headers, sections)
│ └─ Use: Markdown/recursive splitter + hierarchy
├─ Unstructured (prose, articles)
│ └─ Use: Semantic chunking
├─ Code
│ └─ Use: Language-aware splitter
└─ Mixed
└─ Use: Parent-child with semantic leaves