christian-bromann

langchain-text-splitters

Guide to using text splitter integrations in LangChain including recursive, character, and semantic splitters

christian-bromann 3 1 Updated 3mo ago
GitHub

Install

npx skillscat add christian-bromann/langchain-skills/skills-langchain-integrations-text-splitters-python

Install via the SkillsCat registry.

SKILL.md

langchain-text-splitters (Python)

Overview

Text splitters divide large documents into smaller chunks that fit within model context windows and enable effective retrieval. Proper chunking is critical for RAG system performance.

Key Concepts

  • Chunk Size: Target size for each text chunk (in characters or tokens)
  • Chunk Overlap: Number of characters/tokens to overlap between chunks
  • Separators: Characters used to split text
  • Metadata: Preserved and enriched during splitting

Splitter Selection Decision Table

Splitter Best For Package Key Features
RecursiveCharacterTextSplitter General purpose langchain-text-splitters Hierarchical splitting
CharacterTextSplitter Simple splitting langchain-text-splitters Single separator
TokenTextSplitter Token-aware langchain-text-splitters Actual token counts
MarkdownHeaderTextSplitter Markdown langchain-text-splitters Preserves headers
SemanticChunker Semantic boundaries langchain-experimental AI-driven splitting

When to Choose Each Splitter

Choose RecursiveCharacterTextSplitter if:

  • General purpose text (default choice)
  • Want to preserve structure
  • Need balanced chunks

Choose TokenTextSplitter if:

  • Need precise token counts
  • Character counts unreliable

Choose SemanticChunker if:

  • Want AI to determine boundaries
  • Quality over speed

Code Examples

RecursiveCharacterTextSplitter (Recommended)

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Basic usage
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    add_start_index=True,  # Adds start_index to metadata
)

text = "Long document text here..."
chunks = splitter.split_text(text)

print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: {len(chunk)} characters")

# Split documents (preserves metadata)
from langchain_core.documents import Document

docs = [
    Document(
        page_content="Long text...",
        metadata={"source": "doc1.pdf", "page": 1}
    )
]

split_docs = splitter.split_documents(docs)
# Metadata preserved and enriched
print(split_docs[0].metadata)

CharacterTextSplitter

from langchain_text_splitters import CharacterTextSplitter

# Split by single separator
splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
)

chunks = splitter.split_text(text)

TokenTextSplitter (Token-Aware)

from langchain_text_splitters import TokenTextSplitter

# Split based on actual tokens
splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

chunks = splitter.split_text(text)

# Uses tiktoken for OpenAI token counting
# More accurate than character counting

Markdown Splitter

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown = """
# Header 1
Content 1

## Header 1.1
Content 1.1

# Header 2
Content 2
"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

splits = splitter.split_text(markdown)

# Each split preserves header hierarchy in metadata
for doc in splits:
    print(doc.metadata)
    print(doc.page_content)

Code Splitter

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# Python code splitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50,
)

python_code = """
def function1():
    pass

class MyClass:
    def method1(self):
        pass
"""

chunks = python_splitter.split_text(python_code)

# JavaScript splitter
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=500,
    chunk_overlap=50,
)

Semantic Chunker (Experimental)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# AI-driven semantic splitting
splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"  # or "standard_deviation", "interquartile"
)

chunks = splitter.split_text(text)
# Splits at semantic boundaries, not fixed sizes

Splitting with Vector Store Integration

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import WebBaseLoader

# Complete RAG pipeline
loader = WebBaseLoader("https://docs.example.com")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

split_docs = splitter.split_documents(docs)

vectorstore = FAISS.from_documents(
    split_docs,
    OpenAIEmbeddings()
)

# Ready for semantic search
results = vectorstore.similarity_search("query", k=4)

Custom Length Function

from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

# Use actual token counter
def tiktoken_len(text):
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=tiktoken_len,
)

Splitting Large PDFs

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF
loader = PyPDFLoader("large-document.pdf")
pages = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,
)

chunks = splitter.split_documents(pages)

print(f"{len(pages)} pages → {len(chunks)} chunks")

# Metadata includes original page number
for chunk in chunks:
    print(chunk.metadata)

Boundaries

What Agents CAN Do

Split text intelligently

  • Recursive splitting to preserve structure
  • Configure chunk size and overlap
  • Choose separators

Handle various formats

  • Plain text, markdown, code
  • Documents with metadata
  • Structured data

Optimize for use case

  • Balance size vs context
  • Token-based splitting
  • Semantic splitting

What Agents CANNOT Do

Guarantee semantic boundaries

  • Uses heuristics, not perfect understanding
  • May split mid-sentence

Perfectly estimate tokens

  • Character splitters approximate
  • Use TokenTextSplitter for exact counts

Gotchas

1. Chunk Size vs Token Limits

# ❌ Character count != token count
splitter = RecursiveCharacterTextSplitter(chunk_size=4000)
# May exceed 4096 token limit!

# ✅ Use token-aware splitter
from langchain_text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(chunk_size=4000)

Fix: Use TokenTextSplitter for token precision.

2. Import from Correct Package

# ❌ OLD
from langchain.text_splitter import RecursiveCharacterTextSplitter

# ✅ NEW
from langchain_text_splitters import RecursiveCharacterTextSplitter

Fix: Use langchain-text-splitters package.

3. Zero Overlap

# ❌ No overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,  # Context lost at boundaries
)

# ✅ Use overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # 20% overlap
)

Fix: Always use 10-20% overlap.

4. Metadata Not Preserved

# ❌ splitText loses metadata
chunks = splitter.split_text(text)

# ✅ Use split_documents
docs = [Document(page_content=text, metadata={"source": "file"})]
chunks = splitter.split_documents(docs)

Fix: Use split_documents() to preserve metadata.

Links and Resources

Official Documentation

Package Installation

pip install langchain-text-splitters

# For semantic chunker
pip install langchain-experimental