Notes on text splitting strategies for preparing documents for embedding and retrieval in RAG pipelines.
- What is a Text Splitter?
- Role in RAG Pipeline
- Important Parameters
- Types of Text Splitters
- Chunk Overlap Concept
- Length-Based Splitting
- Choosing Chunk Size
A Text Splitter breaks large documents into smaller chunks before sending them to embeddings or the LLM.
Why it's needed:
- LLMs have token limits — large documents don't fit in one call
- Better embedding quality — smaller chunks embed more precisely
- Better retrieval in RAG — targeted chunks improve relevance
- Prevent context loss — overlapping ensures no information is cut off
Raw Docs → Text Splitter → Chunks → Embeddings → Vector DB → Retriever → LLM
Core idea: Better chunking → Better retrieval → Better RAG performance
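A rough end-to-end sketch of that pipeline (assuming the classic `langchain` import paths plus the `openai` and `faiss-cpu` packages and a configured API key; `docs.txt` is just a placeholder file name):

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

documents = TextLoader("docs.txt").load()                      # Raw Docs
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=100
).split_documents(documents)                                   # Text Splitter → Chunks
vector_db = FAISS.from_documents(chunks, OpenAIEmbeddings())   # Embeddings → Vector DB
retriever = vector_db.as_retriever()                           # Retriever (feeds the LLM)
```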
| Parameter | Description |
|---|---|
| `chunk_size` | Maximum size of each chunk (characters or tokens) |
| `chunk_overlap` | Amount of text repeated between adjacent chunks |
| `separator` | Symbol or pattern used for splitting |
CharacterTextSplitter splits based on character count. Simple and predictable, but it may cut sentences at arbitrary points.
```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)
chunks = splitter.split_documents(documents)
```
Best for: Small or simple text
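Here `documents` is assumed to be a list of LangChain `Document` objects, for example loaded from a file (the file name below is a placeholder):

```python
from langchain.document_loaders import TextLoader

documents = TextLoader("notes.txt").load()
chunks = splitter.split_documents(documents)
print(len(chunks), "chunks; first chunk length:", len(chunks[0].page_content))
```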
RecursiveCharacterTextSplitter splits hierarchically, trying separators in order: "\n\n" → "\n" → " " → "". It preserves semantic meaning better.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)
chunks = splitter.split_documents(documents)
```
Best for: RAG systems, general-purpose chunking
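The default separator order shown above can also be made explicit (or customised) via the `separators` parameter; a minimal sketch:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # paragraph → line → word → character
    chunk_size=500,
    chunk_overlap=100,
)
```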
TokenTextSplitter splits based on token count (not characters). Useful when strict token limits must be respected.
```python
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)
```
Best for: OpenAI or token-based models
MarkdownHeaderTextSplitter splits using Markdown headers, keeping document structure intact.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
```
Best for: README files, technical documentation
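A quick usage sketch (the Markdown string below is invented for illustration): `split_text` returns one chunk per header section, with the matched header values stored in each chunk's metadata.

```python
md_text = """# Intro
Text splitters prepare documents for retrieval.

## Setup
Install langchain before running the examples."""

chunks = splitter.split_text(md_text)
for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:40])
```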
| Splitter | Basis | Preserves Meaning | Use Case |
|---|---|---|---|
| `CharacterTextSplitter` | Characters | ❌ | Simple text |
| `RecursiveCharacterTextSplitter` | Hierarchy | ✅ | RAG (recommended) |
| `TokenTextSplitter` | Tokens | ❌ | Token-limited models |
| `MarkdownHeaderTextSplitter` | Headers | ✅ | Structured docs |
Chunk overlap prevents context from being cut off at chunk boundaries:
chunk_size = 500
chunk_overlap = 100
Chunk 1 → characters 0–500
Chunk 2 → characters 400–900 (100-char overlap with Chunk 1)
Chunk 3 → characters 800–1300 (100-char overlap with Chunk 2)
Why overlap matters:
- Prevents context break at boundaries
- Improves retrieval continuity
- Ensures related sentences stay accessible
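A small sketch that makes the overlap visible (the sample sentence and the tiny sizes are only for illustration): consecutive chunks share their trailing and leading words.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "Chunk overlap keeps neighboring pieces of text connected. "
    "Without it, a sentence that straddles a boundary loses half of its meaning."
)

# Tiny sizes so the overlap is easy to see in the printed chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=20)
for i, chunk in enumerate(splitter.split_text(text)):
    print(i, repr(chunk))
```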
Length-based splitting divides text purely by size (character count or token count), without considering meaning or structure.
```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separator="\n"
)
```

| Advantage | Disadvantage |
|---|---|
| Simple to implement | Ignores semantic meaning |
| Easy to control size | May break sentences |
| Good for strict limits | Not ideal for structured docs |
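To see the trade-off, a small comparison sketch (the two-paragraph sample text is invented for illustration): splitting purely on spaces may cut across the paragraph boundary, while the recursive splitter prefers to break at the blank line.

```python
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text = (
    "Length-based splitting only counts characters, so it can stop anywhere.\n\n"
    "Structure-aware splitting tries paragraph breaks first and keeps ideas together."
)

length_based = CharacterTextSplitter(separator=" ", chunk_size=80, chunk_overlap=0)
structure_aware = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=0)

print(length_based.split_text(text))      # chunks may straddle the blank line
print(structure_aware.split_text(text))   # chunks respect the paragraph boundary
```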
Token-based splitting is more accurate for LLM token limits because it measures length with the model's actual tokenizer.
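LangChain splitters can be built around a tokenizer for this; a sketch assuming the `tiktoken` package is installed and reusing the `documents` list loaded earlier:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap below are measured in tokens, not characters.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)
```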
| Model Size | Recommended Chunk Size |
|---|---|
| Small models | 300–500 tokens |
| Large models | 500–1000 tokens |
General rule: More overlap → Better retrieval, but more storage usage.
- Chunking affects retrieval quality
- `RecursiveCharacterTextSplitter` is the most commonly used
- Overlap prevents information loss at chunk boundaries
- Token splitting is useful for models with strict context limits