In Post 5, you saw the uncomfortable ordering: retrieval happens before generation.
So if retrieval fails, the LLM never even sees the information it needs.
Now the question: what decides what retrieval can possibly return?
Most of the time, it's chunking.
Why Chunking Matters More Than You Think
Here's the constraint people ignore:
Retrieval returns chunks - not documents.
So anything you split apart becomes harder to retrieve as a single idea. Anything you mix together becomes harder to retrieve precisely.
Your embedding model can't "fix" missing context. It can only embed what you give it.
The Core Tradeoffs
Every chunking strategy is balancing three things:
- Granularity: do you want precise matches, or broader context?
- Coherence: does a chunk contain a complete thought (or half a sentence)?
- Cost: more chunks means more storage, more indexing time, and more retrieval candidates.
You're trading off precision vs context vs cost.
Chunk Size as Downstream Constraint
Chunk size doesn't exist in isolation. It interacts with:
- Embedding model max input - chunks can't exceed this
- Retriever top_k - more chunks retrieved = more chances to hit, but more noise
- Reranker cost - reranking 50 chunks is expensive
- Final prompt budget - retrieved chunks compete for context window space
Smaller chunks give you precision but require higher top_k to cover the answer. Larger chunks give you context but may dilute relevance scores.
Strategy 1: Fixed-Size Chunking (Baseline)
Split text every N tokens (or characters), regardless of meaning.
"N tokens" depends on the tokenizer and model; "N characters" is easier to implement but less aligned with model limits.
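To make the baseline concrete, here's a minimal character-based sketch (a hypothetical helper, not a library API). Token-based splitting would need a tokenizer matched to your embedding model; the optional overlap parameter repeats a slice between adjacent chunks.

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 0) -> list[str]:
    """Split text every `size` characters, optionally repeating
    `overlap` characters between adjacent chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Note that nothing here looks at meaning: a chunk boundary can land mid-sentence, mid-table, anywhere.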
Pros
- Simple and predictable
- Works on any text
- Good baseline for early prototypes
Cons
- Splits mid-sentence
- Breaks tables/lists
- Separates definitions from the thing being defined
When it's fine
- Logs, transcripts, uniform text
- Small corpora where "good enough" is acceptable
- You're still validating the rest of the pipeline
Where it breaks first
- Anything with structure (docs, wikis, policies, papers)
Strategy 2: Structure-Aware Recursive Chunking (Default)
If you had to pick one general-purpose approach, this is usually the best starting point:
- Try to split by bigger boundaries first (sections / paragraphs)
- If chunks are still too large, split smaller (sentences / words)
- Only fall back to character splitting as a last resort
That's the intuition behind "recursive" splitters: respect structure when it exists.
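The intuition above can be sketched as a toy recursive splitter: try coarse separators first, and recurse into finer ones only when a piece is still too large. The separator list is an assumption; you'd adjust it per format (markdown vs HTML vs plain text).

```python
def recursive_split(text: str, max_len: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= max_len:
            buffer = candidate
        else:
            if buffer:
                chunks.append(buffer)
            if len(piece) > max_len:
                # Piece itself is too big: recurse with finer separators.
                chunks.extend(recursive_split(piece, max_len, rest))
                buffer = ""
            else:
                buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks
```

Production splitters (e.g. LangChain's RecursiveCharacterTextSplitter) follow the same idea with more bookkeeping.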
Pros
- Produces more readable chunks
- Reduces mid-thought breaks
- Works across mixed doc types
Cons
- Still not truly "semantic" (it uses structure, not meaning)
- Needs different separators depending on format (markdown vs HTML vs plain text)
Strategy 3: Header-Based Chunking (Markdown/Docs)
If your docs have headings, use them.
A header is free metadata: it tells you "what this chunk is about." If you split without preserving headers, you lose the best retrieval anchor you had.
Do this
- Split by headers first
- Keep the header text attached to the content it describes
- Optionally prepend a "path" like:
Product → Billing → Refunds
Common failure
- Header in one chunk, body in the next → retrieval returns a title with no details.
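A sketch of the "do this" list above, assuming markdown-style `#` headers: split on headers and attach the full parent path to each body chunk. This is illustrative, not a library API.

```python
import re

def split_by_headers(markdown: str) -> list[dict]:
    """Return chunks with the header 'path' attached to each body."""
    chunks, path, body = [], [], []

    def flush():
        if body and any(line.strip() for line in body):
            chunks.append({
                "path": " → ".join(path),
                "text": "\n".join(body).strip(),
            })
        body.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#+)\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            # Truncate the path to the parent level, then append this header.
            path[:] = path[:level - 1] + [m.group(2).strip()]
        else:
            body.append(line)
    flush()
    return chunks
```

Because the path travels with the body, retrieval never returns a title with no details (or details with no title).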
Strategy 4: Code-Aware Chunking (Repos)
For code, "N tokens" is the wrong primitive.
Split by:
- class / function boundaries
- file boundaries + local context (imports, docstrings)
- logical blocks (configs, schemas)
If you split code like prose, you get chunks that compile in nobody's head.
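For Python source specifically, the standard-library ast module gives you those boundaries for free. A sketch (assumptions: top-level definitions only, and module imports prepended to each chunk as local context):

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """One chunk per top-level function/class, with module imports attached."""
    tree = ast.parse(source)
    lines = source.splitlines()
    imports = [
        "\n".join(lines[n.lineno - 1:n.end_lineno])
        for n in tree.body
        if isinstance(n, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join(imports)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            if node.decorator_list:  # include decorators above the def line
                start = node.decorator_list[0].lineno
            body = "\n".join(lines[start - 1:node.end_lineno])
            chunks.append(f"{header}\n\n{body}".strip())
    return chunks
```

Other languages need their own parsers (tree-sitter is a common choice), but the principle is the same: split where the language splits.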
Strategy 5: Semantic Chunking (Expensive, Sometimes Worth It)
Semantic chunking tries to split where meaning changes, not where formatting changes.
A common pattern:
- split into sentences
- embed each sentence (or small window)
- compute similarity between adjacent sentences
- cut when similarity drops sharply
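The four steps above, as a sketch. `embed` is a stand-in for your real embedding model, and the threshold is something you'd tune, not a standard value.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, threshold=0.5):
    """Group sentences into chunks; cut where neighbour similarity drops."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Note the cost baked in: every sentence gets embedded at indexing time, before a single query arrives.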
The appeal
- Chunks tend to contain one coherent "topic"
- Less mixing of unrelated ideas
The catch
- It's expensive: you're embedding at chunking time, not only retrieval time
- It's harder to debug
- Benefits are inconsistent across tasks
Semantic chunking gets oversold. Treat it as an optimization you earn, not a default you assume.
If you already have reranking + hybrid search, semantic chunking is usually not your first lever.
Overlap: The Boundary Insurance
Overlap means repeating a small slice of text between adjacent chunks.
Why it helps:
- prevents "definition in chunk A, term usage in chunk B"
- protects against boundary splits
- improves recall for boundary-adjacent questions
Why it hurts:
- increases index size
- increases duplicate retrieval
- can waste context window budget unless you dedupe
Practical rule: use some overlap if you see boundary failures. Otherwise keep it minimal and measure.
Handling duplicates: dedupe by chunk id / source+offset, or by exact text match before prompting. Overlap without dedupe wastes context budget.
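A minimal dedupe sketch, assuming each retrieved chunk carries `source`, `offset`, and `text` fields (that schema is an assumption, adapt it to yours). It filters by stable id first, then by exact text, preserving retrieval order.

```python
def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Drop duplicate chunks by (source, offset) id or exact text match."""
    seen_ids, seen_texts, result = set(), set(), []
    for chunk in chunks:
        cid = (chunk["source"], chunk["offset"])
        text = chunk["text"]
        if cid in seen_ids or text in seen_texts:
            continue
        seen_ids.add(cid)
        seen_texts.add(text)
        result.append(chunk)
    return result
```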
Special Cases That Break Naive Chunking
Tables
Tables don't embed well when they're cut in half.
Options:
- keep small tables intact
- convert tables to "row-per-line" text
- store table structure separately and retrieve by metadata
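The "row-per-line" option can be as simple as repeating the column headers in every row, so each line embeds and retrieves as a self-contained fact:

```python
def table_to_lines(headers: list[str], rows: list[list[str]]) -> list[str]:
    """Turn each table row into one self-contained 'header: value' line."""
    return [
        "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        for row in rows
    ]
```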
PDFs
PDF text extraction often destroys structure (columns, headers/footers, page breaks). If extraction is messy, chunking can't rescue it.
Rule: fix extraction before you tune chunking.
Boilerplate / Repeated Headers
PDFs and docs often have headers/footers/page numbers repeated on every page. If you don't remove them before chunking, they pollute embeddings and dilute relevance.
Lists and Procedures
Procedures ("Step 1… Step 2…") are brittle. If you split steps apart, retrieval returns an incomplete procedure and the LLM fills gaps.
Common Chunking Failures (Real Symptoms)
- Mid-sentence chunks → retrieved text reads like it starts mid-breath
- Orphaned references ("this", "it", "the above") → chunk is technically "relevant" but unusable
- Definition separated from usage → retrieved chunk mentions a term but not its meaning
- Header-body separation → retrieval finds a title, not the explanation
- Mixed topics → one chunk contains three concepts, retrieval pulls noise with the signal
The "Right Chunk" Test
A chunk is good if it passes:
- Standalone readable - no "this/it/above" without referent
- Key term + definition nearby - the chunk contains what it references
- Stable anchor - header/path/source metadata attached
If a chunk fails any of these, retrieval might return it, but the LLM can't use it.
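You can automate a rough version of this test. The regex below is a heuristic of my own, not a standard, and it only approximates the first and third checks; tune it to your corpus.

```python
import re

# Chunks that open with a dangling referent usually fail "standalone readable".
ORPHAN_OPENERS = re.compile(r"^\s*(this|it|the above|these|those)\b", re.IGNORECASE)

def chunk_smells_ok(chunk: dict) -> bool:
    """Heuristic pass/fail: standalone opening + a stable anchor attached."""
    text = chunk.get("text", "")
    standalone = not ORPHAN_OPENERS.match(text)
    has_anchor = bool(chunk.get("path") or chunk.get("source"))
    return standalone and has_anchor
```

Run it over a sample of your index: a high failure rate is a chunking bug, not a retrieval bug.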
Debug Checklist: If Retrieval Feels Dumb
- Print the retrieved chunks (don't guess)
- Check if chunks are human-readable and self-contained
- Check boundary damage (sentences, steps, tables, headers)
- Check format mismatch (markdown treated like plain text, code treated like prose)
- Try one alternative chunker on the same doc and compare retrieval side-by-side
- Only then start touching embeddings, rerankers, or prompts
Try This Yourself
Take one document (2–5 pages) that you actually care about.
Chunk it three ways:
- fixed-size, no overlap
- fixed-size, with overlap
- structure-aware recursive (paragraphs → sentences)
Then:
- index all three versions
- ask the same 5 questions
- for each question, log retrieved chunk ids + source offsets, then inspect
You're looking for one thing:
Which chunking strategy most often returns a chunk that contains the answer and enough context to use it?
Key Takeaways
- Retrieval returns chunks, so chunking defines what retrieval can return.
- Fixed-size chunking is a baseline - simple, but structure-blind.
- Structure-aware recursive chunking is the safest default for mixed documents.
- For markdown, headers are retrieval anchors - keep them attached to content.
- Semantic chunking can help, but it costs more and the wins aren't guaranteed.
- Chunking bugs look like "retrieval is dumb" - print chunks before changing anything else.
Key Terms
- Chunking: splitting documents into smaller units for embedding + retrieval.
- Chunk overlap: repeated content between adjacent chunks to reduce boundary loss.
- Recursive chunking: hierarchical splitting using progressively smaller separators.
- Semantic chunking: splitting based on meaning shifts (often via embedding similarity).
- Orphaned context: a chunk that refers to missing surrounding information.
What’s Next
Now you can create chunks that can be retrieved. Next question:
Where should they live, and when do you actually need a vector database?
In the next post, Vector DBs vs Plain Indexes, we'll compare:
- dedicated vector DBs
- pgvector / Postgres
- plain indexes + hybrid search
…and how to choose based on scale and constraints.