You've built the pipeline: chunking, embedding, retrieval, generation. It works on your test cases.
Then it breaks in production - and you don't know why.
This post is about how RAG systems fail, and how to identify which part failed without guessing.
Step Zero: Was It Retrieval, Assembly, or Generation?
Most RAG debugging wastes time because people debug the wrong component.
Before touching anything, answer this:
Did retrieval return the right chunks?
- If no, it's a retrieval failure.
- If yes, keep going.
Next question:
Did you assemble the context in a way the model can actually use?
- If no, it's a context assembly failure.
- If yes, it's a generation failure (prompt/model behavior).
Same symptom ("bad answer"), different root cause.
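The decision flow above can be codified as a tiny triage helper. A minimal sketch; the two boolean inputs stand for the answers to the questions above, and all names are illustrative:

```python
# Triage sketch: classify a bad answer by the two questions from the flow
# above. Inputs are the (manually checked) answers to those questions.

def triage(right_chunks_retrieved: bool, context_usable: bool) -> str:
    """Return which RAG component to debug first."""
    if not right_chunks_retrieved:
        return "retrieval"
    if not context_usable:
        return "context_assembly"
    return "generation"

print(triage(False, True))   # retrieval failure
print(triage(True, False))   # context assembly failure
print(triage(True, True))    # generation failure
```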
Retrieval Failures
In practice, a retrieval failure means: the information exists in your corpus, but it doesn't reach the LLM in the retrieved context.
1) Missing chunks (nothing relevant retrieved)
Symptoms
- retrieved chunks are off-topic
- the answer becomes generic, or the system refuses
Common causes
- query uses different vocabulary than the docs (semantic mismatch)
- embedding model isn't suited to your domain
- chunks are too large (relevance diluted) or too small (context fragmented)
- indexing didn't include what you think it included
Fixes
- add hybrid search (BM25 + vectors)
- add query rewriting / expansion for messy user queries
- revisit chunking and doc cleaning
- sanity-check the index contents (spot-check real chunks)
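One common way to implement the hybrid-search fix is reciprocal rank fusion (RRF) over the two ranked lists. A minimal sketch, assuming you already have chunk-id rankings from BM25 and from vector search; `k=60` is the conventional RRF constant:

```python
# Reciprocal rank fusion: combine a BM25 ranking and a vector ranking
# into one list. A chunk that ranks well in either list rises to the top.

def rrf_fuse(bm25_ranking, vector_ranking, k=60):
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["c3", "c1", "c7"]
vec = ["c1", "c9", "c3"]
print(rrf_fuse(bm25, vec))  # c1 and c3 rank highest: each appears in both lists
```

The advantage of rank-based fusion is that it sidesteps the problem of BM25 scores and cosine similarities living on incompatible scales.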
2) Wrong chunks ranked highly (topically related, not answer-relevant)
Symptoms
- chunks look "close" but don't contain the key detail
- you keep seeing adjacent sections instead of the exact section
Common causes
- chunks mix multiple concepts (topic soup)
- semantic similarity captures theme, not specificity
- top_k is too small and the right chunk sits just below the cutoff
Fixes
- add a reranker (retrieve more, then rank precisely)
- improve chunking to separate distinct topics
- use metadata filters to narrow the candidate set
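The "retrieve wide, rerank precisely" pattern looks roughly like this. A real reranker would be a cross-encoder model; here a toy lexical-overlap scorer stands in so the control flow is runnable:

```python
# Retrieve more candidates than you need, then rerank with a more precise
# (and more expensive) scorer. The overlap scorer below is a toy stand-in
# for a cross-encoder reranker.

def toy_rerank(query, candidates, top_k=3):
    q_terms = set(query.lower().split())

    def score(chunk):
        return len(q_terms & set(chunk.lower().split()))

    return sorted(candidates, key=score, reverse=True)[:top_k]

candidates = [
    "refund policy applies within 30 days of purchase",
    "shipping times vary by region",
    "our refund process takes 5 business days",
]
print(toy_rerank("how long does a refund take", candidates, top_k=2))
```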
3) Information split across chunks (context fracture)
Symptoms
- partial answers that need adjacent context
- orphaned references ("this", "it", "the above") with no referent
Common causes
- chunk boundaries cut logical units
- overlap is too low (or zero)
- tables/procedures got split
Fixes
- increase overlap selectively (then dedupe before prompting)
- use structure-aware chunking (headers, clauses, steps)
- keep tables and step-by-step procedures intact
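Structure-aware chunking in its simplest form splits on document headers so each chunk is one logical unit, instead of cutting every N characters. A minimal sketch for markdown-style docs:

```python
# Split a document on markdown headers so chunk boundaries follow the
# document's own structure rather than an arbitrary character count.

def chunk_by_headers(text):
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Refunds\nRefunds take 5 days.\n# Shipping\nShips in 2 days."
for chunk in chunk_by_headers(doc):
    print(chunk)
    print("---")
```

A production version would also cap chunk size and keep tables or numbered procedures inside a single chunk.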
Context Assembly Failures
This category is easy to miss:
the right chunks were retrieved, but assembled badly.
That's not retrieval failure (wrong chunks) and not generation failure (model ignores good context). It's the glue layer.
1) Poor ordering
Chunks concatenated by document order instead of relevance. Key information ends up buried where models are less reliable.
Fix
- order chunks by relevance score (or reranker score), not by source position
2) Too much context
More context isn't "more correct". Noise dilutes signal and competes for attention.
Fix
- reduce to the smallest set of chunks that actually answer the question
- prefer "top 3 excellent chunks" over "top 20 maybe-related chunks"
3) Unusable formatting
If the model can't parse the context, it can't use it.
Fix
- use clear separators between chunks
- label sources consistently
- keep each chunk readable (avoid broken extraction)
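The three assembly fixes combine naturally into one small step: order by relevance, keep only the top few, and label each chunk clearly. A sketch; the `(text, score)` tuples and `[Source n]` labels are illustrative conventions, not a standard:

```python
# Assemble a prompt context from scored chunks: sort by relevance score,
# truncate to the best few, and separate/label chunks so the model can
# parse them.

def assemble_context(scored_chunks, max_chunks=3):
    ranked = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:max_chunks]
    parts = []
    for i, (text, _score) in enumerate(ranked, start=1):
        parts.append(f"[Source {i}]\n{text}")
    return "\n\n---\n\n".join(parts)

chunks = [("shipping info", 0.41), ("refund policy", 0.93), ("faq boilerplate", 0.12)]
print(assemble_context(chunks, max_chunks=2))
```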
Generation Failures
Generation failures mean: the LLM had usable context, but still produced the wrong output.
1) Ignoring retrieved context
Symptoms
- answer contradicts the chunks
- answer is generic despite specific context
Common causes
- weak grounding contract ("use the context" but no enforcement)
- context is long and unstructured
- the model falls back to pretraining priors
Fixes
- strengthen the contract: "Answer ONLY from the provided context"
- format context as numbered sources
- reduce temperature; reduce context length; improve ordering
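A grounding contract is ultimately just prompt text. One possible wording, shown as a template; treat it as a starting point to iterate on, not a guaranteed fix:

```python
# Example grounding-contract prompt template. The exact wording is an
# illustration; tune it against your own failure cases.

GROUNDED_PROMPT = """Answer ONLY from the numbered sources below.
If the answer is not in the sources, reply: "I don't know."
Cite sources as [1], [2], ... after each claim.

Sources:
{sources}

Question: {question}
Answer:"""

prompt = GROUNDED_PROMPT.format(
    sources="[1] Refunds take 5 business days.",
    question="How long do refunds take?",
)
print(prompt)
```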
2) Hallucinating despite correct context
Symptoms
- claims appear that are not supported by any chunk
- the model "fills gaps" with plausible details
Common causes
- partial context invites completion
- context contains noise or ambiguity
- the prompt allows synthesis without constraints
Fixes
- enforce refusal: "If it's not in the context, say you don't know"
- require citations for each claim (forces mapping)
- rerank harder; trim context; remove boilerplate noise
3) Wrong format or incomplete synthesis
Symptoms
- technically correct but unusable (missing citations, wrong structure)
- incomplete answer (only covers one part of the question)
Fixes
- specify output format explicitly
- validate output (schema checks, citation checks)
- split tasks: "extract facts" → "compose answer" (two-step prompts)
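Citation validation can be surprisingly cheap. A sketch of a format-level check that every sentence carries at least one `[n]` citation pointing at a real source; note this checks citation *presence*, not whether the claim is actually supported:

```python
import re

# Flag sentences in the answer that lack a valid [n] citation.
# This is a format check, not a faithfulness check.

def missing_citations(answer, num_sources):
    problems = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cites = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        if not cites or any(n < 1 or n > num_sources for n in cites):
            problems.append(sentence)
    return problems

good = "Refunds take 5 days [1]. Shipping is free over $50 [2]."
bad = "Refunds take 5 days [1]. Shipping is always free."
print(missing_citations(good, 2))  # []
print(missing_citations(bad, 2))   # ['Shipping is always free.']
```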
Position Effects: Lost in the Middle
Even when you do everything "right", long contexts introduce a specific failure mode:
models don't use all context uniformly.
As context grows, performance can become position-sensitive: information in the middle can be easier to miss than information near the start or end.
Why this matters for RAG
If your best chunk lands in the middle of a long assembled context, the model may underuse it - even if it's present.
Mitigations
- order chunks by relevance (not document order)
- keep contexts short and high-signal
- place the most important chunk first
- for synthesis-heavy questions, extract a short "supported facts" list (with citations) before composing the final answer
The Grounding vs Summarization Tension
RAG often asks for two conflicting behaviors:
- Grounding: stick strictly to retrieved content
- Summarization: synthesize across multiple sources
Summarization invites interpolation. Interpolation easily becomes hallucination.
How to manage it
- require citations for claims (not only for the final paragraph)
- be explicit about allowed synthesis:
  - "combine sources" vs "only report what is explicitly stated"
- for high-stakes answers, separate the steps:
  - extract supported facts with citations
  - generate the final response from that fact list
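The two-step split can be wired up as two chained calls. A sketch under an assumed `call_llm(prompt) -> str` interface that you would replace with your actual model client; the prompts and the `NONE` sentinel are illustrative:

```python
# Two-step grounded answering: first extract supported facts (with
# citations), then compose the final answer from that fact list only.
# `call_llm` is an assumed interface, stubbed below for illustration.

EXTRACT_PROMPT = ("List facts from the sources that answer the question. "
                  "Cite each as [n]. If none apply, output NONE.")
COMPOSE_PROMPT = ("Write the final answer using ONLY these facts, "
                  "keeping their citations.")

def answer_two_step(call_llm, sources, question):
    facts = call_llm(f"{EXTRACT_PROMPT}\n\nSources:\n{sources}\n\nQuestion: {question}")
    if facts.strip() == "NONE":
        return "I don't know."
    return call_llm(f"{COMPOSE_PROMPT}\n\nFacts:\n{facts}")

# Stub model for illustration only:
def fake_llm(prompt):
    return "NONE" if "Sources:\n(empty)" in prompt else "Refunds take 5 days [1]."

print(answer_two_step(fake_llm, "(empty)", "How long do refunds take?"))
```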
A Debugging Framework (Order Matters)
Step 1: Check retrieval first
- Are the right chunks in the top-k?
- Is the answer missing entirely?
- Is the answer split across chunks?
If retrieval is broken, stop. Fix retrieval.
Step 2: Check context assembly
- Are chunks ordered by relevance?
- Is there too much noise?
- Is formatting parseable?
- Is the key chunk buried?
If assembly is broken, fix assembly.
Step 3: Check generation last
- Is the model obeying the grounding contract?
- Are there unsupported claims?
- Is format/citation behavior correct?
Only now tune prompts/models.
What to Log Per Query (So Debugging Is Real)
| Field | Why it matters |
|---|---|
| Query (raw + normalized) | lets you reproduce failures |
| Top-k results (doc_id, chunk_id, offsets) | shows whether retrieval is sane |
| Scores (vector + rerank if present) | reveals ranking problems |
| Final context order | catches position effects |
| Output + citations | lets you trace claims |
| "Unsupported claim" flags | shows hallucination leakage |
Try This Yourself
Pick one failure your system produces.
- Log the retrieved chunks.
- Manually check: is the answer in those chunks?
  - If yes → generation/assembly failure. Fix contract, ordering, formatting.
  - If no → retrieval failure. Fix chunking, cleaning, search strategy.
- Retest the exact same query until it's stable.
That exercise teaches you more than 10 tutorials.
Key Takeaways
- Diagnose retrieval vs assembly vs generation before changing anything
- Retrieval failures often mean the LLM didn't have a chance - fix retrieval first
- Assembly failures are "glue bugs": right chunks, wrong ordering/format/length
- Generation failures are contract/behavior issues: enforce grounding and citations
- Long context introduces position effects - shorter, better-ordered context wins
Key Terms
- Retrieval failure: relevant info exists but isn't retrieved
- Context assembly failure: right chunks retrieved but presented badly
- Generation failure: right chunks present and usable, but output is wrong
- Lost in the middle: position sensitivity in long contexts
- Grounding: constraining output to retrieved context
Further Reading
- Long-context position effects (“Lost in the Middle”, Liu et al., 2023): https://arxiv.org/abs/2307.03172
What's Next
Now you know how RAG breaks. Next question:
How do you know your system is actually working?
In the next post, Evaluation for LLM Apps, we'll cover why "it looks good" isn't evaluation - and how to actually measure RAG and LLM system quality.