LLM Fundamentals • Part 8

RAG Failure Modes

You've built the pipeline: chunking, embedding, retrieval, generation. It works on your test cases.

Then it breaks in production - and you don't know why.

This post is about how RAG systems fail, and how to identify which part failed without guessing.


Step Zero: Was It Retrieval, Assembly, or Generation?

Most RAG debugging wastes time because people debug the wrong component.

Before touching anything, answer this:

Did retrieval return the right chunks?

  • If no, it's a retrieval failure.
  • If yes, keep going.

Next question:

Did you assemble the context in a way the model can actually use?

  • If no, it's a context assembly failure.
  • If yes, it's a generation failure (prompt/model behavior).

Same symptom ("bad answer"), different root cause.
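
The decision tree above can be sketched as a first-pass triage check. The substring heuristic (and the `triage` name) is an illustrative stand-in for whatever "is the answer actually in the retrieved chunks?" check fits your domain:

```python
def triage(expected_facts, retrieved_chunks):
    """Rough first-pass triage: did the facts the answer needs reach the model?

    expected_facts: strings a correct answer must contain
    retrieved_chunks: the chunk texts that went into the prompt
    """
    corpus = " ".join(retrieved_chunks).lower()
    missing = [f for f in expected_facts if f.lower() not in corpus]
    if missing:
        return "retrieval failure", missing        # facts never reached the model
    return "assembly or generation failure", []    # facts were present; debug downstream

# Example: the refund window never made it into the context
verdict, missing = triage(
    ["30-day refund"],
    ["Shipping takes 5-7 business days.", "Returns require a receipt."],
)
# verdict == "retrieval failure"
```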


Retrieval Failures

In practice, a retrieval failure means: the information exists in your corpus, but it doesn't reach the LLM in the retrieved context.

1) Missing chunks (nothing relevant retrieved)

Symptoms

  • retrieved chunks are off-topic
  • the answer becomes generic, or the system refuses

Common causes

  • query uses different vocabulary than the docs (semantic mismatch)
  • embedding model isn't suited to your domain
  • chunks are too large (relevance diluted) or too small (context fragmented)
  • indexing didn't include what you think it included

Fixes

  • add hybrid search (BM25 + vectors)
  • add query rewriting / expansion for messy user queries
  • revisit chunking and doc cleaning
  • sanity-check the index contents (spot-check real chunks)
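
One common way to combine BM25 and vector results is reciprocal rank fusion, sketched below. The BM25 and vector rankings themselves come from your search stack; this only shows the fusion step:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists (e.g. BM25 and vector search) into one.

    rankings: lists of doc ids, each ordered best-first
    k: damping constant; 60 is the conventional default
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_c", "doc_b"]
vector_hits = ["doc_b", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# doc_a ranks first: it scores well on both lists
```

A document that appears high in both rankings beats one that tops a single list, which is exactly the behavior you want when the two retrievers disagree.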

2) Near-miss chunks (close but not exact)

Symptoms

  • chunks look "close" but don't contain the key detail
  • you keep seeing adjacent sections instead of the exact section

Common causes

  • chunks mix multiple concepts (topic soup)
  • semantic similarity captures theme, not specificity
  • top_k is too small and the right chunk is slightly below the cutoff

Fixes

  • add a reranker (retrieve more, then rank precisely)
  • improve chunking to separate distinct topics
  • use metadata filters to narrow the candidate set
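
The "retrieve more, then rank precisely" pattern looks like this. The token-overlap scorer is a toy stand-in for a real cross-encoder reranker; the shape of the pipeline is the point:

```python
def rerank(query, candidates, top_n=3):
    """Retrieve wide, rank narrow: score each candidate against the query,
    keep only the best few. Replace `score` with a real reranker model."""
    q_tokens = set(query.lower().split())

    def score(chunk):
        c_tokens = set(chunk.lower().split())
        return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

    return sorted(candidates, key=score, reverse=True)[:top_n]
```

In practice you would fetch a generous candidate set (say top 20 by vector similarity), then let the reranker pick the top 3 that actually contain the key detail.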

3) Information split across chunks (context fracture)

Symptoms

  • partial answers that need adjacent context
  • orphaned references ("this", "it", "the above") with no referent

Common causes

  • chunk boundaries cut logical units
  • overlap is too low (or zero)
  • tables/procedures got split

Fixes

  • increase overlap selectively (then dedupe before prompting)
  • use structure-aware chunking (headers, clauses, steps)
  • keep tables and step-by-step procedures intact
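
A minimal sketch of structure-aware chunking for markdown docs: split on headers so each chunk is one logical section, instead of cutting at fixed character offsets that can bisect tables or step lists:

```python
def chunk_by_headers(markdown_text):
    """Split a markdown document into one chunk per header-delimited section."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Real documents need more care (nested headers, code blocks containing `#`), but even this crude version keeps a procedure and its heading in the same chunk.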

Context Assembly Failures

This category is easy to miss:

the right chunks were retrieved, but assembled badly.

That's not retrieval failure (wrong chunks) and not generation failure (model ignores good context). It's the glue layer.

1) Poor ordering

Chunks concatenated by document order instead of relevance. Key information ends up buried where models are less reliable.

Fix

  • order chunks by relevance score (or reranker score), not by source position

2) Too much context

More context isn't "more correct". Noise dilutes signal and competes for attention.

Fix

  • reduce to the smallest set of chunks that actually answer the question
  • prefer "top 3 excellent chunks" over "top 20 maybe-related chunks"

3) Unusable formatting

If the model can't parse the context, it can't use it.

Fix

  • use clear separators between chunks
  • label sources consistently
  • keep each chunk readable (avoid broken extraction)
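
All three assembly fixes can live in one small function: order by relevance, trim to the strongest chunks, and label each one with clear separators. The tuple layout and labels here are illustrative:

```python
def assemble_context(scored_chunks, max_chunks=3):
    """Assembly sketch: sort by relevance score, keep the best few chunks,
    and label each so the model (and a citation checker) can refer back.

    scored_chunks: list of (score, source_label, text) tuples
    """
    best = sorted(scored_chunks, key=lambda t: t[0], reverse=True)[:max_chunks]
    parts = [f"[{i + 1}] ({src})\n{text}" for i, (_, src, text) in enumerate(best)]
    return "\n\n---\n\n".join(parts)  # clear separators between chunks
```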

Generation Failures

Generation failures mean: the LLM had usable context, but still produced the wrong output.

1) Ignoring retrieved context

Symptoms

  • answer contradicts the chunks
  • answer is generic despite specific context

Common causes

  • weak grounding contract ("use the context" but no enforcement)
  • context is long and unstructured
  • the model falls back to pretraining priors

Fixes

  • strengthen the contract: "Answer ONLY from the provided context"
  • format context as numbered sources
  • reduce temperature; reduce context length; improve ordering
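
Those three fixes combined might look like the template below. The exact wording is illustrative, not a canonical prompt; the load-bearing parts are the explicit contract, the numbered sources, and the refusal instruction:

```python
def grounded_prompt(question, sources):
    """Build a prompt that enforces a grounding contract over numbered sources."""
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer ONLY from the numbered sources below. "
        "Cite sources like [1]. If the answer is not in the sources, "
        "say you don't know.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )
```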

2) Hallucinating despite correct context

Symptoms

  • claims appear that are not supported by any chunk
  • the model "fills gaps" with plausible details

Common causes

  • partial context invites completion
  • context contains noise or ambiguity
  • the prompt allows synthesis without constraints

Fixes

  • enforce refusal: "If it's not in the context, say you don't know"
  • require citations for each claim (forces mapping)
  • rerank harder; trim context; remove boilerplate noise

3) Wrong format or incomplete synthesis

Symptoms

  • technically correct but unusable (missing citations, wrong structure)
  • incomplete answer (only covers one part of the question)

Fixes

  • specify output format explicitly
  • validate output (schema checks, citation checks)
  • split tasks: "extract facts" → "compose answer" (two-step prompts)
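
A citation check is cheap to automate. This sketch verifies that the answer cites something and that every citation maps to a real source; it does not verify the claims themselves:

```python
import re

def validate_answer(answer, num_sources):
    """Return a list of problems with the answer's citations (empty = passes)."""
    problems = []
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    if not cited:
        problems.append("no citations at all")
    for c in cited:
        if not (1 <= c <= num_sources):
            problems.append(f"citation [{c}] has no matching source")
    return problems
```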

Position Effects: Lost in the Middle

Even when you do everything "right", long contexts introduce a specific failure mode:

models don't use all context uniformly.

As context grows, performance can become position-sensitive: information in the middle can be easier to miss than information near the start or end.

Why this matters for RAG

If your best chunk lands in the middle of a long assembled context, the model may underuse it - even if it's present.

Mitigations

  • order chunks by relevance (not document order)
  • keep contexts short and high-signal
  • place the most important chunk first
  • for synthesis-heavy questions, extract a short "supported facts" list (with citations) before composing the final answer

The Grounding vs Summarization Tension

RAG often asks for two conflicting behaviors:

  1. Grounding: stick strictly to retrieved content
  2. Summarization: synthesize across multiple sources

Summarization invites interpolation. Interpolation easily becomes hallucination.

How to manage it

  • require citations for claims (not only for the final paragraph)
  • be explicit about allowed synthesis:
    • "combine sources" vs "only report what is explicitly stated"
  • for high-stakes answers, separate steps:
    1. extract supported facts with citations
    2. generate the final response from that fact list
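
The two-step separation can be wired up like this. `llm` is any callable that takes a prompt string and returns text (hypothetical here), and the prompt wording is illustrative:

```python
EXTRACT_PROMPT = (
    "List only facts that are explicitly stated in the sources below, "
    "one per line, each ending with its citation like [2]. "
    "Do not infer or combine.\n\nSources:\n{sources}\n\nQuestion: {question}"
)

COMPOSE_PROMPT = (
    "Write the final answer using ONLY these supported facts. "
    "Keep the citations.\n\nFacts:\n{facts}\n\nQuestion: {question}"
)

def answer_two_step(llm, question, sources):
    """Extract cited facts first, then compose the answer from that fact list."""
    facts = llm(EXTRACT_PROMPT.format(sources=sources, question=question))
    return llm(COMPOSE_PROMPT.format(facts=facts, question=question))
```

The extraction step gives you an inspectable intermediate artifact: if a claim in the final answer has no counterpart in the fact list, you know the compose step interpolated it.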

A Debugging Framework (Order Matters)

Step 1: Check retrieval first

  • Are the right chunks in the top-k?
  • Is the answer missing entirely?
  • Is the answer split across chunks?

If retrieval is broken, stop. Fix retrieval.

Step 2: Check context assembly

  • Are chunks ordered by relevance?
  • Is there too much noise?
  • Is formatting parseable?
  • Is the key chunk buried?

If assembly is broken, fix assembly.

Step 3: Check generation last

  • Is the model obeying the grounding contract?
  • Are there unsupported claims?
  • Is format/citation behavior correct?

Only now tune prompts/models.


What to Log Per Query (So Debugging Is Real)

What to log, and why it matters:

  • Query (raw + normalized): lets you reproduce failures
  • Top-k results (doc_id, chunk_id, offsets): shows whether retrieval is sane
  • Scores (vector + rerank if present): reveals ranking problems
  • Final context order: catches position effects
  • Output + citations: lets you trace claims
  • "Unsupported claim" flags: shows hallucination leakage
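
One way to keep these fields together is a per-query record serialized to JSON lines. The field names below mirror the list above but are illustrative, not a standard schema:

```python
import dataclasses
import json

@dataclasses.dataclass
class QueryLog:
    """Per-query log record for RAG debugging."""
    query_raw: str
    query_normalized: str
    top_k: list            # (doc_id, chunk_id, offset) tuples
    scores: list           # vector and/or rerank scores
    context_order: list    # chunk ids in final prompt order
    output: str
    citations: list
    unsupported_claims: list

def to_jsonl_line(record):
    """Serialize one record as a single JSON line, ready to append to a log file."""
    return json.dumps(dataclasses.asdict(record))
```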

Try This Yourself

Pick one failure your system produces.

  1. Log the retrieved chunks.
  2. Manually check: is the answer in those chunks?
  3. If yes → generation/assembly failure. Fix contract, ordering, formatting.
  4. If no → retrieval failure. Fix chunking, cleaning, search strategy.
  5. Retest the exact same query until it's stable.

That exercise teaches you more than 10 tutorials.


Key Takeaways

  1. Diagnose retrieval vs assembly vs generation before changing anything
  2. Retrieval failures often mean the LLM didn't have a chance - fix retrieval first
  3. Assembly failures are "glue bugs": right chunks, wrong ordering/format/length
  4. Generation failures are contract/behavior issues: enforce grounding and citations
  5. Long context introduces position effects - shorter, better-ordered context wins

Key Terms

  • Retrieval failure: relevant info exists but isn't retrieved
  • Context assembly failure: right chunks retrieved but presented badly
  • Generation failure: right chunks present and usable, but output is wrong
  • Lost in the middle: position sensitivity in long contexts
  • Grounding: constraining output to retrieved context


What's Next

Now you know how RAG breaks. Next question:

How do you know your system is actually working?

In the next post, Evaluation for LLM Apps, we'll cover why "it looks good" isn't evaluation - and how to measure RAG and LLM system quality.
