LLM Fundamentals • Part 5

RAG End-to-End: Query to Cited Answer

You've understood embeddings. You know text becomes vectors, and similar vectors cluster together.

Now the question: How do you use this to make LLMs actually know things?

That's what RAG (Retrieval-Augmented Generation) solves. And understanding it end-to-end is what separates "I built a chatbot" from "I understand LLM systems."


What RAG Actually Is

RAG is a pattern, not a product. The core idea:

  1. Store knowledge in a searchable form (usually embedded chunks)
  2. Retrieve relevant pieces when a query comes in
  3. Augment the prompt with that retrieved context
  4. Generate an answer grounded in the retrieved material

The LLM doesn't "learn" your data. It reads it at query time, every time.

Why this matters: RAG lets you use LLMs with private, current, or domain-specific information - without fine-tuning.

This post focuses on how RAG systems work, not on optimizing every stage. We'll cover optimizations in later posts.


The Full RAG Pipeline (Conceptual)

A production RAG system isn't only "embed → search → generate." Here's the conceptual flow:

Query → [Classify] → [Expand] → Retrieve → [Rerank] → Augment → Generate → Cite

Stages in brackets are optimizations - not required for your first RAG system, but important to understand.

Let's walk through each stage.


Stage 1: Query Classification (Optional)

Skip this for your first RAG system. Add it later when optimizing for cost and latency.

Not every query needs retrieval.

Before searching your knowledge base, ask: does this query actually require external context?

Examples that don't need retrieval:

  • "What's 2 + 2?"
  • "Explain what RAG stands for" (general knowledge)
  • Chitchat

Examples that do:

  • "What's our refund policy?"
  • "Summarize last quarter's sales report"

Query classification saves latency and cost - but it's an optimization, not a core requirement.
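A minimal classifier can be a keyword heuristic like the sketch below (the `DOMAIN_HINTS` set is hypothetical; production systems often use a small LLM call or a trained classifier instead):

```python
# Hypothetical domain vocabulary - in practice, derive this from your
# knowledge base or replace the whole heuristic with an LLM classifier.
DOMAIN_HINTS = {"refund", "policy", "invoice", "sales", "quarter", "report"}

def needs_retrieval(query: str) -> bool:
    """Return True if the query likely needs the knowledge base."""
    words = {w.strip("?.,!").lower() for w in query.split()}
    # Queries touching domain vocabulary get retrieval; chitchat
    # and general knowledge skip straight to the LLM.
    return bool(words & DOMAIN_HINTS)
```

The point is the routing decision, not the rule itself: queries classified as "no retrieval" skip the entire search stage.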


Stage 2: Query Expansion (Optional)

Skip this for your first RAG system. Add it when retrieval quality becomes the bottleneck.

User queries are often messy, ambiguous, or poorly phrased for semantic search.

Query expansion generates variations or enriched versions of the query:

  • Rewriting: Rephrase for clarity ("refund?" → "What is the refund policy?")
  • Decomposition: Break complex questions into sub-questions
  • HyDE: Generate a hypothetical answer, then search for documents similar to that answer (powerful but adds latency)

Query expansion is an optimization - not a requirement for understanding or building RAG.
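The pattern can be sketched independently of any provider by passing the LLM in as a plain string-to-string callable (the prompts below are illustrative, not tuned):

```python
from typing import Callable

def expand_query(query: str, llm: Callable[[str], str]) -> list[str]:
    """Return the original query plus LLM-generated variants.

    `llm` is any callable that takes a prompt string and returns text.
    """
    # Rewriting: clean up terse or ambiguous phrasing.
    rewrite = llm(f"Rewrite this as a clear, complete question: {query}")
    # HyDE: a hypothetical answer often embeds closer to real answers
    # than the raw question does.
    hyde = llm(f"Write a short hypothetical answer to: {query}")
    # Downstream, search with every variant and merge the results.
    return [query, rewrite, hyde]
```

Each variant is embedded and searched separately; deduplicating the merged results is the retrieval stage's job.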


Stage 3: Retrieval

This is where embeddings come in.

Basic retrieval:

  1. Embed the query
  2. Compare to embedded chunks in your vector database
  3. Return top-k similar chunks

Hybrid retrieval (increasingly common):

  • Combine vector search (semantic) with keyword search (BM25)
  • Weighted combination of scores
  • Catches cases where exact keywords matter

Many production systems use hybrid retrieval because pure semantic search has blind spots (rare terms, proper nouns, exact phrases).
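One common way to combine the two score sets is a weighted sum after min-max normalization, sketched below (reciprocal rank fusion is a popular alternative that skips normalization entirely):

```python
def hybrid_scores(vector_scores: dict, keyword_scores: dict, alpha: float = 0.5) -> dict:
    """Combine semantic and keyword scores per document id.

    Both inputs map doc_id -> raw score. Scores are min-max normalized
    first, because cosine similarities and BM25 scores live on
    different scales. `alpha` weights the semantic side.
    """
    def normalize(scores: dict) -> dict:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero
        return {d: (s - lo) / span for d, s in scores.items()}

    v, k = normalize(vector_scores), normalize(keyword_scores)
    docs = set(v) | set(k)  # a doc may appear in only one result set
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in docs}
```

Tuning `alpha` is domain-dependent: heavier keyword weighting helps when exact terms (part numbers, names) dominate.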


Stage 4: Reranking (Optional)

Skip for small document collections. Add when you need higher precision.

Initial retrieval is fast but imprecise. Reranking is slower but more accurate.

How it works:

  1. Take the top-k results from retrieval (e.g., top 20)
  2. Run a reranking model that scores each chunk against the query
  3. Return the top-n highest-scoring chunks (e.g., top 5)

Reranking models are typically cross-encoders - they see both query and document together, enabling deeper comparison than embedding similarity.

When to add reranking:

  • Ambiguous queries
  • Need to reduce chunks sent to LLM (cost/context limits)
  • Precision matters more than latency
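The rerank step itself is just "re-sort by a better score, keep fewer." Here is a sketch where `slow_score` stands in for a cross-encoder call (e.g. a sentence-transformers CrossEncoder - not shown here):

```python
def rerank(query: str, chunks: list[str], slow_score, top_n: int = 5) -> list[str]:
    """Re-score retrieved chunks with a more accurate (slower) model.

    `slow_score(query, chunk)` is a placeholder for a cross-encoder
    that sees query and chunk together; higher means more relevant.
    """
    # Sort the fast-retrieval candidates by the slow, precise score.
    ranked = sorted(chunks, key=lambda c: slow_score(query, c), reverse=True)
    return ranked[:top_n]
```

The retrieval stage keeps recall high (top 20); reranking trades latency for precision on the final top-n.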

Stage 5: Augmentation

Now you have your retrieved chunks. Time to build the prompt.

Prompt structure:

<System prompt: role, instructions, constraints>

<Retrieved context: chunk 1, chunk 2, ..., chunk n>

<User query>

Key decisions:

1. Context ordering

  • Long-context behavior can be position-sensitive: models may underuse information buried in the middle of a long prompt.
  • A practical starting point is to order chunks by relevance/reranker score (and keep the total context short).
  • Experiment for your use case.

2. Source attribution

  • Include metadata (source filename, section, page number) in context
  • This enables citations in the response

3. Context length

  • More context = more information, but also more noise and cost
  • Start with a small handful of high-signal chunks (often single digits), then measure.
  • More chunks can improve recall, but can also dilute signal and increase hallucination risk.
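The three decisions above come together in prompt assembly. A minimal sketch, numbering each context section so the model can cite it:

```python
def build_prompt(system: str, chunks: list[tuple[str, str]], query: str) -> str:
    """Assemble the augmented prompt from retrieved chunks.

    `chunks` is an ordered list of (text, source) pairs - order by
    relevance before calling. Numbered sections enable [1], [2] citations.
    """
    context = "\n\n".join(
        f"[{i}] ({source}) {text}"
        for i, (text, source) in enumerate(chunks, 1)
    )
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {query}"
```

Because the chunks arrive pre-sorted, ordering and attribution are handled in one place; context length is controlled simply by how many pairs you pass in.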

Stage 6: Generation

The LLM generates a response using:

  • The system prompt (instructions)
  • The retrieved context (grounding)
  • The user query (task)

System prompt design for grounded answers:

You are a helpful assistant that answers questions based on the provided context.

Rules:
- Only use information from the provided context
- If the context doesn't contain the answer, say "I don't have that information"
- Cite sources using [1], [2] notation matching the context sections
- Do not make up information

This is the retrieval-generation contract: tell the model explicitly what it can and cannot do.


Stage 7: Citation

Users need to verify. Good RAG systems make this easy.

Citation patterns:

  1. Inline citations: "The refund policy is 30 days [1]."
  2. Source list: Include sources at the end with titles/links
  3. Highlighting: Show which chunks were used

The goal: every claim should be traceable to a source.


The Retrieval-Generation Contract

This is the mental model that makes RAG debugging tractable:

  • Retrieval: Find the right chunks
  • Generation: Synthesize from those chunks faithfully

When RAG fails, ask: was it a retrieval failure or a generation failure?

  • Retrieval failure: The right chunk wasn't in the top-k results
  • Generation failure: The right chunk was there, but the LLM ignored it or hallucinated anyway

These require different fixes. Don't optimize generation when retrieval is broken.


A Minimal RAG System

Here's what a basic implementation looks like. Framework choice doesn't matter here - this is about the pipeline, not the library.

Note: library APIs and model names change over time; treat this as a conceptual reference and pin versions in real projects.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# 1. Load and chunk documents
loader = PyPDFLoader("your_document.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 2. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Retrieve
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 4. Build prompt with context
template = """Answer based only on the following context:

{context}

Question: {question}

If you cannot answer from the context, say "I don't have that information."
"""
prompt = ChatPromptTemplate.from_template(template)

# 5. Generate
llm = ChatOpenAI(model="gpt-4o-mini")

def ask(question):
    docs = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    messages = prompt.invoke({"context": context, "question": question})
    return llm.invoke(messages).content  # return the answer text, not the message object

This is ~30 lines. That's a working RAG system.


Debug Checklist: RAG Issues

When your RAG system gives bad answers:

  1. Check retrieval first - Did the right chunks come back? Print them.
  2. Check chunk quality - Are chunks too small/large? Split mid-sentence?
  3. Check embedding match - Is the query embedding similar to relevant chunks?
  4. Check prompt - Is the system prompt clear about using only context?
  5. Check generation - Is the LLM ignoring context? Try temperature 0.
  6. Check citations - Can you trace the answer to a specific source?
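For check 3 (embedding match), you can compare by hand: embed the query and a chunk you expected to retrieve, then compute cosine similarity directly:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (-1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

If the similarity between the query and the chunk you wanted is lower than the similarity to the chunks you actually got, the problem is upstream of generation: rephrase the query, re-chunk, or try a different embedding model.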

Try This Yourself

Experiment 1: Build Minimal RAG

  1. Pick a PDF (company docs, research paper, user manual)
  2. Use the code above (or LlamaIndex equivalent)
  3. Ask 5 questions: 2 that should work, 2 edge cases, 1 completely off-topic
  4. For each: check what chunks were retrieved, then evaluate the answer

Experiment 2: Test the Contract

  1. Ask a question where the answer IS in your documents
  2. Print the retrieved chunks - is the answer there?
  3. If yes but answer is wrong → generation failure
  4. If no → retrieval failure
  5. Fix the right component

Key Takeaways

  1. RAG is a pipeline, not a single step: classify → expand → retrieve → rerank → augment → generate → cite
  2. Query classification avoids unnecessary retrieval
  3. Hybrid retrieval (vectors + keywords) often outperforms vector-only retrieval in domains with exact terms
  4. Reranking improves precision when you can afford the latency
  5. The retrieval-generation contract makes debugging tractable
  6. Citation isn't optional - it's how users verify and trust

Key Terms

  • RAG: Retrieval-Augmented Generation - the pattern of adding retrieved context to LLM prompts
  • Hybrid Search: combining vector (semantic) and keyword (BM25) retrieval
  • Reranking: re-scoring retrieved documents with a more accurate model
  • Query Expansion: enriching queries before retrieval (rewriting, decomposition, HyDE)
  • Grounding: constraining LLM output to information in the provided context
  • Cross-Encoder: a model that scores query-document pairs together (used for reranking)

What's Next

You've seen the full pipeline. But where do most RAG systems break?

In the next post, we'll cover Chunking Strategies - why the way you split documents can matter more than which embedding model you use.
