You've understood embeddings. You know text becomes vectors, and similar vectors cluster together.
Now the question: how do you use this to make LLMs actually know things?
That's what RAG (Retrieval-Augmented Generation) solves. And understanding it end-to-end is what separates "I built a chatbot" from "I understand LLM systems."
What RAG Actually Is
RAG is a pattern, not a product. The core idea:
- Store knowledge in a searchable form (usually embedded chunks)
- Retrieve relevant pieces when a query comes in
- Augment the prompt with that retrieved context
- Generate an answer grounded in the retrieved material
The LLM doesn't "learn" your data. It reads it at query time, every time.
Why this matters: RAG lets you use LLMs with private, current, or domain-specific information - without fine-tuning.
This post focuses on how RAG systems work, not on optimizing every stage. We'll cover optimizations in later posts.
The Full RAG Pipeline (Conceptual)
A production RAG system isn't only "embed → search → generate." Here's the conceptual flow:
Query → [Classify] → [Expand] → Retrieve → [Rerank] → Augment → Generate → Cite
Stages in brackets are optimizations - not required for your first RAG system, but important to understand.
Let's walk through each stage.
Stage 1: Query Classification (Optional)
Skip this for your first RAG system. Add it later when optimizing for cost and latency.
Not every query needs retrieval.
Before searching your knowledge base, ask: does this query actually require external context?
Examples that don't need retrieval:
- "What's 2 + 2?"
- "Explain what RAG stands for" (general knowledge)
- Chitchat
Examples that do:
- "What's our refund policy?"
- "Summarize last quarter's sales report"
Query classification saves latency and cost - but it's an optimization, not a core requirement.
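The routing decision above can be sketched in a few lines. This is a minimal heuristic version, assuming illustrative keyword lists; production systems often use a small, fast LLM call to classify instead:

```python
# A minimal sketch of query classification. The keyword lists below are
# illustrative assumptions, not a real taxonomy - swap in an LLM-based
# classifier when accuracy matters.

NO_RETRIEVAL_PREFIXES = ("hello", "hi", "thanks", "what's 2")
KNOWLEDGE_HINTS = ("our", "policy", "report", "last quarter", "internal")

def needs_retrieval(query: str) -> bool:
    q = query.lower()
    # Chitchat and trivial questions skip the knowledge base entirely.
    if any(q.startswith(p) for p in NO_RETRIEVAL_PREFIXES):
        return False
    # Queries referencing private or company-specific material need retrieval.
    if any(hint in q for hint in KNOWLEDGE_HINTS):
        return True
    # Default to retrieving: a wasted search is cheaper than a wrong answer.
    return True
```

Note the default: when unsure, retrieve. The cost of an unnecessary search is usually lower than the cost of an ungrounded answer.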
Stage 2: Query Expansion (Optional)
Skip this for your first RAG system. Add it when retrieval quality becomes the bottleneck.
User queries are often messy, ambiguous, or poorly phrased for semantic search.
Query expansion generates variations or enriched versions of the query:
- Rewriting: Rephrase for clarity ("refund?" → "What is the refund policy?")
- Decomposition: Break complex questions into sub-questions
- HyDE: Generate a hypothetical answer, then search for documents similar to that answer (powerful but adds latency)
Query expansion is an optimization - not a requirement for understanding or building RAG.
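The three expansion techniques above share one shape: prompt an LLM, collect variants, search with all of them. Here is a sketch where `llm` is any callable mapping a prompt string to a completion string - an assumed interface, not a specific library API:

```python
# A sketch of query expansion. `llm` stands in for any text-in, text-out
# model call (an assumption here, not a real library interface).

def expand_query(query: str, llm) -> dict:
    # Rewriting: turn a terse query into a clear, complete question.
    rewritten = llm(f"Rewrite this as a clear, complete question: {query}")
    # Decomposition: split a complex question into simpler sub-questions.
    sub_questions = llm(
        f"Break this question into simpler sub-questions, one per line: {query}"
    ).splitlines()
    # HyDE: embed a hypothetical *answer*, since answers often sit closer
    # to relevant documents in embedding space than questions do.
    hyde_doc = llm(f"Write a short passage that answers: {query}")
    return {"rewritten": rewritten, "sub_questions": sub_questions, "hyde": hyde_doc}
```

Each variant then goes through retrieval, and the results are merged - which is exactly why expansion adds latency: it multiplies the number of searches.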
Stage 3: Retrieval
This is where embeddings come in.
Basic retrieval:
- Embed the query
- Compare to embedded chunks in your vector database
- Return top-k similar chunks
Hybrid retrieval (increasingly common):
- Combine vector search (semantic) with keyword search (BM25)
- Weighted combination of scores
- Catches cases where exact keywords matter
Many production systems use hybrid retrieval because pure semantic search has blind spots (rare terms, proper nouns, exact phrases).
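The weighted combination above can be sketched directly. This version min-max normalizes each retriever's scores before mixing; real systems often use Reciprocal Rank Fusion instead, and the 0.5/0.5 split is an arbitrary starting point, not a recommendation:

```python
# A sketch of hybrid score fusion: normalize each retriever's scores to
# [0, 1], then take a weighted sum. Scores are {doc_id: score} dicts.

def normalize(scores: dict) -> dict:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(vector_scores: dict, bm25_scores: dict, alpha: float = 0.5) -> list:
    v, k = normalize(vector_scores), normalize(bm25_scores)
    docs = set(v) | set(k)
    # A document found by only one retriever scores 0 from the other.
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

The key design point: normalization is required because cosine similarities (roughly 0-1) and BM25 scores (unbounded) live on different scales; summing them raw lets one retriever dominate.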
Stage 4: Reranking (Optional)
Skip for small document collections. Add when you need higher precision.
Initial retrieval is fast but imprecise. Reranking is slower but more accurate.
How it works:
- Take the top-k results from retrieval (e.g., top 20)
- Run a reranking model that scores each chunk against the query
- Return the top-n highest-scoring chunks (e.g., top 5)
Reranking models are typically cross-encoders - they see both query and document together, enabling deeper comparison than embedding similarity.
When to add reranking:
- Ambiguous queries
- Need to reduce chunks sent to LLM (cost/context limits)
- Precision matters more than latency
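The retrieve-wide, rerank-narrow flow above is simple to express. In this sketch `score_fn` stands in for a real cross-encoder (e.g., a sentence-transformers `CrossEncoder`'s predict call); it is injected here so the pipeline shape stays visible:

```python
# A sketch of the rerank step: score each retrieved chunk against the
# query, keep the best few. `score_fn(query, chunk) -> float` is an
# injected stand-in for a real cross-encoder model.

def rerank(query: str, chunks: list, score_fn, top_n: int = 5) -> list:
    # Cross-encoders see the (query, chunk) pair together, enabling a
    # deeper relevance judgment than embedding similarity allows.
    scored = [(score_fn(query, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]
```

Because `score_fn` runs once per (query, chunk) pair, cost scales linearly with the candidate count - which is why you rerank 20 chunks, not 20,000.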
Stage 5: Augmentation
Now you have your retrieved chunks. Time to build the prompt.
Prompt structure:
<System prompt: role, instructions, constraints>
<Retrieved context: chunk 1, chunk 2, ..., chunk n>
<User query>
Key decisions:
1. Context ordering
- Long-context behavior can be position-sensitive: models may underuse information buried in the middle of a long prompt.
- A practical starting point is to order chunks by relevance/reranker score (and keep the total context short).
- Experiment for your use case.
2. Source attribution
- Include metadata (source filename, section, page number) in context
- This enables citations in the response
3. Context length
- More context = more information, but also more noise and cost
- Start with a small handful of high-signal chunks (often single digits), then measure.
- More chunks can improve recall, but can also dilute signal and increase hallucination risk.
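The three decisions above meet in the prompt-assembly step. A minimal sketch, assuming chunks arrive relevance-ordered as `{"text": ..., "source": ...}` dicts (an illustrative shape, not a library's):

```python
# A sketch of prompt assembly: number each chunk and carry its source
# metadata inline, so the model can emit [1], [2] citations later.

def build_prompt(system: str, chunks: list, question: str) -> str:
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        # Chunks should be passed best-first (decision 1: context ordering).
        context_lines.append(f"[{i}] (source: {chunk['source']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"
```

The numbered `[i]` labels are what make Stage 7's citations possible: the model cites the label, and you map it back to the source metadata.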
Stage 6: Generation
The LLM generates a response using:
- The system prompt (instructions)
- The retrieved context (grounding)
- The user query (task)
System prompt design for grounded answers:
You are a helpful assistant that answers questions based on the provided context.
Rules:
- Only use information from the provided context
- If the context doesn't contain the answer, say "I don't have that information"
- Cite sources using [1], [2] notation matching the context sections
- Do not make up information
This is the retrieval-generation contract: tell the model explicitly what it can and cannot do.
Stage 7: Citation
Users need to verify. Good RAG systems make this easy.
Citation patterns:
- Inline citations: "The refund policy is 30 days [1]."
- Source list: Include sources at the end with titles/links
- Highlighting: Show which chunks were used
The goal: every claim should be traceable to a source.
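Tracing claims back to sources is mostly mechanical once the context chunks were numbered. A sketch that pulls `[n]` markers out of a generated answer and resolves them against the source list:

```python
import re

# A sketch of citation resolution: extract [n] markers from the answer
# and map them back to the sources used to build the context.

def cited_sources(answer: str, sources: list) -> list:
    seen = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        idx = int(match.group(1))
        # Keep first-appearance order, drop duplicates and out-of-range refs.
        if idx not in seen and 1 <= idx <= len(sources):
            seen.append(idx)
    return [sources[i - 1] for i in seen]
```

An out-of-range citation (e.g., `[7]` when you sent five chunks) is itself a useful signal: the model is inventing references, which points at a generation problem.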
The Retrieval-Generation Contract
This is the mental model that makes RAG debugging tractable:
| Component | Responsibility |
|---|---|
| Retrieval | Find the right chunks |
| Generation | Synthesize from those chunks faithfully |
When RAG fails, ask: was it a retrieval failure or a generation failure?
- Retrieval failure: The right chunk wasn't in the top-k results
- Generation failure: The right chunk was there, but the LLM ignored it or hallucinated anyway
These require different fixes. Don't optimize generation when retrieval is broken.
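The contract suggests a first-pass diagnostic: given a question whose ground-truth snippet you know, check whether that snippet ever reached the prompt. This sketch uses crude substring matching - a simplifying assumption, but enough to split failures into the two buckets:

```python
# A sketch of failure triage under the retrieval-generation contract.
# `gold_snippet` is a phrase you know should appear in a correct answer.

def diagnose(retrieved_chunks: list, answer: str, gold_snippet: str) -> str:
    in_context = any(gold_snippet.lower() in c.lower() for c in retrieved_chunks)
    in_answer = gold_snippet.lower() in answer.lower()
    if not in_context:
        return "retrieval failure"   # fix chunking/embeddings/search first
    if not in_answer:
        return "generation failure"  # context was there; fix prompt/model
    return "ok"
```

Run this over a handful of known-answer questions before touching anything, so you optimize the component that is actually broken.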
A Minimal RAG System
Here's what a basic implementation looks like. Framework choice doesn't matter here - this is about the pipeline, not the library.
Note: library APIs and model names change over time; treat this as a conceptual reference and pin versions in real projects.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# 1. Load and chunk documents
loader = PyPDFLoader("your_document.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# 2. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 3. Retrieve
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# 4. Build prompt with context
template = """Answer based only on the following context:
{context}
Question: {question}
If you cannot answer from the context, say "I don't have that information."
"""
prompt = ChatPromptTemplate.from_template(template)
# 5. Generate
llm = ChatOpenAI(model="gpt-4o-mini")
def ask(question):
    docs = retriever.invoke(question)
    context = "\n\n".join([d.page_content for d in docs])
    messages = prompt.invoke({"context": context, "question": question})
    return llm.invoke(messages)
This is ~30 lines. That's a working RAG system.
Debug Checklist: RAG Issues
When your RAG system gives bad answers:
- Check retrieval first - Did the right chunks come back? Print them.
- Check chunk quality - Are chunks too small/large? Split mid-sentence?
- Check embedding match - Is the query embedding similar to relevant chunks?
- Check prompt - Is the system prompt clear about using only context?
- Check generation - Is the LLM ignoring context? Try temperature 0.
- Check citations - Can you trace the answer to a specific source?
Try This Yourself
Experiment 1: Build Minimal RAG
- Pick a PDF (company docs, research paper, user manual)
- Use the code above (or LlamaIndex equivalent)
- Ask 5 questions: 2 that should work, 2 edge cases, 1 completely off-topic
- For each: check what chunks were retrieved, then evaluate the answer
Experiment 2: Test the Contract
- Ask a question where the answer IS in your documents
- Print the retrieved chunks - is the answer there?
- If yes but answer is wrong → generation failure
- If no → retrieval failure
- Fix the right component
Key Takeaways
- RAG is a pipeline, not a single step: classify → expand → retrieve → rerank → augment → generate → cite
- Query classification avoids unnecessary retrieval
- Hybrid retrieval (vectors + keywords) often outperforms vector-only retrieval in domains with exact terms
- Reranking improves precision when you can afford the latency
- The retrieval-generation contract makes debugging tractable
- Citation isn't optional - it's how users verify and trust
Key Terms
| Term | Meaning |
|---|---|
| RAG | Retrieval-Augmented Generation - pattern of adding retrieved context to LLM prompts |
| Hybrid Search | Combining vector (semantic) and keyword (BM25) retrieval |
| Reranking | Re-scoring retrieved documents with a more accurate model |
| Query Expansion | Enriching queries before retrieval (rewriting, decomposition, HyDE) |
| Grounding | Constraining LLM output to information in provided context |
| Cross-Encoder | Model that scores query-document pairs together (used for reranking) |
Further Reading
- Retrieval-Augmented Generation (Lewis et al., 2020): https://arxiv.org/abs/2005.11401
- Long-context position effects (“Lost in the Middle”, Liu et al., 2023): https://arxiv.org/abs/2307.03172
What's Next
You've seen the full pipeline. But where do most RAG systems break?
In the next post, we'll cover Chunking Strategies - why how you split documents matters more than which embedding model you use.