The Open-Book Exam Analogy
Imagine two exam scenarios:
Closed-book exam: You try to answer from memory. If you forgot a fact, you might guess or get it wrong.
Open-book exam: You can look up information in your notes before answering. Your answers are more accurate and grounded in real sources.
RAG (Retrieval-Augmented Generation) gives AI an open-book exam.
Instead of relying solely on what the model learned during training, RAG retrieves relevant documents first, then generates answers based on that retrieved context. The result: more accurate, up-to-date, and verifiable responses.
How RAG Actually Works
RAG has three main steps: Index, Retrieve, Generate.
The RAG Pipeline
User Query
│
▼
┌─────────────────┐
│ 1. EMBED QUERY │ ← Convert query to vector
└────────┬────────┘
│
▼
┌─────────────────┐
│ 2. RETRIEVE │ ← Find similar documents
│ (Vector Search)│
└────────┬────────┘
│
▼
┌─────────────────┐
│ 3. GENERATE │ ← LLM answers with context
│ (LLM + Context)│
└─────────────────┘
Step 1: Indexing (Done Once)
Before RAG can work, you need to prepare your documents:
# Split documents into chunks
chunks = split_into_chunks(documents, chunk_size=500)
# Convert each chunk to an embedding vector
embeddings = embedding_model.encode(chunks)
# Store in a vector database
vector_db.insert(chunks, embeddings)
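The indexing step can be sketched end-to-end in plain Python. Note that `split_into_chunks` and the embedding here are toy stand-ins (a hash-based vector instead of a real embedding model, an in-memory list instead of a vector database) just to make the data flow concrete:

```python
import hashlib

def split_into_chunks(text, chunk_size=500):
    """Split text into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def toy_embed(text, dim=8):
    """Stand-in for a real embedding model: hash words into a small vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

document = "RAG retrieves relevant documents before generating an answer. " * 30
chunks = split_into_chunks(document, chunk_size=100)
# "vector database": here just a list of (chunk, embedding) pairs
index = [(chunk, toy_embed(chunk)) for chunk in chunks]
print(len(chunks), len(index[0][1]))  # 19 8
```

In a real system you would swap `toy_embed` for an actual embedding model and the list for a vector database, but the shape of the pipeline is the same.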
Step 2: Retrieval (Per Query)
When a user asks a question:
# Convert the query to a vector
query_embedding = embedding_model.encode(user_query)
# Find the most similar document chunks
relevant_chunks = vector_db.search(query_embedding, top_k=5)
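Under the hood, `vector_db.search` is a nearest-neighbour lookup. A minimal in-memory version using cosine similarity (the store and vectors below are toy data, not a real database):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, store, top_k=5):
    """Return the top_k (chunk, score) pairs, most similar first."""
    scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

store = [
    ("refund policy chunk", [1.0, 0.0, 0.0]),
    ("shipping info chunk", [0.0, 1.0, 0.0]),
    ("warranty chunk", [0.7, 0.7, 0.0]),
]
results = search([1.0, 0.1, 0.0], store, top_k=2)
print(results[0][0])  # refund policy chunk
```

Production vector databases use approximate nearest-neighbour indexes to make this fast at scale, but the ranking logic is the same.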
Step 3: Generation (Per Query)
Send the retrieved context to the LLM:
prompt = f"""
Answer the question based on the following context.
If the context doesn't contain the answer, say "I don't know."
Context:
{relevant_chunks}
Question: {user_query}
"""
response = llm.generate(prompt)
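Since the retrieved chunks arrive as a list, they need to be formatted into the prompt; numbering them also makes it easy for the model to cite which chunk it used. A small sketch of that assembly step (the exact prompt wording is a design choice, not a fixed API):

```python
def build_prompt(user_query, relevant_chunks):
    """Join retrieved chunks into a numbered context block."""
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(relevant_chunks, start=1)
    )
    return (
        "Answer the question based on the following context.\n"
        'If the context doesn\'t contain the answer, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {user_query}"
    )

prompt = build_prompt("What is the refund policy?", ["Chunk A", "Chunk B"])
print(prompt)
```

The resulting string is what actually gets sent to `llm.generate`.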
Why RAG Matters
| Problem with plain LLMs | How RAG solves it |
|---|---|
| Outdated knowledge (training cutoff) | Retrieves current documents |
| Hallucinations | Grounds answers in sources |
| No access to private data | Searches your own documents |
| Generic answers | Provides company-specific info |
| Hard to verify | Can cite sources |
Real-World Examples
1. Customer Support Bot
# User asks about refund policy
query = "What is your refund policy?"
# RAG retrieves relevant policy documents
context = retrieve(query)
# -> "Refunds are available within the stated window..."
# LLM generates grounded answer
answer = generate(query, context)
# -> "You can get a full refund within the stated window after purchase."
2. Code Documentation Search
# Developer asks about a function
query = "How do I use the authenticate() function?"
# RAG finds relevant docstrings and examples
context = retrieve(query, collection="codebase")
# LLM explains with actual code examples from the docs
answer = generate(query, context)
3. Legal Document Analysis
# Lawyer asks about contract terms
query = "What are the termination clauses?"
# RAG searches through the contract
context = retrieve(query, collection="contracts")
# LLM summarizes relevant sections
answer = generate(query, context)
Common Mistakes and Gotchas
Chunk Size Too Large or Too Small
- Too large: Irrelevant information dilutes the context
- Too small: Important context gets split across chunks
Start with a moderate chunk size and a small overlap, then adjust based on retrieval quality.
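A simple character-based chunker with overlap looks like this (the size and overlap values are illustrative starting points, not recommendations for every corpus):

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Slide a window of chunk_size, stepping chunk_size - overlap each time."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_with_overlap("abcdefghij" * 20, chunk_size=100, overlap=20)
# consecutive chunks share their last/first 20 characters
print(len(chunks), chunks[0][-20:] == chunks[1][:20])  # 3 True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.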
Not Including Metadata
Store metadata (source, date, section) with chunks for filtering and citations:
{
"text": "Refunds available within the stated window...",
"source": "refund_policy.pdf",
"page": 2,
"updated": "YYYY-MM-DD"
}
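With metadata stored alongside each chunk, retrieval can be restricted before or after the vector search. A minimal filter sketch; the field names mirror the example above, and the chunk texts and dates are made up for illustration:

```python
chunks = [
    {"text": "Refunds available within the stated window...",
     "source": "refund_policy.pdf", "page": 2, "updated": "2024-01-15"},
    {"text": "Shipping times vary by region.",
     "source": "shipping.pdf", "page": 1, "updated": "2023-06-01"},
]

def filter_chunks(chunks, source=None, updated_after=None):
    """Keep only chunks matching the given metadata constraints."""
    result = []
    for c in chunks:
        if source is not None and c["source"] != source:
            continue
        # ISO-8601 date strings compare correctly as plain strings
        if updated_after is not None and c["updated"] <= updated_after:
            continue
        result.append(c)
    return result

recent = filter_chunks(chunks, updated_after="2023-12-31")
print([c["source"] for c in recent])  # ['refund_policy.pdf']
```

The same metadata also lets the final answer cite its source and page.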
Ignoring Retrieved Document Quality
If retrieval returns irrelevant documents, the LLM can generate poor answers. It often helps to:
- Check retrieval quality first
- Use reranking to improve relevance
- Filter by metadata when appropriate
Stuffing Too Much Context
LLMs have context limits. Retrieving 50 documents and cramming them all in doesn't work. Retrieve 3-5 highly relevant chunks instead.
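One way to respect the context limit is to add chunks in relevance order until a budget runs out. This sketch uses a character budget as a rough proxy for tokens (a real system would count tokens with the model's tokenizer):

```python
def fit_to_budget(ranked_chunks, max_chars=2000):
    """Take chunks in relevance order until the character budget is exhausted."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected

ranked = ["a" * 900, "b" * 900, "c" * 900]
print(len(fit_to_budget(ranked, max_chars=2000)))  # 2
```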
RAG vs Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge update | Instant (update documents) | Requires retraining |
| Cost | Lower (no training) | Higher (GPU time) |
| Private data | Easy to add | Needs careful training |
| Hallucinations | Reduced (grounded) | Still possible |
| Often used for | Facts, docs, search | Style, format, behavior |
Use RAG when: You need current, factual, verifiable answers from your own data.
Use fine-tuning when: You need to change how the model writes or behaves, not what it knows.
FAQ
Q: What embedding model should I use?
Common options include managed embedding APIs or open-source sentence-transformer style models. Choose based on quality, cost, latency, and whether you need local hosting.
Q: What vector database should I use?
Options include Pinecone (managed), Weaviate (open-source), Qdrant (open-source), Chroma (lightweight), and pgvector (PostgreSQL extension). Start with Chroma for prototyping.
Q: How do I handle multiple document types?
Convert everything to text first. Use libraries like unstructured, PyPDF2, or python-docx to extract text from PDFs, Word docs, and HTML. Store the source type in metadata.
Q: Can RAG work with images or videos?
Yes, with multimodal embeddings. Models like CLIP can embed images alongside text. For videos, extract keyframes and transcripts first.
Q: How do I evaluate RAG quality?
Measure retrieval (precision, recall, MRR) and generation (faithfulness, relevance). Tools like Ragas, TruLens, and LangSmith help automate evaluation.
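Retrieval precision and recall are easy to compute by hand once you have labelled which chunks are relevant for a query. A quick sketch with made-up chunk IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved chunks that are relevant.
    Recall: fraction of relevant chunks that were retrieved."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

p, r = precision_recall(
    retrieved=["c1", "c2", "c3", "c4"],
    relevant=["c1", "c3", "c9"],
)
print(p, r)  # 0.5 0.666...
```

The harder part in practice is building the labelled set of relevant chunks per query; the evaluation tools mentioned above help automate that.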
Q: What is hybrid search?
Combining vector search with keyword search. Vector search finds semantically similar content, while keyword search catches exact matches. Many vector databases support both.
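A hybrid score can be as simple as a weighted sum of a vector score and a keyword score, both normalised to [0, 1]. The 0.7/0.3 split below is illustrative; real systems often use more principled fusion such as reciprocal rank fusion:

```python
def keyword_score(query, text):
    """Fraction of query words that appear in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def hybrid_score(vector_score, kw_score, alpha=0.7):
    """Weighted blend: alpha on vector similarity, the rest on keywords."""
    return alpha * vector_score + (1 - alpha) * kw_score

kw = keyword_score("refund policy", "our refund policy explained")
score = hybrid_score(vector_score=0.8, kw_score=kw)
print(round(score, 2))  # 0.86
```

The keyword component catches exact terms (product codes, function names) that embeddings sometimes blur together.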
Summary
RAG transforms LLMs from "students reciting from memory" into "researchers with access to a library." It's a practical way to build AI applications that need accurate, current, and verifiable information.
Key Points:
- RAG = Retrieve relevant documents, then generate with context
- Requires: embedding model, vector database, LLM
- Reduces hallucinations by grounding answers in sources
- Knowledge updates are instant (just update documents)
- Chunk size and retrieval quality are critical
- Often a better fit than fine-tuning for factual, up-to-date information
If you're building an AI application that needs to answer questions about specific documents or data, RAG is often a good fit.