
📚 RAG

An open-book exam for AI

The Open-Book Exam Analogy

Imagine two exam scenarios:

Closed-book exam: You try to answer from memory. If you forgot a fact, you might guess or get it wrong.

Open-book exam: You can look up information in your notes before answering. Your answers are more accurate and grounded in real sources.

RAG (Retrieval-Augmented Generation) gives AI an open-book exam.

Instead of relying solely on what the model learned during training, RAG retrieves relevant documents first, then generates answers based on that retrieved context. The result: more accurate, up-to-date, and verifiable responses.


How RAG Actually Works

RAG has three main steps: Index, Retrieve, Generate.

The RAG Pipeline

User Query
    │
    ▼
┌─────────────────┐
│  1. EMBED QUERY │ ← Convert query to vector
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  2. RETRIEVE    │ ← Find similar documents
│  (Vector Search)│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  3. GENERATE    │ ← LLM answers with context
│  (LLM + Context)│
└─────────────────┘

Step 1: Indexing (Done Once)

Before RAG can work, you need to prepare your documents:

# Split documents into chunks
chunks = split_into_chunks(documents, chunk_size=500)

# Convert each chunk to an embedding vector
embeddings = embedding_model.encode(chunks)

# Store in a vector database
vector_db.insert(chunks, embeddings)
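The `split_into_chunks` helper above is left abstract. A minimal character-based version, with a small overlap so that sentences cut at a boundary still appear whole in one chunk, might look like this (the overlap parameter is an assumption not shown above):

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks with a small overlap."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than a full chunk so chunks share an overlap
        start += chunk_size - overlap
    return chunks

doc = "RAG retrieves relevant documents before generating an answer. " * 20
chunks = split_into_chunks(doc, chunk_size=200, overlap=20)
print(len(chunks))  # -> 7
```

Real pipelines usually split on sentence or paragraph boundaries rather than raw characters, but the shape of the loop is the same.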

Step 2: Retrieval (Per Query)

When a user asks a question:

# Convert the query to a vector
query_embedding = embedding_model.encode(user_query)

# Find the most similar document chunks
relevant_chunks = vector_db.search(query_embedding, top_k=5)
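Under the hood, the search is a nearest-neighbor lookup by cosine similarity. Here is a toy, self-contained sketch of that idea: `embed` is a bag-of-letters stand-in for a real embedding model, and `search` is a brute-force stand-in for a vector database index, not any real API:

```python
import math

def embed(text):
    # Toy embedding: counts of each letter (stand-in for a real model)
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, index, top_k=2):
    # Brute-force scan: score every chunk, return the top_k most similar
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

docs = ["refund policy details", "api authentication guide", "refund window rules"]
index = [(d, embed(d)) for d in docs]
results = search(embed("refund"), index, top_k=2)
print(results)
```

Real vector databases replace the brute-force scan with approximate nearest-neighbor indexes so the lookup stays fast at millions of chunks.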

Step 3: Generation (Per Query)

Send the retrieved context to the LLM:

prompt = f"""
Answer the question based on the following context.
If the context doesn't contain the answer, say "I don't know."

Context:
{relevant_chunks}

Question: {user_query}
"""

response = llm.generate(prompt)
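The prompt assembly step is worth making concrete. A sketch of joining retrieved chunks, with their sources attached so the model can cite them (the chunk dictionaries and `build_prompt` helper here are illustrative assumptions, not part of any library):

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks and their sources."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (
        "Answer the question based on the following context.\n"
        'If the context doesn\'t contain the answer, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )

chunks = [
    {"source": "refund_policy.pdf", "text": "Refunds are available within the stated window."},
    {"source": "faq.md", "text": "Contact support to start a refund."},
]
prompt = build_prompt("What is your refund policy?", chunks)
print(prompt)
```

Tagging each chunk with its source in the prompt is what lets the generated answer cite where a claim came from.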

Why RAG Matters

Problem with plain LLMs               How RAG solves it
Outdated knowledge (training cutoff)  Retrieves current documents
Hallucinations                        Grounds answers in sources
No access to private data             Searches your own documents
Generic answers                       Provides company-specific info
Hard to verify                        Can cite sources

Real-World Examples

1. Customer Support Bot

# User asks about refund policy
query = "What is your refund policy?"

# RAG retrieves relevant policy documents
context = retrieve(query)
# -> "Refunds are available within the stated window..."

# LLM generates grounded answer
answer = generate(query, context)
# -> "You can get a full refund within the stated window after purchase."

2. Code Documentation Assistant

# Developer asks about a function
query = "How do I use the authenticate() function?"

# RAG finds relevant docstrings and examples
context = retrieve(query, collection="codebase")

# LLM explains with actual code examples from the docs
answer = generate(query, context)

3. Legal Document Analysis

# Lawyer asks about contract terms
query = "What are the termination clauses?"

# RAG searches through the contract
context = retrieve(query, collection="contracts")

# LLM summarizes relevant sections
answer = generate(query, context)

Common Mistakes and Gotchas

Chunk Size Too Large or Too Small

  • Too large: Irrelevant information dilutes the context
  • Too small: Important context gets split across chunks

Start with a moderate chunk size and a small overlap, then adjust based on retrieval quality.

Not Including Metadata

Store metadata (source, date, section) with chunks for filtering and citations:

{
  "text": "Refunds available within the stated window...",
  "source": "refund_policy.pdf",
  "page": 2,
  "updated": "YYYY-MM-DD"
}

Ignoring Retrieved Document Quality

If retrieval returns irrelevant documents, the LLM can generate poor answers. It often helps to:

  • Check retrieval quality first
  • Use reranking to improve relevance
  • Filter by metadata when appropriate
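Metadata filtering is simple to sketch: before (or after) vector search, keep only chunks whose metadata matches the query's constraints. The helper and sample chunks below are hypothetical illustrations:

```python
def filter_by_metadata(chunks, **criteria):
    """Keep only chunks whose metadata matches every given criterion."""
    return [
        c for c in chunks
        if all(c.get(key) == value for key, value in criteria.items())
    ]

chunks = [
    {"text": "Old policy...", "source": "refund_policy.pdf", "year": 2021},
    {"text": "Current policy...", "source": "refund_policy.pdf", "year": 2024},
    {"text": "API guide...", "source": "api_docs.md", "year": 2024},
]
current = filter_by_metadata(chunks, source="refund_policy.pdf", year=2024)
print([c["text"] for c in current])  # -> ['Current policy...']
```

Most vector databases support this kind of filtering natively, so the filter can run inside the index rather than in application code.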

Stuffing Too Much Context

LLMs have context limits. Retrieving 50 documents and cramming them all in doesn't work. Retrieve 3-5 highly relevant chunks instead.
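A simple way to enforce that limit is a greedy budget: walk the chunks in relevance order and stop once the next one would overflow. This sketch uses a character budget as a rough proxy for tokens:

```python
def fit_to_budget(chunks, max_chars=1000):
    """Keep highest-ranked chunks until the character budget is spent."""
    kept, used = [], 0
    for chunk in chunks:  # chunks assumed sorted by relevance, best first
        if used + len(chunk) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept

chunks = ["a" * 400, "b" * 400, "c" * 400]
print(len(fit_to_budget(chunks, max_chars=1000)))  # -> 2
```

Production systems usually count actual tokens with the model's tokenizer instead of characters, but the greedy cutoff is the same.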


RAG vs Fine-Tuning

Aspect            RAG                          Fine-Tuning
Knowledge update  Instant (update documents)   Requires retraining
Cost              Lower (no training)          Higher (GPU time)
Private data      Easy to add                  Needs careful training
Hallucinations    Reduced (grounded)           Still possible
Often used for    Facts, docs, search          Style, format, behavior

Use RAG when: You need current, factual, verifiable answers from your own data.

Use fine-tuning when: You need to change how the model writes or behaves, not what it knows.


FAQ

Q: What embedding model should I use?

Common options include managed embedding APIs or open-source sentence-transformer style models. Choose based on quality, cost, latency, and whether you need local hosting.

Q: What vector database should I use?

Options include Pinecone (managed), Weaviate (open-source), Qdrant (open-source), Chroma (lightweight), and pgvector (PostgreSQL extension). Start with Chroma for prototyping.

Q: How do I handle multiple document types?

Convert everything to text first. Use libraries like unstructured, PyPDF2, or python-docx to extract text from PDFs, Word docs, and HTML. Store the source type in metadata.

Q: Can RAG work with images or videos?

Yes, with multimodal embeddings. Models like CLIP can embed images alongside text. For videos, extract keyframes and transcripts first.

Q: How do I evaluate RAG quality?

Measure retrieval (precision, recall, MRR) and generation (faithfulness, relevance). Tools like Ragas, TruLens, and LangSmith help automate evaluation.
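Two of the retrieval metrics mentioned above are easy to compute by hand. Precision@k asks how many of the top-k results were relevant; reciprocal rank asks how high the first relevant result appeared:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result (0.0 if none is found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc_b", "doc_a", "doc_d"]   # what the retriever returned, in order
relevant = {"doc_a", "doc_c"}             # ground-truth relevant documents
print(precision_at_k(retrieved, relevant, k=3))  # -> 0.3333...
print(reciprocal_rank(retrieved, relevant))      # -> 0.5
```

MRR is just the mean of the reciprocal rank over a set of test queries; tools like Ragas wrap these metrics and add LLM-judged faithfulness scores on the generation side.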

Q: What is hybrid search?

Combining vector search with keyword search. Vector search finds semantically similar content, while keyword search catches exact matches. Many vector databases support both.
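The hybrid idea can be sketched as a weighted blend of the two scores. Everything here is illustrative: `keyword_score` is a crude term-overlap measure, and the `vector_scores` values are made-up stand-ins for cosine similarities from an embedding model:

```python
def keyword_score(query, doc):
    """Fraction of query terms that appear verbatim in the document."""
    terms = query.lower().split()
    text = doc.lower()
    return sum(1 for t in terms if t in text) / len(terms)

def hybrid_rank(query, docs, vector_scores, alpha=0.5):
    """Blend vector similarity with keyword overlap; alpha weights the vector side."""
    scored = [
        (alpha * vector_scores[d] + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    scored.sort(reverse=True)
    return [d for _, d in scored]

docs = ["refund policy details", "returns and refunds FAQ", "api error codes"]
vector_scores = {docs[0]: 0.82, docs[1]: 0.74, docs[2]: 0.10}
print(hybrid_rank("refund policy", docs, vector_scores))
```

In practice the keyword side is usually BM25 rather than raw term overlap, and the blend is often done with reciprocal rank fusion instead of a weighted sum, but the principle is the same.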


Summary

RAG transforms LLMs from "students reciting from memory" into "researchers with access to a library." It's a practical way to build AI applications that need accurate, current, and verifiable information.

Key Points:

  • RAG = Retrieve relevant documents, then generate with context
  • Requires: embedding model, vector database, LLM
  • Reduces hallucinations by grounding answers in sources
  • Knowledge updates are instant (just update documents)
  • Chunk size and retrieval quality are critical
  • Often a better fit than fine-tuning for factual, up-to-date information

If you're building an AI application that needs to answer questions about specific documents or data, RAG is often a good fit.
