Your RAG system works. The demo goes well.
Then someone asks:
"How do we know it's actually good?"
If your answer is "we tried a few questions and it looked right," you don't have evaluation. You have hope.
This post is about building the evaluation mindset: what to measure, how to separate failure modes, and how to create feedback loops that prevent regressions.
Evaluation Is a Decision Tool
Evaluation exists for one purpose:
So you can make changes without guessing.
You want to answer questions like:
- Did retrieval improve or get worse?
- Did hallucinations drop?
- Did user outcomes improve after shipping this change?
- What broke when we changed chunking / reranking / model?
Without measurement, every change is a new gamble.
Offline vs Online Evaluation
There are two different problems:
Offline (batch) evaluation
You run a fixed set of queries against your system and score results.
- good for regressions
- good for comparing variants
- cheap and repeatable
- still doesn't fully capture real user behavior
Online (production) evaluation
You measure what happens to real users.
- task completion, satisfaction, escalations
- A/B tests when you can
- monitoring for drift over time
Rule: Offline eval helps you ship safer. Online eval tells you whether it mattered.
What to Evaluate
LLM apps usually fail in three layers. Measure each separately.
Layer 1: Retrieval quality
For RAG, a good first question is:
Did we retrieve the right chunks?
Common retrieval metrics:
- Precision@k: of the top-k retrieved chunks, how many are relevant?
- Recall@k: of all relevant chunks, how many were retrieved in top-k?
- MRR: how high is the first relevant chunk ranked?
If you want concrete definitions:
- Precision@k = (relevant chunks in top-k) / k
- Recall@k = (relevant chunks in top-k) / (all relevant chunks)
- MRR = mean over queries of 1 / rank of the first relevant chunk (or 0 if none are retrieved)
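These definitions translate directly into code. A minimal sketch, assuming you have relevance labels as a set of chunk IDs per query (the function names are illustrative, not from any library):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0

# MRR is the mean of reciprocal_rank across all queries in the eval set.
```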
These metrics require relevance labels. You can get them in three ways:
- Human labeling (highest quality, expensive)
- LLM-as-judge labeling (scalable, needs calibration)
- Synthetic query generation (good for coverage, can be unrealistic)
A very practical proxy label for early-stage systems:
"Is the answer present in the retrieved context?"
This won't tell you which chunk was relevant, but it catches the failure mode that matters most: retrieval didn't bring the answer to the model.
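A crude sketch of this proxy check, using normalized substring matching (real systems often upgrade this to token overlap or an LLM check, since paraphrased answers won't match literally):

```python
import re

def answer_in_context(expected_answer: str, retrieved_chunks: list[str]) -> bool:
    """Proxy retrieval check: does any retrieved chunk contain the expected answer?

    Substring matching after whitespace/case normalization -- crude, but it
    catches the failure mode that matters most: the answer never reached
    the model.
    """
    def normalize(text: str) -> str:
        return re.sub(r"\s+", " ", text.lower()).strip()

    target = normalize(expected_answer)
    return any(target in normalize(chunk) for chunk in retrieved_chunks)
```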
Layer 2: Generation quality
Given the retrieved context, did the model behave correctly?
Core generation metrics:
- Faithfulness / groundedness: are claims supported by the provided context?
- Answer relevancy: does it answer the question asked?
- Completeness: does it cover the required parts (not a partial answer)?
- Format correctness: does it match your required schema/style?
If you only track one thing for RAG generation, track faithfulness. A fluent hallucination is worse than a refusal.
Layer 3: End-to-end product quality
The only metric users care about is: did I get what I needed?
Common product-level signals:
- task completion rate
- "needs follow-up" rate (users asking again because the answer wasn't usable)
- escalation-to-human rate (support cases / handoffs)
- explicit feedback (thumbs up/down)
These are imperfect, but they're real. Track them over time.
A Useful Baseline: "Answerability"
Before you even score answers, label the query:
Is this answerable from our knowledge base?
Many "hallucination" issues are actually answerability issues:
- the corpus doesn't contain the answer
- the answer exists, but not in retrievable form (bad extraction/chunking)
- the user asked for something outside the system's scope
Adding an explicit answerability classifier (human or LLM) makes your evaluation cleaner:
- you don't punish the system for missing information it didn't have
- you separate "not in corpus" from "retrieval failure"
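Once you have the answerability label plus the "answer present in context?" check from Layer 1, attributing a failure to the right layer becomes a small lookup. A sketch with illustrative category names:

```python
def classify_failure(answerable: bool, answer_in_context: bool, answer_correct: bool) -> str:
    """Attribute a result to the layer responsible for it."""
    if not answerable:
        return "out_of_corpus"       # don't blame retrieval or the model
    if not answer_in_context:
        return "retrieval_failure"   # the corpus has it; retrieval missed it
    if not answer_correct:
        return "generation_failure"  # the context had it; the model fumbled it
    return "success"
```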
RAGAS (A Practical Library, Not Magic)
If you want a standard starting point for RAG eval, RAGAS is widely used as an open-source library.
It evaluates dimensions like:
| Metric | What it tries to measure |
|---|---|
| Context Precision | Are retrieved chunks relevant to the question? |
| Context Recall | Do retrieved chunks contain the information needed to answer? |
| Faithfulness | Are answer statements supported by retrieved context? |
| Answer Relevancy | Does the answer address the user's question? |
RAGAS uses an LLM to judge these metrics, which makes it scalable - but it's still an approximation.
Use it as instrumentation, not as truth.
LLM-as-Judge (How to Use It Without Lying to Yourself)
LLM-as-judge works best when you treat it like a measurement device:
- define a rubric
- keep prompts stable
- compare variants consistently
- periodically calibrate against humans
Two evaluation styles:
1) Absolute scoring
"Give this answer a score from 0-1 on faithfulness."
Good for dashboards, but sensitive to prompt phrasing.
2) Pairwise comparison
"Which answer is better, A or B, and why?"
Often more reliable than absolute scores (and closer to how humans evaluate).
A minimal faithfulness rubric you can reuse:
Faithfulness rubric
* 1.0: every factual claim is supported by the provided context
* 0.5: some claims supported, some unsupported / inferred beyond context
* 0.0: key claims unsupported or contradict the context
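The rubric above can be embedded directly in a judge prompt. A minimal sketch (the wording and function name are illustrative; the point is to keep the template stable across runs so scores stay comparable):

```python
FAITHFULNESS_RUBRIC = """\
* 1.0: every factual claim is supported by the provided context
* 0.5: some claims supported, some unsupported / inferred beyond context
* 0.0: key claims unsupported or contradict the context"""

def build_faithfulness_prompt(question: str, context: str, answer: str) -> str:
    """Render a stable judge prompt. Note that the judge sees only what
    the answering model saw -- no hidden ground truth."""
    return (
        "You are grading an answer for faithfulness to the provided context.\n"
        f"Rubric:\n{FAITHFULNESS_RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Reply with exactly one score: 0.0, 0.5, or 1.0."
    )
```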
Important: avoid letting the judge see extra information that the answering model didn't have (like hidden ground truth). Otherwise you measure the judge's ability, not your system.
Human Evaluation (Still Necessary)
You need humans when:
- stakes are high (medical, legal, finance, policy)
- domain expertise matters
- you care about tone, trust, or safety
- you're validating whether LLM-judge scores correlate with real quality
Make human eval efficient:
- sample failures and edge cases first
- use explicit rubrics
- measure inter-annotator agreement (even informally)
A good pattern is: automated coverage + human calibration.
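Even informal agreement is cheap to compute. A sketch of raw percent agreement and Cohen's kappa (agreement corrected for chance) for two annotators labeling the same items:

```python
from collections import Counter

def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of items both annotators labeled identically."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement corrected for chance: 1.0 = perfect, 0.0 = chance level."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled at random with their
    # observed label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```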
Building an Evaluation Set (Minimum Viable)
If you have nothing, build this:
- A small set of real queries your users will actually ask (start with a few dozen; grow over time)
- For each query:
- answerable? (yes/no)
- if answerable, what would a "good answer" contain?
- Include:
- ambiguous queries
- messy natural language
- exact-keyword cases (SKUs, error codes)
- known failure cases from production
Then run it on every meaningful change.
Rule: every failure you discover becomes a new eval item. Your eval set should grow from scars.
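A low-friction way to store this set is a JSONL file, one record per query, appended to whenever production surfaces a new failure. A sketch of the schema (field names are a suggestion, not a standard):

```python
import json

# One eval item per line; failures from production get appended over time.
eval_items = [
    {
        "query": "What is the return window for an opened item?",
        "answerable": True,
        "good_answer_contains": ["30 days", "original receipt"],
        "source": "production_failure",  # vs "seed" or "synthetic"
    },
    {
        "query": "Can you file my taxes for me?",
        "answerable": False,  # out of scope for this corpus
        "good_answer_contains": [],
        "source": "seed",
    },
]

with open("eval_set.jsonl", "w") as f:
    for item in eval_items:
        f.write(json.dumps(item) + "\n")
```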
A Minimal Evaluation Loop You Can Run
- Collect a few dozen real queries.
- Label answerability.
- Run your system and log:
- retrieved chunks (IDs + sources)
- final answer
- Score:
- retrieval: "answer present in context?" (yes/no)
- generation: faithfulness (0/0.5/1)
- end-to-end: "usable?" (yes/no)
- Make one change.
- Re-run and compare.
This is enough to catch regressions and guide improvements.
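The loop reduces to a small script once each run is logged as a list of per-query records. A sketch that aggregates the three layer scores and flags regressions between a baseline run and a candidate run (record field names mirror the scores listed above and are illustrative):

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-query scores into the three layer metrics."""
    n = len(results)
    return {
        "retrieval_hit_rate": sum(r["answer_in_context"] for r in results) / n,
        "faithfulness": sum(r["faithfulness"] for r in results) / n,
        "usable_rate": sum(r["usable"] for r in results) / n,
    }

def find_regressions(baseline: list[dict], candidate: list[dict], tolerance: float = 0.02) -> dict:
    """Return metrics where the candidate dropped beyond tolerance,
    mapped to (baseline_value, candidate_value)."""
    base, cand = summarize(baseline), summarize(candidate)
    return {
        metric: (base[metric], cand[metric])
        for metric in base
        if cand[metric] < base[metric] - tolerance
    }
```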
Common Mistakes
- Evaluating only on happy paths - your system looks great until users arrive
- Mixing retrieval and generation scores - you can't tell what broke
- Optimizing judge scores instead of user outcomes - you get good metrics and bad product
- No baseline - you can't tell whether you improved or only changed
- No continuous loop - regressions ship silently
Key Takeaways
- Evaluation is how you change systems without guessing
- Separate retrieval, generation, and product metrics
- Track answerability to avoid confusing "not in corpus" with "system failure"
- LLM-as-judge scales, but it needs rubrics and human calibration
- Your eval set should grow from real failures, not from clean examples
Key Terms
- Precision@k / Recall@k / MRR: common retrieval metrics
- Faithfulness: whether claims are supported by retrieved context
- Answerability: whether the corpus contains enough information to answer
- LLM-as-judge: using an LLM to score outputs using a rubric
- RAGAS: open-source evaluation library for RAG systems
Further Reading
- RAGAS documentation: https://docs.ragas.io/
- “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena” (Zheng et al., 2023): https://arxiv.org/abs/2306.05685
What's Next
You can now evaluate RAG systems.
But what about systems that take actions?
In the next post, Agents vs Workflows, we'll build the mental model for when autonomy helps vs hurts - and why most production systems are still workflows with LLM steps.