LLM Fundamentals • Part 9

Evaluation for LLM Apps

Your RAG system works. The demo goes well.

Then someone asks:

"How do we know it's actually good?"

If your answer is "we tried a few questions and it looked right," you don't have evaluation. You have hope.

This post is about building the evaluation mindset: what to measure, how to separate failure modes, and how to create feedback loops that prevent regressions.


Evaluation Is a Decision Tool

Evaluation exists for one purpose:

So you can make changes without guessing.

You want to answer questions like:

  • Did retrieval improve or get worse?
  • Did hallucinations drop?
  • Did user outcomes improve after shipping this change?
  • What broke when we changed chunking / reranking / model?

Without measurement, every change is a new gamble.


Offline vs Online Evaluation

There are two different problems:

Offline (batch) evaluation

You run a fixed set of queries against your system and score results.

  • good for regressions
  • good for comparing variants
  • cheap and repeatable
  • still doesn't fully capture real user behavior

Online (production) evaluation

You measure what happens to real users.

  • task completion, satisfaction, escalations
  • A/B tests when you can
  • monitoring for drift over time

Rule: Offline eval helps you ship safer. Online eval tells you whether it mattered.


What to Evaluate

LLM apps usually fail in three layers. Measure each separately.

Layer 1: Retrieval quality

For RAG, a good first question is:

Did we retrieve the right chunks?

Common retrieval metrics:

  • Precision@k: of the top-k retrieved chunks, how many are relevant?
  • Recall@k: of all relevant chunks, how many were retrieved in top-k?
  • MRR: how high is the first relevant chunk ranked?

If you want concrete definitions:

  • Precision@k = (number of relevant chunks in the top k) / k
  • Recall@k = (number of relevant chunks retrieved in the top k) / (total number of relevant chunks)
  • MRR = 1 / (rank of the first relevant result), or 0 if no relevant result is retrieved
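These definitions translate directly into code. A minimal sketch, assuming retrieval results are an ordered list of chunk IDs and relevance labels are a set of chunk IDs (your real IDs and label format will differ):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for c in relevant if c in top_k) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk, or 0 if none retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1 / rank
    return 0.0

# Example: the system returned chunks in this order; c2 and c5 are relevant.
retrieved = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c2", "c5"}
print(precision_at_k(retrieved, relevant, 3))  # 1 of top 3 relevant -> 0.333...
print(recall_at_k(retrieved, relevant, 3))     # 1 of 2 relevant in top 3 -> 0.5
print(mrr(retrieved, relevant))                # first relevant at rank 2 -> 0.5
```

In production you would average these over your whole query set rather than inspect one query at a time.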

These metrics require relevance labels. You can get them in three ways:

  1. Human labeling (highest quality, expensive)
  2. LLM-as-judge labeling (scalable, needs calibration)
  3. Synthetic query generation (good for coverage, can be unrealistic)

A very practical proxy label for early-stage systems:

"Is the answer present in the retrieved context?"

This won't tell you which chunk was relevant, but it catches the failure mode that matters most: retrieval didn't bring the answer to the model.
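A sketch of that proxy check, assuming you have an expected answer string for each eval query (a simplification: plain substring matching misses paraphrases, which is why teams often upgrade this check to an LLM judge later):

```python
import re

def answer_in_context(expected_answer: str, retrieved_chunks: list[str]) -> bool:
    """Crude proxy label: does any retrieved chunk contain the expected answer?

    Normalizes whitespace and case before matching. This catches the
    biggest failure mode - retrieval never surfaced the answer - but it
    cannot detect paraphrased answers.
    """
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s.lower()).strip()

    target = norm(expected_answer)
    return any(target in norm(chunk) for chunk in retrieved_chunks)

chunks = ["Refunds are processed within 5 business days of approval."]
print(answer_in_context("5 business days", chunks))  # True
print(answer_in_context("within 24 hours", chunks))  # False
```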


Layer 2: Generation quality

Given the retrieved context, did the model behave correctly?

Core generation metrics:

  • Faithfulness / groundedness: are claims supported by the provided context?
  • Answer relevancy: does it answer the question asked?
  • Completeness: does it cover the required parts (not a partial answer)?
  • Format correctness: does it match your required schema/style?

If you only track one thing for RAG generation, track faithfulness. A fluent hallucination is worse than a refusal.


Layer 3: End-to-end product quality

The only metric users care about is: did I get what I needed?

Common product-level signals:

  • task completion rate
  • "needs follow-up" rate (users asking again because the answer wasn't usable)
  • escalation-to-human rate (support cases / handoffs)
  • explicit feedback (thumbs up/down)

These are imperfect, but they're real. Track them over time.


A Useful Baseline: "Answerability"

Before you even score answers, label the query:

Is this answerable from our knowledge base?

Many "hallucination" issues are actually answerability issues:

  • the corpus doesn't contain the answer
  • the answer exists, but not in retrievable form (bad extraction/chunking)
  • the user asked for something outside the system's scope

Adding an explicit answerability classifier (human or LLM) makes your evaluation cleaner:

  • you don't punish the system for missing information it didn't have
  • you separate "not in corpus" from "retrieval failure"
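That separation can be encoded as a small triage function over the three labels this post has introduced (the label names here are illustrative, not a standard):

```python
def triage(answerable: bool, answer_in_context: bool, answer_usable: bool) -> str:
    """Attribute a failed query to the right layer, given three labels:
    - answerable: does the corpus contain the answer at all?
    - answer_in_context: did retrieval surface it to the model?
    - answer_usable: did the final answer solve the user's problem?
    """
    if not answerable:
        return "not in corpus"      # don't punish the system for missing data
    if not answer_in_context:
        return "retrieval failure"  # the model never saw the answer
    if not answer_usable:
        return "generation failure" # context was there, the answer wasn't
    return "ok"
```

Counting queries per bucket tells you where to spend effort: corpus coverage, retrieval, or prompting.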

RAGAS (A Practical Library, Not Magic)

If you want a standard starting point for RAG eval, RAGAS is a widely used open-source library.

It evaluates dimensions like:

  • Context Precision: are retrieved chunks relevant to the question?
  • Context Recall: do retrieved chunks contain the information needed to answer?
  • Faithfulness: are answer statements supported by retrieved context?
  • Answer Relevancy: does the answer address the user's question?

RAGAS uses an LLM to judge these metrics, which makes it scalable - but it's still an approximation.

Use it as instrumentation, not as truth.


LLM-as-Judge (How to Use It Without Lying to Yourself)

LLM-as-judge works best when you treat it like a measurement device:

  • define a rubric
  • keep prompts stable
  • compare variants consistently
  • periodically calibrate against humans

Two evaluation styles:

1) Absolute scoring

"Give this answer a score from 0-1 on faithfulness."

Good for dashboards, but sensitive to prompt phrasing.

2) Pairwise comparison

"Which answer is better, A or B, and why?"

Often more reliable than absolute scores (and closer to how humans evaluate).

A minimal faithfulness rubric you can reuse:

Faithfulness rubric

* 1.0: every factual claim is supported by the provided context
* 0.5: some claims supported, some unsupported / inferred beyond context
* 0.0: key claims unsupported or contradict the context

Important: avoid letting the judge see extra information that the answering model didn't have (like hidden ground truth). Otherwise you measure the judge's ability, not your system.
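One way to wire that rubric into a judge, sketched without any particular LLM client (the model call itself is omitted; only prompt construction and score parsing are shown, and snapping raw outputs to the rubric's three levels is one design choice, not the only one):

```python
FAITHFULNESS_RUBRIC = """\
Score the ANSWER against the CONTEXT only. Ignore outside knowledge.
1.0 - every factual claim is supported by the provided context
0.5 - some claims supported, some unsupported or inferred beyond context
0.0 - key claims unsupported or contradicting the context
Reply with just the score."""

def build_judge_prompt(question: str, context: str, answer: str) -> str:
    # The judge sees only what the answering model saw - no hidden ground
    # truth - so we measure the system, not the judge.
    return (
        f"{FAITHFULNESS_RUBRIC}\n\n"
        f"QUESTION:\n{question}\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"ANSWER:\n{answer}\n\nSCORE:"
    )

def parse_score(raw: str) -> float:
    """Snap whatever the judge returns to the rubric's three levels."""
    try:
        value = float(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # unparseable output counts as a failure, not a pass
    return min((0.0, 0.5, 1.0), key=lambda level: abs(level - value))
```

Keeping the rubric in one constant and the parsing deterministic makes judge runs comparable across variants, which is the whole point.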


Human Evaluation (Still Necessary)

You need humans when:

  • stakes are high (medical, legal, finance, policy)
  • domain expertise matters
  • you care about tone, trust, or safety
  • you're validating whether LLM-judge scores correlate with real quality

Make human eval efficient:

  • sample failures and edge cases first
  • use explicit rubrics
  • measure inter-annotator agreement (even informally)

A good pattern is: automated coverage + human calibration.


Building an Evaluation Set (Minimum Viable)

If you have nothing, build this:

  1. A small set of real queries your users will actually ask (start with a few dozen; grow over time)
  2. For each query:
    • answerable? (yes/no)
    • if answerable, what would a "good answer" contain?
  3. Include:
    • ambiguous queries
    • messy natural language
    • exact-keyword cases (SKUs, error codes)
    • known failure cases from production

Then run it on every meaningful change.

Rule: every failure you discover becomes a new eval item. Your eval set should grow from scars.


A Minimal Evaluation Loop You Can Run

  1. Collect a few dozen real queries.
  2. Label answerability.
  3. Run your system and log:
    • retrieved chunks (IDs + sources)
    • final answer
  4. Score:
    • retrieval: "answer present in context?" (yes/no)
    • generation: faithfulness (0/0.5/1)
    • end-to-end: "usable?" (yes/no)
  5. Make one change.
  6. Re-run and compare.

This is enough to catch regressions and guide improvements.
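The loop above can be sketched as a tiny comparison harness. The record fields mirror the three scores in step 4; the names and the 2% tolerance are illustrative choices, not a standard:

```python
def summarize(run: list[dict]) -> dict:
    """Aggregate per-query labels from one eval run into headline rates."""
    n = len(run)
    return {
        "answer_in_context": sum(r["answer_in_context"] for r in run) / n,
        "faithfulness": sum(r["faithfulness"] for r in run) / n,
        "usable": sum(r["usable"] for r in run) / n,
    }

def regressions(before: list[dict], after: list[dict],
                tolerance: float = 0.02) -> list[str]:
    """Names of metrics that dropped by more than `tolerance` between runs."""
    b, a = summarize(before), summarize(after)
    return [metric for metric in b if a[metric] < b[metric] - tolerance]
```

Run it before and after each change; an empty `regressions` list is your signal that the change is at least not making things worse on the eval set.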


Common Mistakes

  • Evaluating only on happy paths - your system looks great until users arrive
  • Mixing retrieval and generation scores - you can't tell what broke
  • Optimizing judge scores instead of user outcomes - you get good metrics and bad product
  • No baseline - you can't tell whether you improved or only changed
  • No continuous loop - regressions ship silently

Key Takeaways

  1. Evaluation is how you change systems without guessing
  2. Separate retrieval, generation, and product metrics
  3. Track answerability to avoid confusing "not in corpus" with "system failure"
  4. LLM-as-judge scales, but it needs rubrics and human calibration
  5. Your eval set should grow from real failures, not from clean examples

Key Terms

  • Precision@k / Recall@k / MRR: common retrieval metrics
  • Faithfulness: whether claims are supported by retrieved context
  • Answerability: whether the corpus contains enough information to answer
  • LLM-as-judge: using an LLM to score outputs using a rubric
  • RAGAS: open-source evaluation library for RAG systems

What's Next

You can now evaluate RAG systems.

But what about systems that take actions?

In the next post Agents vs Workflows, we'll build the mental model for when autonomy helps vs hurts - and why most production systems are still workflows with LLM steps.
