“It feels better” is not an evaluation strategy.
As LLM applications move into production, the gap between teams who can iterate confidently and teams who are shooting in the dark increasingly comes down to evaluation infrastructure. This is what I’ve learned building eval systems at two different organisations.
The three layers of LLM evaluation
Good evaluation infrastructure covers three distinct concerns:
- Unit evals — fast, cheap, run on every commit; test specific behaviours
- Regression evals — compare new model/prompt versions against a baseline
- Production monitoring — detect drift, catch failures at scale
Most teams stop at the first layer. Regression evals and production monitoring are where the real leverage is.
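As a concrete sketch of the second layer: a regression gate can be as simple as comparing a candidate's aggregate scores against a stored baseline and failing the run when any metric drops past a tolerance. The function signature and metric names below are illustrative, not a standard API:

```python
# Hypothetical layer-two regression gate. Compares a candidate's aggregate
# eval scores against a stored baseline and reports any metric that
# regressed by more than the allowed tolerance.
def regression_check(baseline: dict, candidate: dict, max_drop: float = 0.02) -> dict:
    """Return {metric: drop} for every metric that fell past tolerance."""
    drops = {
        metric: round(baseline[metric] - candidate.get(metric, 0.0), 3)
        for metric in baseline
    }
    return {metric: drop for metric, drop in drops.items() if drop > max_drop}

baseline = {"helpfulness": 0.84, "faithfulness": 0.91}
candidate = {"helpfulness": 0.85, "faithfulness": 0.87}
failures = regression_check(baseline, candidate)  # flags the faithfulness drop
```

In CI, a non-empty result blocks the deploy; the baseline dict is exactly the kind of artifact the storage layer discussed later exists to hold.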
Building a judge pipeline
For subjective quality dimensions (coherence, tone, helpfulness), human raters don’t scale. LLM-as-judge has become the standard approach. Key lessons:
Chain-of-thought improves consistency. Ask the judge model to reason before scoring:
judge_prompt = """
Evaluate the following response on helpfulness (1-5).
First, explain your reasoning in 2-3 sentences.
Then output: SCORE: <number>
Response to evaluate:
{response}
"""
Calibrate against human labels. Before trusting automated scores, validate them against a small set of human-labelled examples. A judge with 0.7 Spearman correlation to humans is much more useful than one you haven’t validated.
Infrastructure choices
I’ve converged on this stack for eval infrastructure:
- LangSmith or Weights & Biases Weave for experiment tracking
- pytest for unit evals (fast feedback loop)
- Postgres + a thin API for storing eval results over time
- Grafana for production monitoring dashboards
The key insight: eval results are data. Treat them like any other data asset — store them, version them, query them.
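As a minimal sketch of that idea: one table of per-run scores is enough to query quality over time. Here sqlite stands in for Postgres, and the schema and column names are illustrative, not a recommendation:

```python
import sqlite3
import datetime

# In-memory sqlite as a stand-in for the Postgres store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eval_results (
        run_id        TEXT,
        metric        TEXT,
        score         REAL,
        model_version TEXT,
        created_at    TEXT
    )
""")
conn.execute(
    "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?)",
    ("run-001", "helpfulness", 0.84, "v2", datetime.date.today().isoformat()),
)

# Query quality over time, e.g. average helpfulness per model version.
rows = conn.execute(
    "SELECT model_version, AVG(score) FROM eval_results "
    "WHERE metric = 'helpfulness' GROUP BY model_version"
).fetchall()
```

Once results live in a table like this, regression baselines, dashboards, and ad-hoc "did v3 actually help?" questions all become queries instead of spelunking through log files.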
The hardest part: deciding what to measure
Technical infrastructure is the easy part. The hard part is deciding which metrics actually matter for your specific application. Invest time defining what “good” looks like before building any infrastructure.