EO / portfolio
Open to co-lab

Building LLM Evaluation Infrastructure That Scales

“It feels better” is not an evaluation strategy.

As LLM applications move into production, the gap between teams who can iterate confidently and teams who are shooting in the dark increasingly comes down to evaluation infrastructure. This is what I’ve learned building eval systems at two different organisations.

The three layers of LLM evaluation

Good evaluation infrastructure covers three distinct concerns:

  1. Unit evals — fast, cheap, run on every commit; test specific behaviours
  2. Regression evals — compare new model/prompt versions against a baseline
  3. Production monitoring — detect drift, catch failures at scale

Most teams focus only on layer 1. Layers 2 and 3 are where the real leverage is.

Building a judge pipeline

For subjective quality dimensions (coherence, tone, helpfulness), human raters don’t scale. LLM-as-judge has become the standard approach. Key lessons:

Chain-of-thought improves consistency. Ask the judge model to reason before scoring:

judge_prompt = """
Evaluate the following response on helpfulness (1-5).

First, explain your reasoning in 2-3 sentences.
Then output: SCORE: <number>

Response to evaluate:
{response}
"""

Calibrate against human labels. Before trusting automated scores, validate them against a small set of human-labelled examples. A judge with 0.7 Spearman correlation to humans is much more useful than one you haven’t validated.

Infrastructure choices

I’ve converged on this stack for eval infrastructure:

The key insight: eval results are data. Treat them like any other data asset — store them, version them, query them.

The hardest part: deciding what to measure

Technical infrastructure is the easy part. The hard part is deciding which metrics actually matter for your specific application. Invest time defining what “good” looks like before building any infrastructure.