[ llmop ]

What we mean by LLM evaluation in production

How to set up evaluation for an LLM feature so the team can ship changes with confidence, including what to test, what to ignore, and what the operational cadence looks like.

Evaluation is the part of LLM engineering that gets the most lip service and the least real work. Every team agrees that evaluation matters. Most teams do not actually have a working evaluation harness for the LLM features they are shipping, beyond a small set of examples that one engineer remembers to check by hand before pushing prompt changes.

The reason for the gap is partly that evaluation in this context is genuinely harder than evaluation for traditional software. The output is unstructured text. The notion of correctness is fuzzy. The model that the evaluation runs against changes underneath the team. None of these are excuses, but they explain the inertia.

This is the working version of what evaluation actually looks like for an LLM feature in production, and what is worth investing in versus what is busywork.

The test set is the first piece

The simplest piece of an evaluation harness is a set of inputs paired with what the system should produce for those inputs. The set should cover the happy path, the common edge cases, and the long tail of weird inputs that production has surfaced.

The test set has to come from somewhere. The right approach is to seed it from real production traffic. Take a representative sample of what users are actually sending the system. Annotate each one with what the right output would be. Hold the annotated sample as the test set.
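In concrete terms, a test case does not need to be more than a record pairing the captured input with the annotated expected output. The field names and file layout below are a sketch, not a standard; anything stable and appendable works.

```python
# A minimal test-case record, one JSON object per line in a JSONL file.
# Field names and the file name are illustrative.
import json

case = {
    "id": "case-0042",                       # stable identifier for the example
    "input": "refund request, order #1871",  # captured from production traffic
    "expected": {"intent": "refund", "order_id": "1871"},  # annotated by a human
    "tags": ["happy-path", "orders"],        # used later for failure rate by category
}

with open("eval_cases.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(case) + "\n")
```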

The wrong approach is to build the test set from inputs the team imagines users will send. The imagined inputs are systematically different from real inputs, and they miss the long tail that produces most of the failures.

A useful starting size is a few hundred annotated examples. Beyond that, returns diminish quickly for most teams. The first hundred examples catch the high-volume failure modes. The next few hundred catch progressively rarer cases. After that, additional examples are mostly redundant.

The test set should grow over time. Every time production produces a failure, an example of that failure should be added to the test set. The annotation captures what the right output would have been. The team that does this consistently produces a regression test for every customer issue that has ever happened. New changes to the system are checked against the full history of what the system has gotten wrong. That is the evaluation discipline that turns LLM engineering into something tractable.
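A minimal sketch of that discipline, assuming the JSONL layout above and whatever failure record the team's existing logging already produces:

```python
# Sketch: turn a production failure into a regression case.
# The `failure` record and its fields are placeholders for the team's own logs.
import json
from datetime import date

def add_regression_case(failure, corrected_output, path="eval_cases.jsonl"):
    case = {
        "id": f"prod-{failure['request_id']}",
        "input": failure["input"],            # the exact input that failed
        "expected": corrected_output,         # what the right output would have been
        "tags": ["regression", failure.get("category", "uncategorized")],
        "added": date.today().isoformat(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
```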

The metrics that matter

The output of an LLM feature is rarely identical to a fixed expected string. The evaluation has to score the actual output against the expected output in a way that captures what the team cares about.

The right scoring approach depends on the feature.

For features that produce structured output (JSON, function calls, classifications), the scoring is mostly mechanical. The structure is correct or it is not. The values are correct or they are not. Standard tooling handles this case well, and the noise in the score is low.
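A minimal sketch of the mechanical case, assuming the expected output is a flat dictionary of fields:

```python
# Sketch: mechanical scoring for structured output. The output either parses
# and matches the expected fields, or it does not.
import json

def score_structured(raw_output: str, expected: dict) -> dict:
    try:
        actual = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"pass": False, "reason": "invalid JSON"}
    missing = [k for k in expected if k not in actual]
    wrong = [k for k in expected if k in actual and actual[k] != expected[k]]
    return {
        "pass": not missing and not wrong,
        "missing_fields": missing,
        "wrong_values": wrong,
    }
```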

For features that produce free text, the scoring is harder. The team has a few options: a regex or keyword match on the response, looking for the presence or absence of specific phrases; a semantic similarity score against the expected response; or a model-based judge that scores the response against criteria. Each of these is imperfect. A combination is more reliable than any single one.
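A sketch of combining the first two for free text. The embedding function is a placeholder for whatever endpoint the team already uses, the must_include and must_exclude fields are illustrative additions to the test-case record, and any threshold on the similarity score would need tuning against labeled examples.

```python
# Sketch: keyword check plus semantic-similarity check for free-text output.
# `embed` is a placeholder for the team's own embedding call.
import re

def keyword_score(output: str, must_include: list[str], must_exclude: list[str]) -> bool:
    ok_in = all(re.search(p, output, re.IGNORECASE) for p in must_include)
    ok_out = not any(re.search(p, output, re.IGNORECASE) for p in must_exclude)
    return ok_in and ok_out

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def score_free_text(output: str, case: dict, embed) -> dict:
    return {
        "keywords_pass": keyword_score(output, case["must_include"], case["must_exclude"]),
        "similarity": cosine(embed(output), embed(case["expected"])),
    }
```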

For features that produce text with a target style, the scoring usually requires a model-based judge. The judge is a separate model call that evaluates whether the response meets the criteria. The judge is itself fallible and benefits from its own evaluation. This is where evaluation work compounds: the team that has a good evaluation harness for its main feature can build a good evaluation harness for its evaluation judge faster.
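A sketch of a judge call. The call_model function stands in for the team's own client, and the prompt and verdict parsing are illustrative.

```python
# Sketch: a model-based judge that returns a pass/fail verdict with a reason.
import json

JUDGE_PROMPT = """You are grading a response against criteria.
Criteria: {criteria}
Response: {response}
Answer with JSON: {{"pass": true/false, "reason": "..."}}"""

def judge(response: str, criteria: str, call_model) -> dict:
    raw = call_model(JUDGE_PROMPT.format(criteria=criteria, response=response))
    try:
        verdict = json.loads(raw)
        return {"pass": bool(verdict.get("pass")), "reason": verdict.get("reason", "")}
    except json.JSONDecodeError:
        # An unparseable judge output counts as an unscored case, not a failure.
        return {"pass": None, "reason": "judge output not parseable"}
```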

The metrics to track over time include the percent of test cases passing, the failure rate by category of test case, the latency of each call, the cost of each call, and the count of cases the harness could not score (judge errors, timeouts). The metrics get logged every time the harness runs, with the model version and prompt version that produced them.
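A sketch of the run record, with illustrative field names. The point is that the model version and prompt version are stamped on every run, so a shift in the numbers can be traced to a specific change.

```python
# Sketch: the record logged after each harness run. Assumes a non-empty
# list of per-case results with pass/latency/cost fields as sketched above.
import json, time

def log_run(results, model_version, prompt_version, path="harness_runs.jsonl"):
    scored = [r for r in results if r["pass"] is not None]
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "prompt_version": prompt_version,
        "pass_rate": sum(r["pass"] for r in scored) / len(scored) if scored else None,
        "unscored": len(results) - len(scored),
        "p95_latency_ms": sorted(r["latency_ms"] for r in results)[int(0.95 * len(results))],
        "total_cost_usd": sum(r["cost_usd"] for r in results),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```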

Run the harness on every change

The harness has to run reliably and quickly enough that engineers actually run it. A harness that takes thirty minutes to run is a harness that nobody runs before pushing changes.

A reasonable target is that the harness runs in under five minutes against the full test set, with parallel execution against the model. Most teams hit this with simple parallelism and reasonable rate limits.
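A sketch of the parallel loop. The run_case function stands in for one model call plus scoring, and the concurrency cap is illustrative; it should sit under the team's actual rate limit.

```python
# Sketch: run the full test set in parallel under a concurrency cap.
from concurrent.futures import ThreadPoolExecutor

def run_harness(cases, run_case, max_workers=16):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_case, cases))
```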

The harness should run on every change to the prompts, the parsers, the retrieval templates, and the model selection. It should run as part of CI, with results visible on the pull request. The team should treat a regression in the harness the same way they treat a regression in unit tests. Failing harness runs are blocking, with explicit override required to merge.
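A sketch of the blocking check, assuming a stored baseline file and an illustrative one-point tolerance. A nonzero exit is what makes the run blocking in CI.

```python
# Sketch: fail the build when the pass rate drops against a stored baseline.
import json, sys

def ci_gate(results, baseline_path="harness_baseline.json", tolerance=0.01):
    scored = [r for r in results if r["pass"] is not None]   # assumes at least one scored case
    pass_rate = sum(r["pass"] for r in scored) / len(scored)
    with open(baseline_path) as f:
        baseline = json.load(f)["pass_rate"]
    if pass_rate < baseline - tolerance:
        print(f"Harness regression: {pass_rate:.3f} vs baseline {baseline:.3f}")
        sys.exit(1)   # blocks the merge; an override means changing the baseline explicitly
    print(f"Harness pass rate {pass_rate:.3f} (baseline {baseline:.3f})")
```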

The harness should also run on a schedule against production. The model vendor changes the model's behavior underneath the team. A weekly run against production traffic catches drift before customers do. Some teams run a small subset of the harness more frequently, every few hours, against a canary route in production.
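A sketch of picking that canary subset. Hashing the case id keeps the subset stable from run to run; the five percent fraction is illustrative.

```python
# Sketch: a small, stable canary subset selected by hashing the case id.
import hashlib

def canary_subset(cases, fraction=0.05):
    def bucket(case_id: str) -> float:
        h = hashlib.sha256(case_id.encode()).hexdigest()
        return int(h[:8], 16) / 0xFFFFFFFF
    return [c for c in cases if bucket(c["id"]) < fraction]
```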

The evaluation that humans do

Automated evaluation catches a lot. It does not catch everything. Some failure modes only show up when a person reads the output and notices that something is off.

A useful pattern is to schedule a regular human review of a small sample of production output. Once a week, the team takes fifty or a hundred recent production responses and reads them. The reviewers note any responses that look wrong, and those become new test cases.
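A sketch of pulling the weekly sample, assuming a placeholder fetch_last_week query against the team's own logs. The output is a spreadsheet-friendly file with an empty column for reviewer notes.

```python
# Sketch: export a weekly sample of production responses for human review.
import csv, random

def export_review_sheet(fetch_last_week, path="weekly_review.csv", n=100):
    responses = fetch_last_week()                 # [{"input": ..., "output": ...}, ...]
    sample = random.sample(responses, min(n, len(responses)))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["input", "output", "reviewer_note"])
        writer.writeheader()
        for r in sample:
            writer.writerow({"input": r["input"], "output": r["output"], "reviewer_note": ""})
```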

This sounds expensive but is usually less than two hours of someone's time. The cases the human review surfaces tend to be subtle and high-leverage. The team that does this consistently catches drift and emerging failure modes weeks earlier than the team that relies entirely on automation.

The human review is also a check on the automated harness. If the human reviewers consistently flag responses that the harness rated as passing, the harness is not measuring what the team cares about. That is itself a finding.

When the harness lies

Every evaluation harness eventually produces a misleading result. The team ships a change, the harness says it is better, and production tells a different story. This is normal, and the right response is to investigate the harness rather than to abandon it.

The common reasons the harness lies are predictable. The test set is not representative of production traffic. The metrics do not capture what users actually care about. The model used by the judge has its own biases that affect the score. The cases that matter most have a sample size too small for the metric to detect changes.

Each of these is correctable. The team that takes harness disagreements seriously and updates the harness in response tends to converge to a harness that, after a few quarters, produces results that track production well. The team that argues with the harness tends to stop trusting it, and the harness gradually becomes a check the team performs ritualistically without actually using its output.

The cadence that holds up

A reasonable evaluation cadence for a production LLM feature looks something like this. The harness runs in CI on every change. The harness runs on a schedule against production once a week. A small canary subset runs every few hours. A human review of a sample of production output happens weekly. The test set grows by the new failure cases that production surfaces.

This is not a heavy investment. The first version takes a few weeks to set up. The maintenance is on the order of a few hours per week, distributed across the team. The benefit is the ability to ship LLM changes with the same confidence the team has when shipping changes to the rest of the system.

That confidence is the actual goal. The harness is not a research artifact. It is the operational tool that lets engineering treat LLM features as engineering, not as a continuous demo. The teams that have it ship faster than the teams that do not, with fewer customer-visible failures and more confident model upgrades. The work to build it is small enough that the gap between the two is mostly a function of when the team decided to start.