Your eval suite is grading the wrong exam. We say this on roughly every other engagement, and we are still surprised by how often it lands as news. The model passes the evals. The application breaks in production. Both things are true at once, because the eval suite measures the model and the user is not interacting with the model. They are interacting with the system.
Production AI evals are not the same thing as model evals, and the gap is where most teams lose their first quarter.
What most production AI evals are actually measuring
Walk into any team shipping an LLM feature and ask to see their eval suite. You will, nine times out of ten, see a notebook or a script that takes a fixed list of prompts, runs the current model against them, and scores the outputs on accuracy, faithfulness, or a similar single-axis metric. This is a model eval. It is useful, and it is necessary. It is not sufficient.
The reason it is not sufficient is that the model is one component in a pipeline that includes prompt construction, retrieval, tool calls, post-processing, and rendering. Each of those steps can fail on its own. The model eval will not catch a retrieval miss that returned the wrong document. It will not catch a prompt template that started silently truncating after the indexer rolled to a new version. It will not catch a tool that started returning a different shape after the upstream API was deprecated. The model eval is blind to all of these, and they are the failure modes that actually show up in production.
The other thing the model eval misses is that the user is not asking the questions in your eval set. They are asking adjacent questions, in their own words, with their own context, and the long-tail distribution of those questions is where your application either holds together or does not.
What production AI evals need to be
Three layers. We have not seen a production system hold up that did not have all three.
The first is the model eval, the one most teams already have. Run a fixed evaluation set against the current model and the alternative models you might switch to. This catches model regressions. It tells you nothing about your application.
The second is the system eval. Run end-to-end traffic against the full pipeline, including retrieval, tool calls, the whole thing, and score the final output. This is where you catch prompt template bugs, retrieval misses, post-processing errors, and the long tail of integration failures. Most teams skip this layer because it is harder to build. The teams that ship without it usually ship and then have a bad quarter.
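A minimal sketch of what a system eval can look like, assuming a toy pipeline. Every name here (`Case`, `run_system_eval`, `retrieve`, `fake_model`, the `DOCS` store, and the substring check used for scoring) is a hypothetical stand-in, not a real library API. The point is only that the harness calls the whole pipeline and scores the final output, so a retrieval miss shows up as a failed case.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    query: str
    expected_substring: str  # hypothetical scoring rule: output must contain this


def run_system_eval(pipeline: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through the full pipeline and return the pass rate."""
    if not cases:
        raise ValueError("system eval needs at least one case")
    passed = 0
    for case in cases:
        try:
            output = pipeline(case.query)
        except Exception:
            # A crash in any stage counts as a failed case, not a skipped one.
            continue
        if case.expected_substring.lower() in output.lower():
            passed += 1
    return passed / len(cases)


# --- Toy pipeline (all stand-ins for real retrieval, prompting, and model calls) ---
DOCS = {
    "refund": "Refunds are issued within 14 days.",
    "shipping": "Orders ship in 2 business days.",
}


def retrieve(query: str) -> str:
    """Naive keyword retrieval; returns an empty string on a miss."""
    for key, doc in DOCS.items():
        if key in query.lower():
            return doc
    return ""


def fake_model(prompt: str) -> str:
    """Stand-in for a model call: answers with whatever context it was given."""
    context = prompt.split("Context: ", 1)[1].split("\n", 1)[0]
    return context or "I don't know."


def pipeline(query: str) -> str:
    context = retrieve(query)
    prompt = f"Context: {context}\nQuestion: {query}"
    return fake_model(prompt).strip()


cases = [
    Case("How long do refunds take?", "14 days"),
    # Phrased the way a user might phrase it; the keyword retriever misses it.
    Case("When will my order ship?", "2 business days"),
]
pass_rate = run_system_eval(pipeline, cases)
```

In this sketch the second case fails even though the stand-in model does exactly what it is told: the failure is in retrieval, which a model-only eval would never exercise.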
The third is the behavioural eval. This is the layer almost nobody runs. Sample real production traffic, replay it through the system, and have an LLM-as-judge or a human rater score the results. What this catches that the other two miss is drift, both in the input distribution and in the model’s response to it. Production AI evals without this layer can pass for months while the application quietly stops being useful.
What to do on Monday
If you do not have a system eval, build it before anything else. Pick a representative slice of production traffic, anonymise it, store it as a fixture, and run the full pipeline against it on every deploy. This is the highest-leverage hour of work the team can do this week. We have seen system evals catch bugs that the model eval missed for six weeks, including one where a vector index was returning embeddings from the previous training run because nobody invalidated the cache after the new index was built.
If you have a system eval but not a behavioural eval, add the behavioural layer next. The cost of running a behavioural eval is the cost of the LLM-as-judge calls, which is real but not absurd. Sample at one percent of production traffic to start. Increase the sample as you find the budget.
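One way to take the one percent sample is to hash a request ID rather than roll a random number, so the same request is always in or out of the sample across replays. The sketch below assumes each request carries a stable string ID. `in_sample`, `behavioural_eval`, and the `judge` callable (standing in for an LLM-as-judge call or a human rating, returning a score between 0 and 1) are hypothetical names for illustration.

```python
import hashlib
from typing import Callable, Iterable, Optional


def in_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def behavioural_eval(
    traffic: Iterable[tuple[str, str]],   # (request_id, user_input) pairs
    pipeline: Callable[[str], str],       # the full system under test
    judge: Callable[[str, str], float],   # hypothetical judge: score in [0, 1]
    rate: float = 0.01,
) -> Optional[float]:
    """Replay the sampled slice of traffic and return the mean judge score.

    Returns None when nothing fell into the sample.
    """
    scores = [
        judge(user_input, pipeline(user_input))
        for request_id, user_input in traffic
        if in_sample(request_id, rate)
    ]
    return sum(scores) / len(scores) if scores else None
```

Raising the sample later is a one-argument change (`rate=0.05`), and because a request's hash bucket does not change, every request in the one percent sample is still in the larger one, so earlier scores stay comparable.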
If you already have all three layers, the next question is whether the evals are actually gating deploys. We have walked into teams that had beautiful eval suites and shipped past failing scores anyway because the eval was advisory. An eval that does not block a deploy is a dashboard.
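The difference between a gate and a dashboard can be as small as an exit code. A minimal sketch, assuming the deploy pipeline treats a non-zero exit status as a failed step. The layer names and threshold values are hypothetical placeholders, not recommendations.

```python
import sys

# Hypothetical minimum scores per eval layer; tune these for your own system.
THRESHOLDS = {"model": 0.90, "system": 0.85, "behavioural": 0.80}


def failing_layers(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the layers whose score is missing or below its threshold."""
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, float("-inf")) < minimum
    ]


def gate(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> int:
    """Return a process exit code: 0 lets the deploy proceed, 1 blocks it."""
    failed = failing_layers(scores, thresholds)
    for name in failed:
        print(
            f"BLOCKED: {name} eval scored {scores.get(name)} "
            f"(minimum {thresholds[name]})",
            file=sys.stderr,
        )
    return 1 if failed else 0


# In a CI step, the result would be passed to sys.exit, for example:
#   sys.exit(gate({"model": 0.95, "system": 0.70, "behavioural": 0.88}))
```

A missing score is treated as a failure on purpose: an eval that silently did not run should block the deploy just as a failing one does.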
The blunt version of all of this is that production AI evals are infrastructure work, not data-science work. The team that owns the deploy pipeline should own the evals. We have rarely seen the inverse work for long.
Related: A 2% per-step failure rate becomes 33% failure at 20 steps. Evals are how you catch the per-step failures before they compound.
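The arithmetic behind that figure, as a short sketch. It assumes each step fails independently and that any single failure fails the whole chain, which is a simplification. `chain_failure_rate` is a hypothetical helper name.

```python
def chain_failure_rate(per_step_failure: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps fails."""
    return 1.0 - (1.0 - per_step_failure) ** steps


# 0.98 ** 20 is about 0.668, so roughly a third of 20-step runs hit a failure.
twenty_step_failure = chain_failure_rate(0.02, 20)
```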