The cleanest way we know to kill an agentic system in production is to give it twenty steps and a two percent per-step failure rate. The math closes the door on its own. One in three runs fails, and by the time the team notices, the demo budget has already been spent on the demo.
This is not a model-quality problem. The model in your demo and the model in your incident are usually the same model. What changed is that the pipeline got longer and nobody re-did the arithmetic.
Retries are the wrong reflex
The first thing every team reaches for is retry logic, and we don’t blame them. Retries are cheap to ship, they look like they work, and the failed-step graph gets nicer overnight. The problem is what happens to the rest of the system after the retries land.
We’ve watched the same three things go wrong, in this order. First, an idempotency bug ships: the original call succeeded but its response was dropped, so the retry commits the side effect a second time and a customer somewhere is double-billed. Second, a downstream service that was already in distress gets a thundering herd, because the agent has no way of knowing its tool is timing out for a reason. Third, the inference bill doubles, and finance flags it before the product team is ready to admit the architecture is the problem.
Retries are still useful when they’re bounded, scoped to one step, and gated by an idempotency key. Past that, they are not a fix for the per-step failure rate. They are a way to disguise it for a quarter.
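A minimal sketch of what “bounded, scoped to one step, and gated by an idempotency key” can look like. Every name here (TransientError, call_with_bounded_retry, FakeBilling) is hypothetical and invented for illustration; the fake service simulates the dropped-response case described above. Backoff and jitter are left out to keep the sketch short.

```python
import uuid


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_bounded_retry(step, payload, max_attempts=3):
    """Run one step at most max_attempts times under a single idempotency key."""
    # One key for the whole logical operation, reused on every attempt,
    # so the receiver can recognise a repeat and skip the side effect.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return step(payload, idempotency_key=key)
        except TransientError:
            if attempt == max_attempts:
                raise  # budget spent: surface the failure instead of hiding it


class FakeBilling:
    """Stand-in for a downstream service that deduplicates on the key."""

    def __init__(self):
        self.charges = {}  # idempotency_key -> amount
        self.drop_next_response = True

    def charge(self, amount, idempotency_key):
        # Commit the side effect only the first time a key is seen.
        self.charges.setdefault(idempotency_key, amount)
        if self.drop_next_response:
            self.drop_next_response = False
            raise TransientError("response dropped after commit")
        return {"charged": self.charges[idempotency_key]}


billing = FakeBilling()
result = call_with_bounded_retry(billing.charge, 40)
print(result, len(billing.charges))
```

The first attempt commits the charge and then loses the response. The retry reuses the same key, so the fake service records one charge rather than two. This only works if the receiving service actually deduplicates on the key, which is an assumption about your downstream, not something the retry loop can guarantee.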
The per-step failure rate is the only number that compounds
The whole problem is one line of arithmetic. A pipeline succeeds only if every step succeeds, so the success rate is (1 minus p) to the power n, where p is the per-step failure rate and n is the number of steps. That is the formula. It is not interesting. The interesting part is what it does to your demo.
At p = 2% and n = 5, which is the demo case, you succeed about 90% of the time. The room sees a system that works nine out of ten tries. Heads nod. The slide deck calls it ready. At n = 20, which is the pipeline you actually shipped, you succeed about 67% of the time, which is to say one in three users gets a broken response and a story to tell. The model did not get worse between demo and production. You added fifteen places where the model could be wrong.
The interesting thing about this curve is how nonlinear the win is at the low end. Halving p from 2% to 1% takes a twenty-step pipeline from 67% to 82%. Halving it again to 0.5% gets you to 90%. Each halving of the per-step failure rate roughly doubles the number of steps you can afford at the same end-to-end success rate. This is why the teams that ship reliable agentic systems spend their time grinding on individual step reliability and almost no time on prompt cleverness for the orchestrator. The orchestrator is the easy part.
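The numbers above are easy to reproduce. This sketch assumes, as the formula does, that steps fail independently and all share the same p. The function names are ours, chosen for illustration.

```python
import math


def pipeline_success(p, n):
    """End-to-end success rate for n independent steps, each failing with probability p."""
    return (1.0 - p) ** n


def max_steps(p, target):
    """Largest n for which pipeline_success(p, n) is still at least target."""
    return math.floor(math.log(target) / math.log(1.0 - p))


# The four cases quoted in the text.
for p, n in [(0.02, 5), (0.02, 20), (0.01, 20), (0.005, 20)]:
    print(f"p={p:.3f} n={n:2d} success={pipeline_success(p, n):.0%}")

# How many steps each failure rate buys you at a 90% end-to-end target.
for p in (0.02, 0.01, 0.005):
    print(f"p={p:.3f} steps available at 90%: {max_steps(p, 0.90)}")
```

The second loop is the headroom argument in numbers: at a 90% target, 2% per step buys 5 steps, 1% buys 10, and 0.5% buys 21.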
What we do when we ship one of these
Three moves, in order. None of them are clever; the absence of all three is what we usually find on an engagement.
Measure first. Most teams we walk into cannot tell us the per-step failure rate of any step in their pipeline. They can tell us the end-to-end success rate (sometimes), and a feeling that “step four is flaky” (usually). Run every step against a representative slice of production traffic before integrating it. Pick a target. We use 0.5%, which is aggressive enough to be useful and loose enough to be achievable. Refuse to integrate a step that misses it. The first time a team does this they cut three steps from the pipeline because the per-step cost wasn’t worth the marginal value. Good. That’s the point.
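One way to turn “refuse to integrate a step that misses it” into a check is sketched below. The Wilson score upper bound is our choice for the illustration, not something the process above prescribes, and the function names and the 95% z-value are assumptions. The point is that a 0.5% target cannot be confirmed on a small sample, even a clean one.

```python
import math


def wilson_upper(failures, trials, z=1.96):
    """Upper end of the Wilson score interval for a failure rate."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p_hat = failures / trials
    z2 = z * z
    centre = p_hat + z2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    return (centre + margin) / (1 + z2 / trials)


def passes_gate(failures, trials, target=0.005):
    """Admit a step only if the plausible worst-case failure rate meets the target."""
    return wilson_upper(failures, trials) <= target


# Count partial successes that would corrupt a downstream step as failures.
for failures, trials in [(0, 200), (0, 1000), (5, 1000), (10, 5000)]:
    verdict = "integrate" if passes_gate(failures, trials) else "hold"
    print(f"{failures} failures in {trials} trials -> {verdict}")
```

Under these assumptions, zero failures in 200 trials is still a “hold”, because 200 clean runs cannot rule out a failure rate well above 0.5%. Zero in 1,000 passes, as does ten in 5,000.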
Collapse where you can. A planning step followed by an execution step is two failure points. A single step that emits a plan and the first execution as one call is one. The inference cost is roughly the same. The reliability moves in your favour: provided the merged call fails no more often than either half did, the pair goes from (1 minus p) squared to (1 minus p). This is the single highest-leverage thing we’ve seen teams do, and it is unfashionable because the agent-framework literature loves a graph of fine-grained nodes.
Checkpoint everything irreversible. Pick a workflow engine. We default to Temporal because the SDK ergonomics are good and the failure semantics are well-documented; AWS Step Functions is fine if you’re already deep in AWS and don’t mind the JSON state-machine syntax. The point is not which engine. The point is that an engine, not your Python script, owns the retry budget, persists state between steps, and resumes from the last checkpoint when something downstream fails. A hand-rolled retry loop will eventually do something stupid. The engine will not.
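To make “resumes from the last checkpoint” concrete, here is a toy in-memory sketch of the behaviour. It is not the Temporal or Step Functions API, and it is not a substitute for either: a real engine persists the store durably and owns retries and timeouts. All names are hypothetical.

```python
def run_with_checkpoints(steps, store, state=None):
    """Run named steps in order, skipping any whose result is already in store."""
    for name, fn in steps:
        if name in store:
            state = store[name]  # completed in an earlier run: reuse, do not re-execute
            continue
        state = fn(state)        # may raise; nothing is recorded for a failed step
        store[name] = state      # checkpoint only after the step succeeds
    return state


calls = {"reserve": 0, "charge": 0, "notify": 0}
flaky = {"notify_fails_once": True}


def reserve(_):
    calls["reserve"] += 1
    return {"reservation": "r-1"}


def charge(state):
    calls["charge"] += 1
    return {**state, "charged": True}


def notify(state):
    calls["notify"] += 1
    if flaky["notify_fails_once"]:
        flaky["notify_fails_once"] = False
        raise RuntimeError("downstream unavailable")
    return {**state, "notified": True}


steps = [("reserve", reserve), ("charge", charge), ("notify", notify)]
store = {}  # a dict only for illustration; an engine would persist this

try:
    run_with_checkpoints(steps, store)
except RuntimeError:
    pass  # the first run dies at the last step

final = run_with_checkpoints(steps, store)  # resume from the checkpoints
print(final, calls)
```

On the second run the irreversible charge step is not executed again; only the failed notification is retried. The sketch still has a gap between a step succeeding and its checkpoint being written, which is one reason the irreversible steps need idempotency keys as well.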
Put those three moves together and a 20-step pipeline at 67% becomes a 5-checkpoint pipeline at something north of 95%; five steps at the 0.5% target work out to about 97.5%. The arithmetic hasn’t changed. The architecture is doing the work the model was being asked to do and could not.
The first question we ask now, on any engagement that involves an agentic system, is what the per-step failure rate is. The honest answer is usually “we haven’t measured.” Sometimes the answer is a number, and the number is wrong because the team counted only hard failures and not the partial-success cases that quietly poison downstream steps. Once, last year, the answer came back as 0.3% with a graph. That team was fine. We’re still figuring out how to get more teams to that kind of answer faster than six months in.
Related: Schema drift is a contract failure, not a pipeline failure. The same shift-left logic, applied one layer down.