#failure-handling #pipelines #error-routing

The failure routing nobody talks about

When step 3 fails, most people retry step 3. But what if the problem started at step 2? That's failure routing.


The wrong instinct

When something fails, you retry it. That's the instinct. Server returns a 500, retry the request. Flaky test fails, run it again. It's so deeply wired into how we think about errors that most people never question it.

But AI agent pipelines aren't HTTP requests. When step 3 fails, the problem usually isn't in step 3. The problem is that step 2 produced bad output and step 3 inherited it. Retrying step 3 with the same bad input will produce the same bad result. Every time.

This is the failure routing problem. And almost nobody talks about it because it looks like a retry problem from the outside.

A real example

January 2025. I had a three-step pipeline: pre-check reads the codebase, implement writes the code, validate checks the result. The task was "add a caching layer to the search endpoint."

Validate failed. The agent had implemented caching using a Redis client, but my project didn't have Redis. No Redis dependency, no Redis connection string, no Redis anywhere. The agent invented a dependency that didn't exist.

pre-check  ✓  (read the codebase)
implement  ✓  (wrote code using Redis)
validate   ✗  ("redis package not found in dependencies")

Now here's the question. Where do you retry?

If you retry validate, nothing changes. The code still imports Redis. If you retry implement, the agent might write the same thing because it still thinks Redis is fine. The actual failure point was pre-check. The pre-check read the codebase but didn't flag "no caching infrastructure exists." It should have noted that and constrained the implementation to use in-memory caching or the existing database.

The fix was going back to pre-check with the error: "Implementation used Redis but no Redis dependency exists. Re-evaluate the caching approach given the current stack."

pre-check  ✓  (read the codebase)
implement  ✓  (wrote code using Redis)
validate   ✗  ("redis package not found")
  → route back to pre-check with error
pre-check  ✓  (now flags: "use in-memory cache, no Redis available")
implement  ✓  (writes LRU cache with Map)
validate   ✓

That's failure routing. Not "retry the failed step" but "go back to the step that can actually fix the problem."
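The core idea fits in a few lines. Here's a minimal sketch, with hypothetical step names and a toy routing map (this is an illustration, not Zowl's or any real tool's API):

```python
# Minimal sketch of failure routing: on failure, jump to the step that
# can fix the problem, not the step that reported it. The step names
# and routing rules are hypothetical.

# Map each failing step to the step that should handle its errors.
ROUTES = {
    "validate": "implement",   # most validation failures are code bugs
    "implement": "pre-check",  # a bad implementation is often a planning problem
}

def next_step(failed_step: str) -> str:
    """Return the step to re-run after `failed_step` fails."""
    return ROUTES.get(failed_step, failed_step)  # default: plain retry

print(next_step("validate"))   # implement
print(next_step("pre-check"))  # pre-check (no route, so plain retry)
```

The whole difference between retry and routing is that one lookup: the failing step is the input, not the answer.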

Why this is hard to get right

The tricky part is that the correct routing target changes depending on the error.

Sometimes validate fails because the implementation has a bug. That's a step 2 problem. Route back to implement with the validation error. The agent can fix the bug without re-reading the codebase.

Sometimes validate fails because the implementation approach was wrong. That's a step 1 problem. Route back to pre-check so the agent can reconsider the approach. Going to implement would just produce another flawed implementation based on the same flawed plan.

And sometimes validate fails because the task itself is ambiguous. That's a step 0 problem. No amount of routing will fix a badly written task. The pipeline should flag it for human review and move on.

Error type              → Route to
────────────────────────────────────
Bug in implementation   → implement (with error)
Wrong approach          → pre-check (with error)
Ambiguous task          → skip, flag for review
Flaky/transient error   → retry same step (1-2 times)
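In code, that table is just a lookup keyed on a classified error type. Classifying the raw error into one of these buckets is the genuinely hard part and is elided here; the type names below are illustrative, not a real taxonomy:

```python
# The routing table above as a dict. Classifying a raw error into one
# of these types is the hard part and is skipped here; the names are
# made up for illustration.

ROUTING_TABLE = {
    "implementation_bug": ("implement", "retry with error attached"),
    "wrong_approach":     ("pre-check", "retry with error attached"),
    "ambiguous_task":     (None,        "skip, flag for human review"),
    "transient":          ("validate",  "retry same step, 1-2 times"),
}

def route(error_type: str):
    """Return (target step or None, action) for a classified error."""
    return ROUTING_TABLE[error_type]

print(route("wrong_approach"))
```

A `None` target is the escape hatch: no step can fix an ambiguous task, so it goes to a human.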

In the early days of nightloop.sh, I didn't have any of this. Everything was a simple retry. Three attempts, same step, then fail. I'd wake up to logs showing the same error repeated three times with no variation. It felt like watching someone walk into a glass door, back up, and walk into it again.

"Go to step" with context

The key detail that makes failure routing work is error context forwarding. When you route back to an earlier step, you don't just restart that step. You restart it with the failure information attached.

This matters a lot. Without context, the agent at step 1 has no idea why it's running again. It'll probably do the same thing it did the first time. But if you attach "your approach led to this error at validation: redis package not found," the agent has new information. It can make a different decision.

step: validate
  on_fail:
    context: "Validation failed: {{ error_output }}"
    goto: implement
    max_loops: 3
    escalate_to: pre-check   # after 3 failed implement loops
    final_fallback: skip_task

The {{ error_output }} gets injected into the prompt for the next step. So when implement runs again, it sees something like: "Previous attempt failed validation. Error: redis package not found in dependencies. Rewrite the caching implementation using only packages already in package.json."

That's not a blind retry. That's a targeted fix.
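The injection itself is plain string templating. A sketch, with made-up prompt wording:

```python
# Sketch of error context forwarding: interpolate the failure output
# into the prompt for the step being re-run. The template wording is
# invented for illustration.

RETRY_TEMPLATE = (
    "Previous attempt failed validation.\n"
    "Error: {error_output}\n"
    "Fix the implementation so this error cannot recur."
)

def build_retry_prompt(base_prompt: str, error_output: str) -> str:
    context = RETRY_TEMPLATE.format(error_output=error_output)
    return f"{context}\n\n{base_prompt}"

prompt = build_retry_prompt(
    "Add a caching layer to the search endpoint.",
    "redis package not found in dependencies",
)
print(prompt)
```

The mechanics are trivial; the decision of *which* step's prompt gets the context is the part that matters.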

Cascading failures and the escalation ladder

Here's where it gets interesting. What happens when routing back to implement doesn't fix it?

You need an escalation ladder. Validate fails, go back to implement. If implement fails three times in a row (each time producing code that fails validation), you escalate further back to pre-check. The pre-check step re-evaluates the entire approach with all accumulated errors as context.

I think of it like debugging in real life. You hit a bug, you fix the line. Fix doesn't work, you look at the function. Function looks fine, you look at the architecture. Each level up gives you a wider view.

validate ✗ → implement (try 1)
validate ✗ → implement (try 2)
validate ✗ → implement (try 3)
  implement loop exhausted
  → escalate to pre-check with full error history
pre-check re-evaluates
implement (with new approach)
validate ✓

The error history accumulation is important. By the time pre-check runs again, it has three different implementation attempts and three different validation failures. That's a lot of context about what doesn't work. The agent can use that to find a fundamentally different approach instead of iterating on a broken one.
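The ladder itself can be sketched as two nested loops, with the error list accumulated across every level. `run_step` here is a stand-in for whatever executes an agent step, and the shapes of its return values are assumptions for the example:

```python
# Sketch of the escalation ladder: retry implement up to MAX_LOOPS with
# accumulated errors, then escalate to pre-check with the full history,
# then give up and flag for a human. `run_step(name, payload, errors)`
# is a stand-in; for "validate" it returns an (ok, error_message) pair.

MAX_LOOPS = 3

def run_with_escalation(run_step, task):
    errors = []                          # accumulated, never overwritten
    plan = run_step("pre-check", task, errors)
    for attempt in range(2):             # pass 0: normal, pass 1: escalated
        for _ in range(MAX_LOOPS):
            code = run_step("implement", plan, errors)
            ok, err = run_step("validate", code, errors)
            if ok:
                return code
            errors.append(err)           # append: keep the full history
        # implement loop exhausted: re-plan with everything that failed
        plan = run_step("pre-check", task, errors)
    return None                          # still failing: flag for human review
```

Because `errors` only ever grows, the second pre-check run sees every failed attempt at once, which is exactly what it needs to pick a fundamentally different approach.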

The thing about ordering

Failure routing forces you to think about your pipeline as a directed graph, not a sequence. Steps don't just flow forward. They have backward edges for failures.

In nightloop.sh this was painful. I had a tangle of bash conditionals checking exit codes and jumping around with functions. It looked like spaghetti and it acted like spaghetti. One of the reasons I built Zowl was to make this visual. You can see the failure routes on the pipeline editor. Forward edges for success, backward edges for failure, each annotated with the routing condition.

When you see a pipeline drawn as a graph, the failure paths become obvious. You can spot dead ends (steps that fail with no route back), you can spot infinite loops (two steps routing to each other with no max), and you can spot missing escalations (implement routes back to implement but never escalates to pre-check).
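Once the pipeline is data instead of bash control flow, those checks are easy to automate. A toy lint, using a graph format invented for this example:

```python
# Toy lint for a failure-routing graph: flag dead ends (a step that
# fails with no route) and two-step loops with no max_loops cap.
# The dict format is invented for this example.

def lint(graph):
    """graph: step -> {"on_fail": target or None, "max_loops": int or None}"""
    problems = []
    for step, cfg in graph.items():
        target = cfg.get("on_fail")
        if target is None:
            problems.append(f"dead end: {step} fails with no route")
            continue
        back = graph.get(target, {})
        if (back.get("on_fail") == step and step < target  # report each pair once
                and not (cfg.get("max_loops") or back.get("max_loops"))):
            problems.append(f"unbounded loop: {step} <-> {target}")
    return problems

graph = {
    "implement": {"on_fail": "pre-check", "max_loops": None},
    "pre-check": {"on_fail": "implement", "max_loops": None},
    "validate":  {"on_fail": None},
}
print(lint(graph))
```

Running it on the deliberately broken graph above surfaces both classes of problem: `validate` is a dead end and `implement <-> pre-check` can bounce forever.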

Practical advice

If you're setting up failure routing in your pipelines, here's what I've learned from running hundreds of them:

Always set a max loop count. Without it, two steps can bounce failures back and forth forever. I use 3 as the default. If three attempts at the same step can't fix it, the problem is upstream.

Don't route validation back to validation. Ever. If validation fails, the code is wrong. Checking the wrong code again won't make it right. Always route validation failures to implement or pre-check.

Accumulate errors, don't replace them. Each routing should append to the error context, not overwrite it. The agent needs to see the full history of what's been tried. "Attempt 1 failed because X. Attempt 2 failed because Y. Find an approach that avoids both."

Flag the third escalation for humans. If a task has gone through implement three times, escalated to pre-check, gone through implement three more times, and still fails? That task needs a person. Either the task description is wrong or there's a real ambiguity the agent can't resolve. Skip it, flag it, move on.

The overnight pipeline doesn't have to solve every task. It has to solve most tasks and clearly flag the ones it can't. Failure routing is how that flagging happens intelligently, instead of only after burning through all your retries at the wrong step. For more on handling failures with retries, see retry strategies for AI agents. And to understand the broader problem of one-shot agent execution, read stop one-shotting agents. If you want to set this up, Zowl handles failure routing automatically.