
The Anthropic blog post that validated everything I built

Anthropic published a post about effective harnesses for long-running agents. Their solution maps 1:1 to the NightLoop pipeline I built months earlier.

I almost dropped my coffee

A few weeks ago, Anthropic published a blog post about building effective harnesses for long-running coding agents. I read it standing in my kitchen at 7am and had to sit down.

Not because it was surprising. Because it was familiar. Uncomfortably familiar. Every recommendation they made, every pattern they described, was something I'd already built into Zowl months earlier. Not approximately. Almost word for word.

What Anthropic recommended

Their post laid out a framework for getting reliable results from agents that run for extended periods. Here's the core structure they described:

  1. An initializer agent that reads the codebase, understands context, and creates a plan before any code is written
  2. A coding agent that implements based on that plan, with clear scope and constraints
  3. Incremental progress tracking so you can see what happened step by step
  4. Testing and validation after implementation to verify the work is correct
  5. Structured logging of every step so you can debug failures after the fact

They also talked about the importance of keeping the initializer and the implementer as separate concerns. The agent that reads the code shouldn't be the same invocation that writes the code. Separate the thinking from the doing.

Now here's NightLoop

For context, this is the pipeline pattern I built into Zowl starting in late 2025, months before Anthropic published their post:

  1. Pre-check: reads the codebase, validates the task, identifies existing patterns, creates an implementation plan
  2. Implement: writes the code based on the pre-check output, scoped to specific files and patterns
  3. Validate: reviews the diff against the original task and acceptance criteria, runs tests

Each step logs its full output. Each step can use a different agent. Each step's output feeds into the next. Progress is tracked per-step with structured status (queued, running, passed, failed, skipped).
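
To make that shape concrete, here's a minimal sketch of the pattern. This is not Zowl's actual code: a hypothetical run_agent() stands in for whatever agent CLI or API you invoke, and the only piece of real syntax borrowed is the {{steps.<name>.output}} placeholder.

```python
from dataclasses import dataclass

# Hypothetical stand-in for whatever agent CLI or API you actually invoke.
def run_agent(prompt: str) -> str:
    raise NotImplementedError("plug your agent invocation in here")

@dataclass
class Step:
    name: str               # "pre-check", "implement", "validate"
    template: str           # prompt; may reference {{steps.<name>.output}}
    status: str = "queued"  # queued | running | passed | failed | skipped
    output: str = ""

def render(template: str, task: str, done: list[Step]) -> str:
    prompt = template.replace("{{task}}", task)
    for step in done:
        prompt = prompt.replace("{{steps." + step.name + ".output}}", step.output)
    return prompt

def run_pipeline(task: str, steps: list[Step]) -> list[Step]:
    done: list[Step] = []
    failed = False
    for step in steps:
        if failed:
            step.status = "skipped"   # never build on top of a failed step
            continue
        step.status = "running"
        try:
            step.output = run_agent(render(step.template, task, done))
            step.status = "passed"
            done.append(step)
        except Exception:
            step.status = "failed"
            failed = True
    return steps
```

With that in place, the three NightLoop steps are just three Step entries: pre-check with no dependencies, implement referencing the pre-check output, and validate referencing both.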

See the overlap? Because I see the overlap.

Mapping their findings to Zowl

Let me break down the 1:1 correspondence, because it's almost eerie.

Anthropic: "Use an initializer agent to read the codebase before coding begins."

Zowl: That's pre-check. Literally the first step of every NightLoop pipeline. I built this because my bash script kept generating code that ignored existing patterns. The agent would rewrite utilities that already existed. Pre-check fixed it. I wrote a whole blog post about how this saves 40% on tokens.

Anthropic: "Separate the planning phase from the implementation phase."

Zowl: Pre-check and implement are separate steps with separate agent invocations. I learned this the hard way: when you let the same invocation plan and implement, it tends to skip the plan or half-plan while already writing code. Forcing a hard boundary between "think about it" and "do it" made everything more reliable.

Anthropic: "Implement with clear scope constraints based on the plan."

Zowl: The implement step receives {{steps.pre-check.output}} as context. It doesn't get to freelance. It knows which files to touch, which patterns to follow, and what the expected output looks like. All of that comes from pre-check.
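
For illustration only (the wording and rules below are mine, not Zowl's; only the placeholder syntax comes from the pipeline itself), an implement-step template in that style might look like:

```python
# Hypothetical implement-step template. The rules are illustrative; the
# {{steps.pre-check.output}} placeholder is the piece that scopes the work.
IMPLEMENT_TEMPLATE = """\
Task: {{task}}

Here is the plan produced by the pre-check step. Follow it exactly:
{{steps.pre-check.output}}

Rules:
- Only touch the files listed in the plan.
- Reuse the existing utilities and patterns the plan identifies.
- Do not refactor anything outside the task's scope.
"""
```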

Anthropic: "Validate results with testing and verification."

Zowl: That's the validate step. I added this to nightloop.sh around version 0.5, after waking up to code that "looked fine in the diff" but broke tests. Validation compares the diff against acceptance criteria and runs the test suite. It's the last gate before a task gets marked as passed.
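
A sketch of that gate, reusing the hypothetical run_agent() helper from the earlier sketch and assuming your test suite runs under something like make test:

```python
import subprocess

def run_agent(prompt: str) -> str:  # same hypothetical helper as above
    raise NotImplementedError

def validate(task: str, acceptance_criteria: str, diff: str) -> bool:
    # Hard gate: the test suite has to pass. Swap in your project's own command.
    if subprocess.run(["make", "test"], capture_output=True).returncode != 0:
        return False
    # Soft gate: ask an agent to compare the diff against the acceptance criteria.
    verdict = run_agent(
        f"Task: {task}\n"
        f"Acceptance criteria: {acceptance_criteria}\n"
        f"Diff:\n{diff}\n\n"
        "Does the diff satisfy every criterion? Reply PASS or FAIL with reasons."
    )
    return verdict.strip().upper().startswith("PASS")
```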

Anthropic: "Track progress incrementally with structured logging."

Zowl: Every step logs its full transcript. Token counts, durations, agent output, pass/fail status. This is session history in Zowl. I built it because debugging overnight failures from a wall of unstructured bash output was impossible.
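
The record shape is roughly this; the field names below are illustrative, not Zowl's actual schema, but one JSON line per step is the simplest version of the idea:

```python
import json
import time

# Illustrative per-step record; field names are mine, not Zowl's schema.
def log_step(path: str, name: str, status: str, started: float,
             tokens_in: int, tokens_out: int, transcript: str) -> None:
    record = {
        "step": name,            # pre-check | implement | validate
        "status": status,        # queued | running | passed | failed | skipped
        "duration_s": round(time.time() - started, 1),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "transcript": transcript,  # the full agent output, not a summary
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # one JSON line per step, easy to grep later
```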

Not bragging. Validating.

I want to be clear about why I'm writing this. I'm not saying "I thought of it first" like some kind of patent troll. Anthropic has a large team of researchers who are way smarter than me. They arrived at these patterns through systematic research. I arrived at them through 3am frustration and a growing bash script that kept failing in new and exciting ways.

The fact that we ended up at the same place is the interesting part. It means these patterns aren't arbitrary design choices. They're convergent solutions. When you actually run agents on real work for long enough, you inevitably discover that:

  • Agents need to read before they write
  • Planning and implementation should be separate
  • You need to validate the output, not trust it
  • Logging everything is non-negotiable

I found these patterns by running hundreds of overnight pipeline runs on my own projects. Anthropic found them by studying how to make agents effective at scale. Same destination, different roads.

The part they got exactly right

The thing from the Anthropic post that hit hardest was their point about the initializer agent being separate from the coding agent. They specifically called out that keeping them in the same invocation leads to the planner getting lazy or the implementer ignoring the plan.

I remember the exact night I discovered this. nightloop.sh v0.2 had a single prompt that said something like "read the codebase, then implement this task." The agent would read two files, say "looks good," and start coding immediately. The "read the codebase" part became a formality. A checkbox it ticked in 10 seconds before doing what it was going to do anyway.

When I split it into two separate invocations (v0.3), the quality of the pre-check output jumped immediately. Because now pre-check's entire job was reading and planning. It couldn't jump ahead to coding because it literally didn't have that instruction. And the implement step got a real plan instead of a half-baked afterthought.
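
In prompt terms, the change looked roughly like this. These aren't the actual nightloop.sh prompts, just the shape of the v0.2-to-v0.3 split, again using the hypothetical run_agent() stand-in:

```python
def run_agent(prompt: str) -> str:  # same hypothetical helper as in the earlier sketch
    raise NotImplementedError

task = "add retry logic to the upload client"  # placeholder task for illustration

# v0.2 shape: one invocation that both plans and implements. The planning half
# degenerates into a checkbox the agent ticks before coding.
v0_2_result = run_agent(
    f"Read the codebase, then implement this task:\n{task}"
)

# v0.3 shape: two invocations with a structural boundary between them. The
# first one can't jump ahead to coding because that instruction isn't there.
plan = run_agent(
    "Read the codebase and produce an implementation plan for this task. "
    f"Do not write any code yet.\n\nTask: {task}"
)
v0_3_result = run_agent(
    f"Implement this task, following the plan exactly.\n\nTask: {task}\n\nPlan:\n{plan}"
)
```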

Anthropic's research confirmed what I'd observed: the boundary between thinking and doing needs to be enforced structurally. Not with prompt engineering. With actual architectural separation.

What this means for you

If you're running agents manually (copy-pasting tasks, watching the output, running tests yourself), you're going to independently discover these same patterns. Everyone does. The only question is how many wasted nights it takes.

The pipeline structure isn't some complex framework I invented or that Anthropic's research team conjured from theory. It's what everyone builds once they've been burned enough times. Read before you write. Validate after you implement. Log everything.

The Anthropic post just gave it an academic stamp. Zowl gives it a run button.

When the company that builds the model tells you "here's how to harness agents effectively" and their answer is the same architecture you've been shipping for months, that's a good sign. Not because it makes me smart. Because it means the architecture is right. The patterns are settled. The question isn't whether to use them. It's how fast you can set them up. That's where the pipeline-as-infrastructure approach comes in, and why Zowl was designed around this validated pattern from the start.