Your pipeline needs a testing step. Here's why.
Your agent wrote 500 lines overnight. Do you trust it? The validate step is the difference between shipping code and shipping bugs.
500 lines and a false sense of security
Your agent ran overnight. You open your laptop, check the diff. 500 new lines of code. Tests pass. Types check out. The commit message is even well-written.
Ship it?
No. Absolutely not. Not without a validate step that actually verifies the code does what you asked for. Because "compiles and passes tests" is a remarkably low bar when you think about it. Code can be type-safe, test-passing, and completely wrong.
The bug that taught me this
November 2024. I had nightloop.sh running a batch of 12 tasks for a client's API. One task was "add rate limiting to the /upload endpoint, max 10 requests per minute per user." The agent wrote the middleware, added tests, everything passed. I merged it that morning without a thorough review because I was in a rush.
Two days later the client reported that uploads were being blocked after 10 requests total. Not per user. Per server. The agent had implemented a global counter instead of a per-user counter. The test it wrote? Also tested the global counter. It passed because the test was wrong in the same way the code was wrong.
Task: "Add rate limiting, max 10 req/min per user"
What the agent wrote:
- Global in-memory counter
- Resets every 60 seconds
- Test checks: counter increments, resets after 60s
- All tests pass ✓
What the agent SHOULD have written:
- Per-user counter keyed by user ID
- Each user gets their own 60-second window
- Test checks: user A hits limit, user B still has quota
- That test would have caught the bug
The agent's tests verified its own implementation, not the acceptance criteria. It tested what it built, not what I asked for.
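To see how thin the line between the two is, here's a sketch of both implementations side by side, plus the acceptance-criteria test that separates them. The names and shapes here are illustrative, not the client's actual code:

// Hypothetical sketch: the global counter the agent shipped vs. the
// per-user counter the task asked for.

// What the agent wrote: one counter shared by every caller.
let globalCount = 0;
setInterval(() => { globalCount = 0; }, 60_000); // resets every 60s

function rateLimitGlobal(_userId: string): boolean {
  return ++globalCount <= 10; // user identity is never consulted
}

// What the task asked for: an independent 60-second window per user.
const windows = new Map<string, { count: number; resetAt: number }>();

function rateLimitPerUser(userId: string): boolean {
  const now = Date.now();
  const w = windows.get(userId);
  if (!w || now >= w.resetAt) {
    windows.set(userId, { count: 1, resetAt: now + 60_000 });
    return true;
  }
  return ++w.count <= 10;
}

// The acceptance-criteria test: user A exhausting their quota must not
// touch user B's. The global implementation fails exactly here.
for (let i = 0; i < 10; i++) rateLimitPerUser("userA");
console.assert(!rateLimitPerUser("userA"), "A should be over quota");
console.assert(rateLimitPerUser("userB"), "B should still have quota");

The last two assertions encode the acceptance criterion directly. The agent's own test never did, which is why it couldn't fail.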
Validate is not "run the tests"
This is the distinction most people miss. Running npm test is a check. It tells you the code doesn't crash. But validation is about matching output to intent.
In the NightLoop pipeline, the validate step does something different from just running a test suite. It reads the original task description, reads the diff, and then checks whether the implementation actually satisfies the acceptance criteria. It's a semantic check, not just a syntax check.
Pre-check: "Read the codebase, understand the task, flag issues"
Implement: "Write the code"
Validate: "Does the code actually do what the task asked for?"
That third step is where the magic happens. The validator has access to both the original prompt and the resulting code. It can catch mismatches that no unit test would find because the unit test was written by the same agent that misunderstood the requirement.
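To make that concrete, here's a minimal sketch of what a semantic validator amounts to. This is illustrative, not Zowl's actual API; callModel stands in for whatever LLM client you use:

// Sketch of a semantic validate step: judge the diff against the
// original task and criteria, not against the code's own tests.

interface ValidationResult {
  pass: boolean;
  reasons: string[]; // criterion-by-criterion findings, for retry context
}

async function validate(
  task: string,       // the original task description, verbatim
  criteria: string[], // acceptance criteria, stated up front
  diff: string,       // git diff produced by the implement step
  callModel: (prompt: string) => Promise<string>,
): Promise<ValidationResult> {
  const prompt = [
    "You are reviewing a code change against its original task.",
    `Task: ${task}`,
    "Acceptance criteria:",
    ...criteria.map((c, i) => `${i + 1}. ${c}`),
    "Diff:",
    diff,
    "For each criterion, state whether the diff satisfies it and why.",
    'End with exactly "PASS" or "FAIL".',
  ].join("\n\n");

  const answer = await callModel(prompt);
  return {
    pass: answer.trim().endsWith("PASS"),
    reasons: answer.split("\n").filter((line) => line.trim().length > 0),
  };
}

The essential move is that the validator gets the task and criteria verbatim, so it judges against intent rather than against whatever the implementing agent decided to test.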
Real bugs caught by validate
I've been running pipelines with validation for over a year now. Here's a sample of stuff the validate step caught that would have shipped without it.
The silent no-op. Task: "Add logging to all database queries." The agent added a logger import and a single log statement in the connection setup. It logged when the database connected but not individual queries. Without validation, this would've looked done. The validate step flagged it: "Task requires logging on all queries, but only connection events are logged."
The wrong file edit. Task: "Update the pricing component to show annual billing." The agent created a brand new PricingAnnual.tsx component instead of modifying the existing Pricing.tsx. The new component wasn't imported anywhere. Tests passed because no existing test referenced the new file. The validator caught it: "New component created but not integrated into any route or layout."
The partial implementation. Task: "Add search with filters for category, date range, and status." The agent built search with category and status filters but quietly skipped date range. Probably hit some complexity around date parsing and just... didn't do it. No error, no comment, just silently incomplete. The validator compared the acceptance criteria against the implementation and flagged the missing filter.
The copy-paste ghost. Task: "Refactor the email service to use templates." The agent refactored the email service but left the old inline HTML strings in a different file that also sent emails. Now two systems existed: one using templates, one using raw strings. The validate step caught the duplicate because it checked the full scope of changes against the task intent.
What a validate step actually looks like
In Zowl, the validate step runs after implementation. You can configure what it checks, but the default NightLoop template does this:
- Reads the original task and its acceptance criteria
- Reads the git diff from the implement step
- Runs the project's test suite
- Compares the diff against each acceptance criterion
- Outputs PASS or FAIL with specific reasons
If it fails, the pipeline doesn't just stop. It routes back to the implement step with the validation errors attached. The agent gets a second shot, but now it knows exactly what was wrong. Not "something failed," but "you implemented a global counter instead of per-user." This is where retry strategy matters in pipeline design; in Zowl, this validate-and-route behavior is built into the pipeline orchestration, so semantic bugs get caught automatically.
step: validate
checks:
  - run_tests: true
  - match_acceptance_criteria: true
  - check_for_unintended_changes: true
on_fail:
  retry: 0
  goto: implement        # with error context
  max_loops: 3
  fallback: [skip_task, flag_for_review]
The retry: 0 on validate is intentional. If validation fails, retrying validation won't change anything. The code is still wrong. You go back to implement, always.
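Wired by hand, that routing is a short loop. Here's a sketch that reuses the validate function from earlier; implement and callModel are hypothetical stubs, not Zowl internals:

// Sketch of the on_fail routing: a failed validation goes back to
// implement with the validator's findings attached, up to max_loops.

interface Task { description: string; criteria: string[]; }
type Outcome =
  | { status: "done"; diff: string }
  | { status: "flagged_for_review" };

// Hypothetical stubs for the other pipeline pieces.
declare function implement(task: Task, errors: string[]): Promise<string>;
declare function callModel(prompt: string): Promise<string>;

async function runTask(task: Task): Promise<Outcome> {
  const MAX_LOOPS = 3;
  let errorContext: string[] = [];

  for (let attempt = 1; attempt <= MAX_LOOPS; attempt++) {
    // Implement sees the previous loop's findings, so a retry means
    // "fix these specific problems", not "try again blind".
    const diff = await implement(task, errorContext);
    const result = await validate(task.description, task.criteria, diff, callModel);
    if (result.pass) return { status: "done", diff };
    errorContext = result.reasons; // e.g. "global counter instead of per-user"
  }

  // fallback: skip_task, flag_for_review
  return { status: "flagged_for_review" };
}

max_loops is the circuit breaker: after three failed round trips, the task gets parked for human review instead of burning tokens forever.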
The numbers
Before I added validation to nightloop.sh, about 15% of my overnight tasks had subtle bugs that I'd miss during morning review. Not crashes. Not type errors. Subtle stuff like wrong behavior, missing edge cases, partial implementations. The kind of bugs that make it to staging and then make you feel stupid.
After adding validation, that number dropped to around 3%. The remaining 3% are genuinely tricky issues where the validator also missed the nuance. Those misses are fair. A 5x reduction in the bugs that ship is not something I'd give back.
You're already paying for the tokens
I hear the objection: "validation adds another agent call, that's more tokens." Yes. One extra LLM call per task. For a 20-task pipeline, that's 20 additional calls. Maybe a couple bucks in API costs.
You know what costs more? Debugging a bug in production that the agent introduced. Or spending your morning review finding issues that a machine could have caught at 3am. Your review time is worth more than those tokens.
What I'd tell myself a year ago
If you're running any kind of overnight agent workflow and you don't have a validation step, you're essentially shipping code with no review. The agent is not your reviewer. The agent that wrote the code can't review the code. You need a separate pass that compares intent to output. This is the flip side of defining "done" for AI agents: acceptance criteria only earn their keep if something actually validates against them.
Pre-check, implement, validate. Three steps. The middle one gets all the attention, but the last one is what makes the whole thing trustworthy. Without it, you're just generating code and hoping for the best. And hope is not a deployment strategy.