#case-study #autonomous-agents #results

30 days of unsupervised AI agents: what I learned

A month of running pipelines overnight on real projects. 80% success on well-written PRDs. Here's the honest breakdown.


The experiment

In March I decided to track every single pipeline run across all my projects for 30 consecutive days. Not cherry-picking the good nights. Not skipping the weekends where I was sloppy with PRDs. Every run, every task, logged in a spreadsheet with the result and a short note about what happened.

I wanted real numbers. Not vibes, not "it mostly works," not a Twitter thread where I only show the wins. Just the data.

The setup

Three active projects during this period. A SaaS dashboard for a client, a personal side project (recipe app with AI-generated meal plans), and Zowl itself. Yes, I use Zowl to build Zowl. It's turtles all the way down.

Every night I'd load up between 3 and 15 tasks, all running the NightLoop pipeline: pre-check, implement, validate. Some nights I'd run a few during the day too, when I was impatient. Total across 30 days: 247 tasks submitted.

The raw numbers

Here's what the spreadsheet looked like at the end:

| Result | Count | % |
|---|---|---|
| Passed first run | 173 | 70% |
| Passed after PRD fix + re-run | 24 | 10% |
| Passed after minor manual edit | 19 | 8% |
| Failed, had to redo manually | 22 | 9% |
| Skipped (dependency failure) | 9 | 3% |

So 70% just worked on the first attempt. Another 18% needed a small nudge, either a PRD tweak and re-run, or a quick manual edit to the output. Total usable output: roughly 88%.

The 9% that completely failed? Those are the interesting ones.

What worked every time

Some categories had near-perfect success rates. I'm talking 95%+ first-run pass.

CRUD endpoints. If the PRD said "create a REST endpoint for X with these fields and these validations," the agent nailed it basically every time. This is the bread and butter. Well-defined inputs, well-defined outputs, predictable file locations.

Test generation. Writing tests for existing code was shockingly reliable. You point the agent at a module, tell it what to test and what edge cases to cover, and it writes tests that actually catch real bugs. I found three bugs in my own code this month because the generated tests failed on edge cases I hadn't considered.

Refactoring with clear rules. "Rename all instances of X to Y," "extract this repeated block into a shared utility," "convert these callbacks to async/await." Mechanical transformations with explicit before-and-after expectations. The agent is basically a very smart find-and-replace. It loves this stuff.

Documentation and types. Adding JSDoc comments, generating TypeScript interfaces from existing JavaScript, writing README sections from code. Low ambiguity, high success.
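The JS-to-TypeScript case is a good example of why this category is low ambiguity: the target shape is fully determined by the source. A hypothetical before-and-after:

```typescript
// Before: a plain JavaScript object documented only with JSDoc.
// (Hypothetical shape, for illustration.)
/**
 * @typedef {Object} MealPlan
 * @property {string} id
 * @property {string[]} recipeIds
 * @property {number} calories
 */

// After: the equivalent TypeScript interface the agent generates.
// Every field and type is already pinned down by the JSDoc above,
// so there's nothing to guess.
interface MealPlan {
  id: string;
  recipeIds: string[];
  calories: number;
}
```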

What failed consistently

UI that requires visual judgment. I tried a task that said "make the settings page look cleaner and more modern." What does "cleaner" mean? What does "modern" mean? The agent can't see the screen. It made changes that were technically valid CSS but looked worse. I tried this three separate times with different phrasings and gave up. Visual design tasks need mockups or very precise specs, not adjectives.

Complex multi-system architecture decisions. One task asked the agent to "design the caching layer for the API." That's not a task, that's a conversation. Where do you cache? Redis? In-memory? CDN? What's the invalidation strategy? What are the consistency requirements? The agent picked a reasonable approach, but it wasn't the right approach for my specific setup. Architecture needs human judgment about tradeoffs that aren't in the codebase.

Tasks with ambiguous acceptance criteria. Any PRD where "done" was subjective had a bad time. "Improve error handling" failed. "Add try-catch to all database calls in src/api/ and return structured error responses with status code, error type, and user-safe message" passed. Every time.

Anything touching multiple services simultaneously. A task that needed to update the API, the database schema, and the frontend in one shot was too much surface area. The agent would get one part right and fumble the coordination between them. Breaking it into three sequential tasks with proper dependencies solved it.
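The split looks something like this. The task/dependency shape below is hypothetical, not Zowl's actual config format; the point is that each task touches one surface and declares what must land first:

```typescript
// Hypothetical task list: one surface per task, explicit dependencies.
interface Task {
  id: string;
  dependsOn: string[];
}

const tasks: Task[] = [
  { id: "update-db-schema", dependsOn: [] },
  { id: "update-api", dependsOn: ["update-db-schema"] },
  { id: "update-frontend", dependsOn: ["update-api"] },
];

// Resolve an execution order so a task only runs after its
// dependencies have passed. Naive O(n^2) scan, fine at this scale.
function executionOrder(all: Task[]): string[] {
  const done = new Set<string>();
  const order: string[] = [];
  while (order.length < all.length) {
    const ready = all.find(t => !done.has(t.id) && t.dependsOn.every(d => done.has(d)));
    if (!ready) throw new Error("dependency cycle");
    done.add(ready.id);
    order.push(ready.id);
  }
  return order;
}
```

This is also why the "skipped (dependency failure)" row exists in the table: if the schema task fails, the API and frontend tasks never run, which is exactly what you want.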

The 80% rule

After crunching the data, I landed on a rule of thumb I keep coming back to. If the PRD is well-written, the success rate is around 80% first-run and 95% with one retry. If the PRD is vague, the success rate drops to maybe 30%.

The variable isn't the agent. It's you.

I went back and tagged every failed task with a root cause. Here's the breakdown:

Vague or incomplete PRD:         41% of failures
Task too large / too many files: 23% of failures
Missing architectural context:   18% of failures
Genuine agent mistake:           12% of failures
Flaky test / environment issue:   6% of failures

Only 12% of failures were actually the agent doing something wrong with good instructions. Everything else was my fault. That stung a little, but it's also good news. It means the fix is in my hands, not waiting on a model upgrade.

The token cost question

People always ask about cost. I won't pretend it's free. Running 247 tasks across 30 days burned through a meaningful amount of tokens. But here's how I think about it.

Each task that passes saves me somewhere between 15 minutes and 2 hours of focused coding time. Even at the conservative end, 197 successful tasks times 20 minutes each is about 65 hours of work done in 30 days. Most of that happened while I was sleeping or eating dinner or walking the dog.

The pre-check step actually saves tokens too. About 14% of my tasks got flagged and stopped at pre-check before the expensive implementation step even started. That's 14% of tasks where I would've burned full implementation tokens on something that was going to fail anyway. That one cheap step is core to keeping the pipeline's token bill sane.
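Spelled out, the back-of-envelope math from the last two paragraphs looks like this (197 is first-run passes plus PRD-fix passes; the 20-minute figure is my conservative estimate):

```typescript
// Time saved: 173 first-run passes + 24 passed after PRD fix = 197 tasks.
const passedTasks = 173 + 24;           // 197
const minutesSavedPerTask = 20;         // conservative end of 15 min - 2 h
const hoursSaved = (passedTasks * minutesSavedPerTask) / 60; // ~65.7 hours

// Token savings from pre-check: ~14% of 247 submitted tasks stopped
// before the expensive implementation step ran at all.
const submitted = 247;
const stoppedAtPrecheck = Math.round(submitted * 0.14); // ~35 tasks
```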

What changed in my workflow

By week two, I'd stopped thinking of the pipeline as something I "try" and started treating it as the default. My workflow now looks like this:

Morning: review last night's results, merge the passes, fix and re-run the failures. Afternoon: write PRDs for tonight's batch. Evening: load the pipeline, hit start, close the laptop. I've learned to write PRDs that agents actually understand, which has been key to improving success rates.

I write more PRDs than code now. That felt weird at first. But writing a detailed PRD takes 10-15 minutes, and the resulting code would've taken me an hour or more. The math just works.

Honest assessment

Zowl doesn't replace me. I still make every architecture decision, review every line of output, and write every PRD. What it replaces is the 6-8 hours of mechanical coding I used to do each day. The typing, the looking up syntax, the writing boilerplate, the running tests and fixing the obvious stuff.

Thirty days of data convinced me this isn't a gimmick. It's not perfect. It's not autonomous in the sci-fi sense. But 88% usable output on real production code, running while I sleep? I'll take that over typing it myself at midnight any day of the week.

The spreadsheet's still open. Day 31 started last night. If you're running agents without a pipeline like this, check out Zowl. It's built specifically for this overnight workflow.