#completion-criteria #prd #agent-behavior

What 'done' means to an AI agent

Agents mark tasks as done when they stop encountering errors. That's not the same as done. Define it or regret it.

"Done" is a lie agents tell themselves

An AI agent considers a task done when it stops encountering errors. It writes some code, the linter doesn't scream, maybe a test passes, and it moves on. Task complete. Green checkmark. Ship it.

Except the function signature doesn't match the interface the rest of your codebase expects. And it added a dependency you already have a wrapper for. And the test it wrote checks that the function exists, not that it does the right thing.

The agent didn't fail. It just has a very different definition of "done" than you do.

I learned this the expensive way

Back in the nightloop.sh days, I had a task that said "add dark mode support to the settings page." Simple enough, right? I went to bed. Woke up. The agent had added a toggle. It toggled a CSS class. The class didn't exist anywhere in the stylesheet. But the code compiled and no tests failed, because there were no tests for the settings page.

The agent marked it done. Technically, it was. The toggle toggled. The fact that toggling it did absolutely nothing visible? Not the agent's problem. I never said "the theme should actually change when the user clicks the toggle." I assumed that was obvious.

Nothing is obvious to an agent.

What "done" actually means

When I write a task now, I force myself to answer one question before I submit it: how would a stranger verify this is done?

Not me. Not someone who knows the codebase. A stranger. Someone who can run commands and check outputs but has zero context about my intentions.

That question changes how you write acceptance criteria. Instead of "dark mode works," you get:

### Done When
1. Clicking the theme toggle in /settings switches the root
   `data-theme` attribute between "light" and "dark"
2. All components in src/components/ respect the data-theme
   attribute via CSS variables defined in globals.css
3. Theme preference persists in localStorage under key "theme"
4. Page load reads localStorage and applies the saved theme
   before first paint (no flash of wrong theme)
5. `npm run test` passes with no new failures
6. `npm run lint` shows no new warnings

That's a checklist a machine can work through. Each item is testable. Each item has a specific thing to verify. There's no ambiguity about what "works" means.
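To make that concrete, here's a minimal sketch of the logic that criteria 1, 3, and 4 pin down. This is illustrative, not from the original task: the `ThemeStore` interface and function names are mine, and I've abstracted over `localStorage` so the logic stands alone outside a browser.

```typescript
type Theme = "light" | "dark";

// Abstract over localStorage so the toggle logic is testable anywhere.
interface ThemeStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const THEME_KEY = "theme"; // criterion 3: persisted under the key "theme"

// Criterion 1: the toggle switches between exactly "light" and "dark".
function toggleTheme(current: Theme): Theme {
  return current === "light" ? "dark" : "light";
}

// Criterion 4: read the saved preference on load; default to "light".
function loadTheme(store: ThemeStore): Theme {
  return store.getItem(THEME_KEY) === "dark" ? "dark" : "light";
}

function persistTheme(store: ThemeStore, theme: Theme): void {
  store.setItem(THEME_KEY, theme);
}
```

In the page itself, criterion 4's "before first paint" clause means calling `loadTheme` from an inline script in `<head>` and setting `document.documentElement.dataset.theme` there, not in a framework effect that runs after render. Notice how each function maps back to a numbered criterion; a vaguer spec like "dark mode works" gives the agent nothing to map to.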

The optimism problem

Here's something I've noticed after running hundreds of pipeline tasks: agents are optimists. When they hit a fork in the road and the PRD doesn't specify which way to go, they pick the path that lets them keep writing code. They don't stop and ask. They don't flag uncertainty. They make a choice and keep going.

This is fine when the choice doesn't matter. It's a disaster when it does.

I had a task to "add input validation to the signup form." The agent validated the email format. Great. It did not validate password strength, because I didn't mention it. It did not add rate limiting, because that wasn't in scope. But it also didn't add an error message when validation failed. It just... prevented the form from submitting. Silently. No feedback to the user.

The agent's logic: validation means rejecting bad input. It rejected bad input. Done.

My expectation: validation means rejecting bad input and telling the user why. Two different things. I only specified one.
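The fix is to make the feedback part of the contract, not an afterthought. Here's a hedged sketch of the validation as I meant it, where the return type forces the caller to surface messages. The field rules and error strings are hypothetical; the point is the shape.

```typescript
interface ValidationResult {
  valid: boolean;
  errors: string[]; // user-facing messages: the part the agent skipped
}

function validateSignup(email: string, password: string): ValidationResult {
  const errors: string[] = [];
  // Deliberately simple email check; the regex isn't the point here.
  if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email)) {
    errors.push("Enter a valid email address.");
  }
  if (password.length < 12) {
    errors.push("Password must be at least 12 characters.");
  }
  return { valid: errors.length === 0, errors };
}
```

A "Done When" item like "invalid input shows a per-field error message" would have forced this shape. "Add input validation" did not.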

A framework that's worked for me

Every "Done When" section in my PRDs now follows a simple pattern. I stole it from QA, honestly.

State check: What should the system look like after this task? What data exists? What UI elements are visible? What config has changed?

Behavior check: What happens when a user (or another part of the system) interacts with the new code? What are the inputs and expected outputs?

Regression check: What should still work exactly the same as before? Which existing tests should pass? Which pages should render unchanged?

Negative check: What should explicitly not happen? What errors should not appear? What files should not be modified?

That last one is underrated. Telling an agent "do not modify any files in src/auth/" is sometimes more valuable than telling it what to do. Agents love to "fix" things they find along the way. Scope creep isn't just a human problem. Here's what all four checks look like combined, for a hypothetical endpoint task:

### Done When
- [ ] New endpoint returns 200 with valid payload
- [ ] New endpoint returns 422 with descriptive error on invalid input
- [ ] Existing test suite passes (`npm run test`)
- [ ] No files outside src/api/users/ were modified
- [ ] No new dependencies added to package.json
- [ ] Response shape matches the type in src/types/api.ts
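The negative checks are also the easiest to automate. A sketch of one: in practice you'd feed this the output of `git diff --name-only`, but the rule itself is a pure function. The function name is mine; the allowed path comes from the checklist above.

```typescript
const ALLOWED_PREFIX = "src/api/users/";

// Returns every changed file that violates the "no files outside
// src/api/users/ were modified" criterion. Empty array means pass.
function filesOutsideScope(changedFiles: string[]): string[] {
  return changedFiles.filter((f) => !f.startsWith(ALLOWED_PREFIX));
}
```

A nonempty result is a task failure with a built-in explanation: the offending paths are the error message.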

The PRD is the contract

I think of the "Done When" section as a contract between me and the agent. Everything above it in the PRD is guidance. It's context, it's helpful, the agent needs it. But the "Done When" section is the binding part. It's what gets checked during the validation step of the pipeline. For a more thorough understanding of acceptance criteria structures, I've written a dedicated post on the topic.

In Zowl, the validation phase literally takes the "Done When" criteria and asks the agent to verify each one. If any item fails, the task fails. It doesn't matter if the code looks beautiful or the implementation is clever. Did it meet the criteria? Yes or no.
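This is not Zowl's actual implementation, but the shape of such a validation pass is simple enough to sketch: each criterion becomes a named check, and a single failure fails the task. In a real pipeline each `check` might shell out to `npm run test` or ask the agent to verify a claim; here they're plain booleans.

```typescript
interface Criterion {
  description: string;      // the "Done When" line, verbatim
  check: () => boolean;     // how a stranger would verify it
}

function validate(criteria: Criterion[]): { passed: boolean; failures: string[] } {
  const failures = criteria
    .filter((c) => !c.check())
    .map((c) => c.description);
  return { passed: failures.length === 0, failures };
}
```

The useful property is that a failure report is written in your own acceptance language: the output of a failed run is the exact "Done When" lines that didn't hold.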

This is the part most people skip when they write tasks for agents. They describe the feature in detail but never describe what success looks like. It's like hiring a contractor and describing the kitchen you want but never mentioning it needs to pass inspection.

The litmus test

Before I submit any task to a pipeline now, I read just the "Done When" section in isolation. If I can't tell what the task was supposed to accomplish from the acceptance criteria alone, the criteria aren't specific enough.

Try it with your own tasks. Cover everything except the "Done When" section. Can you still tell what was being built? Can you verify each item with a command, a test, or a visual check?

If the answer is no, you're leaving "done" up to the agent. And the agent will choose the version of "done" that lets it stop working the fastest. Not out of laziness. Out of optimization. It solved the problem as stated. The problem is that you stated it wrong. Once your acceptance criteria are solid, you can integrate them into a testing framework. See pipeline testing for how I do this. And to get started building pipelines with clear completion criteria, check out Zowl.