Acceptance criteria AI agents actually understand
Most acceptance criteria are written for humans who can infer context. AI agents can't. Here's how to write criteria that actually work.
"It should work correctly"
I found this in my own task history from last October. Actual acceptance criterion I wrote for a pagination endpoint: "It should work correctly."
That's it. That was the whole criterion. I submitted it to the pipeline, went to bed, and woke up to an implementation that technically worked but used offset-based pagination when the rest of the codebase used cursor-based. No limit parameter validation. No empty-state handling. The agent did what I asked. It made something that "worked correctly." My definition of correctly and its definition of correctly were two different things.
This is the #1 skill gap I see when devs start using AI agents. They know how to write acceptance criteria for a human teammate who can infer context, ask follow-up questions, and check Slack history. An AI agent can't do any of that. It takes your words literally, fills the gaps with statistical guesses, and ships whatever comes out.
Bad criteria vs good criteria
Let me show you the difference with a real example. Say you need to add a search endpoint to your API.
Bad acceptance criteria:
```
### Done When
- Search works
- Results are returned correctly
- It handles errors
- Performance is good
```
Every single line here is useless. "Search works" is not a testable statement. "Results are returned correctly" raises the obvious question: correct according to what? "It handles errors": which errors? "Performance is good": compared to what?
A human dev reads that and fills in dozens of assumptions from experience. They know you probably want a 200 response with a JSON array. They'll probably add a try-catch. They know "performance is good" means don't do a full table scan. But they know these things because they've been building APIs for years and can read between the lines.
The agent reads between the lines too. Except its "between the lines" comes from the aggregate of all the code it's ever seen in training, which might not match your project at all.
Good acceptance criteria:
```
### Done When
- GET /api/products/search?q=term&page=1&limit=20 returns 200
  with { results: Product[], total: number, page: number, hasMore: boolean }
- Empty query string → 400 with { error: "Search query required" }
- No results → 200 with { results: [], total: 0, page: 1, hasMore: false }
- limit > 100 → clamp to 100 silently (don't error)
- Search uses the existing full-text index on products.name and products.description
- Response time < 200ms for queries returning < 100 results (test with existing seed data)
- New test file: src/__tests__/api/product-search.test.ts with at least 6 cases
- npm run test passes
- npm run lint passes with no new warnings
```
Look at the difference. Every criterion is testable. The agent can check each one after implementation and know with certainty whether it passed or failed. There's no ambiguity about what "works" means.
The three rules
After burning through a few hundred tasks during the nightloop.sh era, I landed on three rules for acceptance criteria that agents can actually use. They work even better when you have a solid PRD and a clear definition of what done means.
Rule 1: If a human would need to ask a clarifying question, rewrite it.
Read each criterion and pretend you're a new hire on day one. If you'd need to tap someone on the shoulder and ask "what do you mean by this?" then the criterion isn't specific enough. The agent won't tap you on the shoulder. It'll just guess.
Rule 2: Reference real paths, real functions, real commands.
Don't say "add tests." Say "add tests in src/__tests__/api/payments.test.ts using the existing createTestUser() helper from src/__tests__/helpers.ts." Don't say "follow existing patterns." Say "follow the pattern in src/api/users/route.ts for error handling and response format."
File paths are the most underused tool in acceptance criteria. The agent has access to your entire codebase. Pointing it at a specific file as a reference is like giving a contractor a photo of exactly what you want. It removes guesswork.
Rule 3: Include the verification command.
Every acceptance criteria block should end with the exact commands to run. Not "make sure tests pass" but npm run test -- --testPathPattern="product-search". Not "check for type errors" but npx tsc --noEmit. Give the agent the literal command it should run to verify its own work.
In Zowl pipelines, the validation step runs these commands automatically. But the agent still needs to know what success looks like during implementation, so it can self-correct before validation even starts.
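Self-correction is mechanical once the commands are explicit. A minimal sketch of what a self-check loop looks like, assuming a Node environment (the runChecks helper is hypothetical, not part of any pipeline tool):

```typescript
// Hypothetical helper: run each verification command from the acceptance
// criteria and report pass/fail based on exit code.
import { execSync } from "node:child_process";

interface CheckResult {
  command: string;
  passed: boolean;
}

function runChecks(commands: string[]): CheckResult[] {
  return commands.map((command) => {
    try {
      // execSync throws when the command exits non-zero
      execSync(command, { stdio: "pipe" });
      return { command, passed: true };
    } catch {
      return { command, passed: false };
    }
  });
}
```

The point isn't the helper itself; it's that "passed" is only computable because the criteria named exact commands with exact exit-code semantics.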
The pattern file trick
Here's something that made a big difference for me. Instead of repeating the same criteria patterns in every task, I created a CONVENTIONS.md file at the root of each project:
```
# Project Conventions

## API Endpoints
- All routes return { data, error, meta } envelope
- Success: 200 with data field populated
- Validation error: 400 with error field
- Not found: 404 with error field
- Auth required: 401 with error "Unauthorized"

## Testing
- Test files mirror source structure: src/api/foo.ts → src/__tests__/api/foo.test.ts
- Use createTestContext() from src/__tests__/helpers.ts for setup
- Minimum 4 test cases per endpoint: happy path, validation error, not found, auth

## Error Handling
- Use AppError class from src/lib/errors.ts
- Never expose stack traces in production responses
- Log all 500s with requestId from context
```
Then in my acceptance criteria I just reference it:
```
### Done When
- Follows all patterns in CONVENTIONS.md
- GET /api/orders/:id returns correct envelope format
- Handles missing order (404 per CONVENTIONS.md pattern)
- Test file at correct path per CONVENTIONS.md
- npm run test passes
```
The agent reads CONVENTIONS.md during the pre-check step and absorbs all those patterns. You don't have to repeat yourself in every task.
The acceptance criteria I'm most proud of
Last month I wrote a task for adding webhook retry logic to a payment integration. The acceptance criteria section was 23 lines long. Longer than the actual task description. It specified every status code, every retry interval, the exact table to log attempts to, the exact function signature, and the dead-letter behavior when retries were exhausted.
The agent nailed it on the first run. Zero rework. The implementation matched what I had in my head almost exactly, because I'd taken the time to get what was in my head onto the page.
That's the thing about acceptance criteria for agents. It feels slow to write them. It feels like busywork when you could just say "add webhooks" and let the agent figure it out. But "figure it out" is where your tokens go to die and your mornings go to rework.
Start here
If you take one thing from this post, let it be this: go look at the last task you gave an AI agent. Read the acceptance criteria (if you even wrote any). Ask yourself whether a brand-new developer with zero context about your project could implement it correctly from those criteria alone.
If the answer is no, that's your problem. Not the agent's.
Write the criteria you wish your agent could ask for. Be specific. Reference files. Include commands. And test your criteria by reading them cold, pretending you know nothing about the project.
It takes an extra 10-15 minutes per task. It saves hours of rework and thousands of tokens per pipeline run. And honestly, writing precise acceptance criteria has made me a better engineer even when I'm writing the code myself. Turns out, knowing exactly what "done" means before you start is a useful habit regardless of who's doing the implementation. If you're using these strategies in a pipeline, Zowl is built to enforce these patterns automatically.