When NOT to use AI agents
Agents are bad at UI design, complex state management, and performance optimization. Here's when to use them and when to do it yourself.
I sell an AI orchestrator and I'm telling you not to use it for everything
This might be a weird post coming from the guy who built Zowl. My whole product is about running AI agents on pipelines of tasks. I should be telling you to automate everything.
But I won't. Because I've run hundreds of pipeline tasks over the past year, and I've developed a very clear picture of where agents shine and where they faceplant. Pretending otherwise would be dishonest, and you'd figure it out yourself after wasting tokens on the wrong kind of work.
Where agents are bad
UI design decisions. I'm not talking about "create a button component." Agents can do that fine. I'm talking about "make this dashboard feel professional" or "design a good onboarding flow." Anything that requires taste, visual judgment, or understanding how a human experiences a screen. Agents can't see the app running. They produce code that's syntactically correct and visually mediocre. I've gotten layouts that technically matched the spec and still looked like a CS student's homework.
If your task requires someone to look at the result and make subjective decisions about spacing, color, hierarchy, or flow, do it yourself or work with a designer. The agent doesn't have eyes.
Complex state management across many files. When state lives in one file or one module, agents handle it well. When state flows across 8 files through context providers, custom hooks, reducers, and side effects, agents start dropping threads. They'll update the state in one place and forget to handle the downstream effect in another.
I had a task that involved adding a new filter option to a data table. Sounds simple. But the filter state was shared across a sidebar component, a URL query parameter sync, a Redux slice, and a caching layer. The agent updated the Redux slice and the sidebar, missed the URL sync entirely, and broke the cache invalidation. Four files that all needed to change in coordination, and it got two of them.
If your state graph looks like a bowl of spaghetti, the agent will make it worse.
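To make that concrete, here's a stripped-down sketch of the four touchpoints from the filter example. Every name here is invented for illustration, not pulled from the actual codebase. The point is that one logical change fans out into coordinated edits, and skipping any one of them still compiles:

```typescript
type Filters = { status?: "active" | "archived" };

// 1. Central filter state (in the real task this was a Redux slice)
let filters: Filters = {};

export function setFilters(next: Filters): void {
  filters = next;
  syncFiltersToUrl(filters); // 2. keep the URL shareable
  invalidateCache(filters);  // 3. drop stale cached pages
  renderSidebar(filters);    // 4. reflect the change in the sidebar
}

// 2. URL query parameter sync
function syncFiltersToUrl(f: Filters): void {
  const params = new URLSearchParams();
  if (f.status) params.set("status", f.status);
  console.log(`would update the URL to ?${params.toString()}`);
}

// 3. Cache invalidation
function invalidateCache(f: Filters): void {
  console.log("would invalidate cached table rows for", f);
}

// 4. Sidebar re-render
function renderSidebar(f: Filters): void {
  console.log("would re-render the sidebar with", f);
}

setFilters({ status: "active" });
```

The agent that did my filter task wrote the equivalent of the first and last pieces and quietly skipped the middle two.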
Performance optimization. Agents don't profile. They don't measure. They can't run your app and watch the waterfall chart. When you ask an agent to "optimize this component," it'll apply textbook patterns: memoize things, debounce inputs, lazy load components. Sometimes those help. Sometimes they make things worse because the actual bottleneck was a bad database query three layers down, not the React render.
Real performance work requires measurement, hypothesis, change, measurement again. It's iterative and empirical. Agents do "apply common patterns and hope." Those are different activities.
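Here's the measurement loop in miniature. The timings are simulated with setTimeout, not real profiling data; the point is that you find out where the time goes before you touch anything:

```typescript
// Times an async step and logs the result so you know where the time actually goes.
async function timeIt<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  const result = await fn();
  console.log(`${label}: ${(performance.now() - start).toFixed(1)} ms`);
  return result;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function main(): Promise<void> {
  // Three hypothetical layers of one request. If the query dominates,
  // memoizing the render (the textbook fix) barely moves the total.
  await timeIt("db query", () => sleep(120));
  await timeIt("serialize", () => sleep(5));
  await timeIt("render", () => sleep(8));
}

main();
```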
Anything that requires running the app and observing behavior. Animations. Scroll interactions. Race conditions that only appear under load. Responsive breakpoints that need visual checking. If the acceptance criterion is "it should feel smooth" or "it should look right on mobile," the agent has no way to verify its own work.
Where agents are great
Now the good news. There's a huge category of work where agents are not just fine but genuinely better than doing it manually.
CRUD endpoints. This is the agent sweet spot. "Create a REST endpoint for /api/users that supports GET, POST, PUT, DELETE with input validation and proper error codes." Clear spec. Predictable output. Easy to verify with tests. I've run dozens of these as pipeline tasks and the success rate is above 90%.
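Here's roughly the shape of output I mean, sketched with Express and zod (my choice of stack for illustration, not something the pipeline mandates):

```typescript
import express from "express";
import { z } from "zod";

const app = express();
app.use(express.json());

// Input validation schema: reject bad payloads with a 400 instead of a crash
const UserInput = z.object({ name: z.string().min(1), email: z.string().email() });
type User = z.infer<typeof UserInput> & { id: number };

const users: User[] = [];
let nextId = 1;

app.get("/api/users", (_req, res) => res.json(users));

app.post("/api/users", (req, res) => {
  const parsed = UserInput.safeParse(req.body);
  if (!parsed.success) return res.status(400).json({ errors: parsed.error.issues });
  const user: User = { id: nextId++, ...parsed.data };
  users.push(user);
  res.status(201).json(user);
});

app.put("/api/users/:id", (req, res) => {
  const user = users.find((u) => u.id === Number(req.params.id));
  if (!user) return res.status(404).json({ error: "not found" });
  const parsed = UserInput.safeParse(req.body);
  if (!parsed.success) return res.status(400).json({ errors: parsed.error.issues });
  Object.assign(user, parsed.data);
  res.json(user);
});

app.delete("/api/users/:id", (req, res) => {
  const idx = users.findIndex((u) => u.id === Number(req.params.id));
  if (idx === -1) return res.status(404).json({ error: "not found" });
  users.splice(idx, 1);
  res.status(204).end();
});

app.listen(3000);
```

Every line of that is easy to check with an automated test, which is exactly why the success rate is high.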
Test generation. Give an agent an existing function and tell it to write tests. It'll read the implementation, identify edge cases you missed, and produce a test file. It won't always catch the subtle stuff, but it'll get the obvious cases covered. I use this constantly for retroactive test coverage on code that shipped without tests.
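The kind of file that comes back looks something like this. `slugify` here is a stand-in for "code that shipped without tests," inlined so the snippet is self-contained (I run these with vitest, but any runner works):

```typescript
import { describe, expect, it } from "vitest";

// Stand-in for existing untested code; normally this lives in its own module.
function slugify(input: string): string {
  return input
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

describe("slugify", () => {
  it("lowercases and hyphenates", () => {
    expect(slugify("Hello World")).toBe("hello-world");
  });
  it("collapses repeated separators", () => {
    expect(slugify("a  --  b")).toBe("a-b");
  });
  it("handles empty input", () => {
    expect(slugify("")).toBe("");
  });
  it("strips leading and trailing punctuation", () => {
    expect(slugify("!!wow!!")).toBe("wow");
  });
});
```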
Refactoring. Rename a module. Extract a utility function. Convert callbacks to async/await. Move files and update imports. This is mechanical work that agents handle well because the scope is clear and the success criterion is "it still works the same way, just organized differently."
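A quick before-and-after of the callbacks-to-async/await case (the file name and function are invented for the example):

```typescript
import { readFile } from "node:fs";
import { readFile as readFileAsync } from "node:fs/promises";

// Before: nested callbacks
function loadConfigOld(cb: (err: Error | null, config?: unknown) => void): void {
  readFile("config.json", "utf8", (err, data) => {
    if (err) return cb(err);
    cb(null, JSON.parse(data));
  });
}

// After: same behavior, flatter control flow. The success criterion is
// "works exactly the same, just easier to read".
async function loadConfig(): Promise<unknown> {
  const data = await readFileAsync("config.json", "utf8");
  return JSON.parse(data);
}
```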
Documentation. "Read this module and write JSDoc comments for every exported function." Done in 2 minutes. Would've taken me 30. The output isn't poetry, but it's accurate and consistent, which is all you need from docs.
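The output looks something like this (`parseDuration` is illustrative, not from any real module):

```typescript
/**
 * Parses a human-readable duration string into milliseconds.
 *
 * @param input - A duration such as "5s", "2m", or "1h".
 * @returns The duration in milliseconds.
 * @throws {Error} If the input does not match the expected format.
 */
export function parseDuration(input: string): number {
  const match = /^(\d+)([smh])$/.exec(input.trim());
  if (!match) throw new Error(`Invalid duration: ${input}`);
  const unit = { s: 1_000, m: 60_000, h: 3_600_000 }[match[2] as "s" | "m" | "h"];
  return Number(match[1]) * unit;
}
```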
Boilerplate. New project scaffolding, config files, CI pipeline definitions, Docker setups. Anything where you're essentially filling in a template with project-specific details. Agents eat this up.
Isolated components. A date picker. A markdown renderer. A toast notification system. Components with clear inputs, clear outputs, and minimal dependencies on external state. Agents build these well because the scope is contained. Nothing outside the component needs to change.
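A minimal, framework-agnostic toast queue is a good example of why: the whole thing fits in one file with no external state (this API is invented for the sketch, not a real library):

```typescript
type Toast = { id: number; message: string; kind: "info" | "error" };
type Listener = (toasts: Toast[]) => void;

const toasts: Toast[] = [];
const listeners = new Set<Listener>();
let nextId = 1;

// Show a toast and auto-dismiss it after ttlMs
export function showToast(message: string, kind: Toast["kind"] = "info", ttlMs = 3000): void {
  const toast: Toast = { id: nextId++, message, kind };
  toasts.push(toast);
  notify();
  setTimeout(() => dismissToast(toast.id), ttlMs);
}

export function dismissToast(id: number): void {
  const index = toasts.findIndex((t) => t.id === id);
  if (index !== -1) {
    toasts.splice(index, 1);
    notify();
  }
}

// UI layers subscribe to render the current list; returns an unsubscribe function
export function subscribe(listener: Listener): () => void {
  listeners.add(listener);
  return () => {
    listeners.delete(listener);
  };
}

function notify(): void {
  for (const listener of listeners) listener([...toasts]);
}
```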
The pattern I use now
After burning tokens (and time) on tasks that shouldn't have been automated, I developed a simple filter. Before adding a task to a pipeline, I ask myself two questions:
Can the agent verify its own work? If there's a test suite, a linter, a type checker, or some other automated check that can tell the agent whether it succeeded, the task is probably a good fit. If the only way to verify is "look at it and decide if it's good," it's probably not.
Is the scope contained? If the task touches 1-3 files and has clear boundaries, go for it. If it requires understanding and modifying state that flows across a dozen files, think twice. The more files involved, the more likely the agent will miss a connection.
Good pipeline task:
- Touches 1-3 files
- Has automated validation (tests, types, lint)
- Clear acceptance criteria that a machine can check
- Doesn't require visual/subjective judgment
Bad pipeline task:
- Touches 5+ files with shared state
- Requires looking at the running app
- Success criteria are subjective ("looks good", "feels fast")
- Requires iterative measurement (performance, UX)
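If it helps, the two questions and the checklists collapse into a toy filter like this. The thresholds are my rules of thumb, not Zowl features:

```typescript
interface TaskPlan {
  filesTouched: number;
  hasAutomatedChecks: boolean;        // tests, types, lint the agent can run itself
  needsVisualJudgment: boolean;       // "looks good", "feels right on mobile"
  needsIterativeMeasurement: boolean; // profiling, UX tuning
}

function isGoodPipelineTask(task: TaskPlan): boolean {
  if (!task.hasAutomatedChecks) return false;       // it can't verify its own work
  if (task.needsVisualJudgment) return false;       // it doesn't have eyes
  if (task.needsIterativeMeasurement) return false; // it can't measure
  return task.filesTouched <= 3;                    // keep the scope contained
}

// Example: a contained endpoint task passes, a cross-cutting state change wouldn't
console.log(isGoodPipelineTask({
  filesTouched: 2,
  hasAutomatedChecks: true,
  needsVisualJudgment: false,
  needsIterativeMeasurement: false,
})); // true
```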
If a task fails both questions, I do it myself. Not because I'm faster, but because the agent will produce something that looks done but isn't, and I'll spend more time debugging its attempt than I would've spent just writing the code.
The real skill is knowing the boundary
The best pipeline in the world can't fix a task that shouldn't be automated. I've seen people throw entire features at an agent and get frustrated when the output is mediocre. The problem wasn't the agent. The problem was the task.
Break the feature into pieces. Some of those pieces are agent work. Some are human work. The CRUD layer? Agent. The complex state orchestration that ties it together? You. The test coverage? Agent. The visual polish? You.
When I plan a pipeline run, roughly 60-70% of the tasks go to the agent and 30-40% stay with me. That ratio feels right. The agent handles the volume. I handle the judgment calls. Neither of us does the other's job.
I'd rather be honest about that split than pretend agents can do everything. They can't. But the stuff they can do? They do it at 3am while I'm asleep. And that's enough to change how you work, as long as you're sending them the right work. For more insights on running agents unsupervised, see our 30-day experiment, and learn about structuring work through proper task breakdown. To get started, try Zowl.