Tags: workflow, task-breakdown, pipelines

Stop one-shotting your AI agents

The biggest mistake devs make is giving the agent one massive prompt and hoping for the best. That's one-shotting. It fails.

The 4,000-word prompt that broke everything

A few months ago I watched a friend try to get Claude to build an entire feature in one shot. He opened the terminal, typed out a prompt that was basically a full spec, and hit enter. It was beautiful, honestly. Detailed requirements, example API responses, database schema, the works. About 4,000 words in a single message.

The agent ran for 12 minutes. It created 9 files, modified 4 others, and produced something that looked impressive in the diff. Then he ran the tests. Nothing passed. The auth middleware referenced a utility that didn't exist yet. The database migration had the right columns but wrong types. The frontend component imported a hook that the agent planned to create but never got around to because it ran out of context halfway through.

He spent the next three hours untangling it. I bought him a beer and told him about pipelines.

What one-shotting actually is

One-shotting means giving an AI agent a single, large prompt and expecting it to deliver a complete, working result in one pass. You describe the whole feature, the whole refactor, the whole migration, and the agent attempts everything at once.

It feels efficient. It's not.

Here's why it breaks down.

Context windows have limits. Not just the technical token limit, but the practical one. The more instructions you pack into one prompt, the more likely the agent is to lose track of details buried in paragraph six. It's like handing someone a 30-page spec and asking them to implement it from memory without looking back at the document. Important details get dropped.

Everything is coupled. In a one-shot prompt, step 5 depends on step 3, which depends on step 1. If the agent makes a wrong call at step 1, every subsequent step inherits that mistake. There's no checkpoint. There's no moment where something says "hey, step 1 is wrong, let's stop." The error compounds silently until the whole output is a tangled mess.

No validation until the end. You don't find out anything failed until everything's done. By that point, the agent has written 800 lines across a dozen files, and the bug could be anywhere. Debugging a one-shot output is worse than debugging code you wrote yourself, because you don't have the mental model of how it was built.

The pipeline alternative

A pipeline breaks the work into discrete, sequential steps. Each step runs independently with its own context, its own instructions, and its own validation. If step 3 fails, you know exactly where and why. Steps 4 through 10 don't run on a broken foundation.

The NightLoop pipeline in Zowl has three stages per task:

Pre-check → Implement → Validate

Pre-check reads the codebase and evaluates whether the task makes sense given the current state of the project. It catches problems before any code is written. Wrong file paths, missing dependencies, conflicting patterns. If the pre-check fails, the task stops immediately. Zero tokens wasted on implementation.

Implement is the actual coding step. But it only runs after the pre-check passes, which means the agent starts with a validated understanding of the codebase. It's not guessing about the project structure. It already looked.

Validate runs after implementation. It checks the diff against the acceptance criteria, runs the test suite, and checks for lint errors. If validation fails, the task is marked as failed with a specific reason. You see exactly what went wrong.

Three steps. Each one is a guardrail.
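The gate chain above can be sketched in a few lines. This is a minimal sketch, not Zowl's actual internals: the `preCheck`, `implement`, and `validate` stage functions are hypothetical stand-ins for whatever reads the repo, calls the agent, and runs the tests. The property that matters is that each stage only runs if the previous one passed:

```typescript
type StageResult = { ok: boolean; reason?: string };
type Stage = (task: string) => StageResult;

// Run one task through the three gates in order.
// The first failing gate stops the task immediately, so no
// tokens are spent on stages that would build on a bad foundation.
function runTask(
  task: string,
  preCheck: Stage,
  implement: Stage,
  validate: Stage
): { status: "done" | "failed"; failedAt?: string; reason?: string } {
  const stages = [
    ["pre-check", preCheck],
    ["implement", implement],
    ["validate", validate],
  ] as const;

  for (const [name, stage] of stages) {
    const result = stage(task);
    if (!result.ok) {
      // Fail with a specific stage name and reason, not a vague "it broke".
      return { status: "failed", failedAt: name, reason: result.reason };
    }
  }
  return { status: "done" };
}
```

Note that the failure result carries both *where* it failed and *why*, which is exactly what you want to read in the morning instead of a 12-minute diff.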

But the real trick is task size

Pipelines alone aren't enough. You also have to break the work down into the right-sized pieces.

Here's a real example from last week. I needed to add a notification system to a project. The feature included: a notifications table in the database, an API endpoint to create notifications, an endpoint to fetch and mark as read, a WebSocket channel for real-time delivery, and a frontend dropdown component to display them.

The one-shot version would be: "Build a notification system with database, API, real-time delivery, and frontend component."

That's five different concerns in one prompt. The agent has to make decisions about all five simultaneously. If it picks the wrong WebSocket library, the frontend component breaks. If the database schema is off, the API endpoints return wrong data. Everything is tangled.

Here's how I actually broke it down:

Task 1: Create notifications table
- Migration with columns: id, user_id, type, title, body, read, created_at
- Add index on (user_id, read, created_at)

Task 2: Create POST /api/notifications endpoint
- Accept: { user_id, type, title, body }
- Validate inputs, insert into notifications table
- Return the created notification

Task 3: Create GET /api/notifications endpoint
- Fetch notifications for authenticated user
- Support ?unread=true filter
- Return paginated results, newest first

Task 4: Create PATCH /api/notifications/:id/read endpoint
- Mark single notification as read
- Return 204

Task 5: Add WebSocket channel for real-time notifications
- On new notification insert, broadcast to user's channel
- Use the existing ws server in src/lib/ws.ts

Task 6: Build NotificationDropdown component
- Bell icon with unread count badge
- Dropdown shows last 10 notifications
- Click marks as read
- Subscribe to WebSocket channel for live updates

Six tasks instead of one. Each task has a single concern. Each task can be validated independently. If task 5 fails because the WebSocket setup is wrong, tasks 1-4 still produced working, mergeable code. I fix task 5's PRD, re-run just that one, and move on.
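The isolation property is the whole point, and it's simple enough to sketch. Assuming each task exposes some `run` function that reports pass/fail (a hypothetical shape, not Zowl's API), a failure lands in the failed bucket without touching the tasks that already succeeded:

```typescript
type TaskOutcome = { name: string; reason?: string };
type Runnable = { name: string; run: () => { ok: boolean; reason?: string } };

// Run every task independently. A failure in one task never rolls
// back or blocks the others; it just gets recorded with its reason.
function runAll(tasks: Runnable[]): { done: TaskOutcome[]; failed: TaskOutcome[] } {
  const done: TaskOutcome[] = [];
  const failed: TaskOutcome[] = [];
  for (const t of tasks) {
    const r = t.run();
    (r.ok ? done : failed).push({ name: t.name, reason: r.reason });
  }
  return { done, failed };
}
```

With the six notification tasks, a broken task 5 leaves tasks 1-4 and 6 as completed, mergeable work, and gives you one specific reason to fix.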

How small is too small?

I get this question a lot. Here's my rule of thumb: a task should touch no more than 3-4 files and produce a change that can be tested in isolation. If you can't describe a clear "done when" for the task, it's either too big or too vague.
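One way to make that rule of thumb concrete is a tiny lint on the task spec itself. The `files`, `doneWhen`, and `maxFiles` names here are my own illustrative choices, not fields Zowl defines:

```typescript
interface TaskSpec {
  title: string;
  files: string[]; // files the task is expected to touch
  doneWhen: string; // the testable "done when" statement
}

// Flag tasks that are too big (touch too many files) or too vague
// (no concrete completion criterion to validate against).
function lintTask(task: TaskSpec, maxFiles = 4): string[] {
  const problems: string[] = [];
  if (task.files.length > maxFiles) {
    problems.push(`touches ${task.files.length} files; split it up`);
  }
  if (task.doneWhen.trim() === "") {
    problems.push("no 'done when' criterion; too vague to validate");
  }
  return problems;
}
```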

Too big: "Build the settings page." That's a page, API endpoints, database fields, form validation, and probably auth checks. Five tasks minimum.

Too small: "Add a comma to line 47 of config.ts." You don't need a pipeline for that. Just do it.

The sweet spot is something like: "Create the GET /api/settings endpoint that returns the current user's preferences from the settings table, with a default fallback if no row exists." One endpoint, one file, testable, takes the agent maybe 3 minutes. Done.
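That sweet-spot task fits in one small function. This sketch is framework-agnostic on purpose: the `fetchRow` lookup and the default values are hypothetical, and a real endpoint would wire this into Express or Fastify and do the lookup asynchronously:

```typescript
interface Settings {
  theme: string;
  emailDigest: boolean;
}

// Hypothetical defaults used when the user has no settings row yet.
const DEFAULT_SETTINGS: Settings = { theme: "system", emailDigest: false };

// Core of a GET /api/settings handler: return the user's stored row,
// or fall back to defaults when no row exists. Kept synchronous so
// the sketch stays self-contained; a real DB lookup would be async.
function getSettings(
  userId: string,
  fetchRow: (userId: string) => Settings | null
): Settings {
  return fetchRow(userId) ?? DEFAULT_SETTINGS;
}
```

Notice how the "done when" is obvious: row present returns the row, row absent returns defaults. That's the whole validation step.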

The compound effect

Here's what people miss about pipelines vs. one-shotting. It's not just about reliability on a single task. It's about what happens over a full night.

Say you have 12 tasks to run overnight. With one-shotting, you'd maybe group them into 2-3 big prompts. If one big prompt fails, you lose 4-6 tasks worth of work. Morning is spent debugging and redoing.

With a pipeline, each of those 12 tasks runs independently through pre-check, implement, validate. If 2 tasks fail, you still get 10 completed. The failures have specific error messages from the validation step. You fix two PRDs, re-run just those two, and you're done before your second cup of coffee.

Over a month, that difference compounds. I tracked this in my own usage. One-shot approaches gave me maybe 50-60% usable output. The same work broken into pipeline tasks hit 85-90%. Same agent, same model, same codebase. The only variable was how I structured the work.

This isn't new

If you've worked with CI/CD, this pattern feels familiar. You wouldn't write a CI pipeline that does build + test + deploy + notify in a single shell command with no error handling. You break it into stages. Each stage has a clear input, a clear output, and a pass/fail gate.

AI agents need the same discipline. They're not magic. They're software executing instructions. Give them one massive instruction and they'll fail the same way a 500-line bash script with no error handling fails. Give them structured, sequential, validated steps and they'll produce results you can actually use.

The tooling shouldn't matter for this advice. Whether you use Zowl or a bash script or something you built yourself, the principle is the same: stop one-shotting. Break the work down. Validate each step. Let the pipeline handle the sequencing.

Your agent isn't bad at its job. You're just giving it the whole job at once and wondering why it drops things. Break it up. You'll sleep better, literally. Learn more about building agent pipelines at Zowl.