
Multi-agent pipelines: Claude for thinking, Codex for coding

Different agents are good at different things. In my pipeline, Claude handles pre-check and validation. Codex handles implementation.


One agent doesn't fit all steps

I spent months running Claude Code for every step in my pipeline. Pre-check, implement, validate. All Claude. It worked fine, but I kept noticing the same thing: the implementation step was slow. Not wrong, just slow. Claude would spend extra time reading context it had already analyzed in pre-check, double-checking its work mid-generation, being generally cautious. Which is great when you want careful reasoning. Not great when you have 15 tasks to push through overnight.

Then Codex CLI came out and I started testing it on implementation. Night and day for certain types of tasks. (Learn more about the comparison in Claude Code vs Codex CLI.)

Why different agents for different steps

Here's the mental model. A pipeline has three jobs:

  1. Pre-check: understand the codebase, validate the task, plan the approach
  2. Implement: write the actual code
  3. Validate: review what was written, run tests, check acceptance criteria

These are fundamentally different cognitive tasks. Pre-check is about reading, reasoning, and judgment. Implementation is about generation speed and code output. Validation is about critical analysis and comparison.
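The three steps can be sketched as a simple data flow: each step's output feeds the next step's prompt. A minimal sketch in Python, where the agent callables are stand-ins for shelling out to the real CLIs (the function and step names here are illustrative assumptions, not Zowl's actual API):

```python
# Sketch of a three-step pipeline with per-step agents.
# The agent callables are stubs; a real runner would invoke
# claude-code / codex-cli. Names are assumptions, not Zowl's API.

def make_pipeline(agents):
    """agents: dict mapping step name -> callable(prompt) -> output."""
    def run(task):
        # Step 1: pre-check reads the codebase and produces a plan.
        plan = agents["pre-check"](f"Plan this task: {task}")
        # Step 2: implement consumes the plan, produces a diff.
        diff = agents["implement"](f"Plan:\n{plan}\nImplement: {task}")
        # Step 3: validate compares plan vs. result.
        verdict = agents["validate"](
            f"Plan:\n{plan}\nDiff:\n{diff}\nTask: {task}"
        )
        return {"plan": plan, "diff": diff, "verdict": verdict}
    return run

# Stub agents just to show the shape of the data flow.
stub = {
    "pre-check": lambda p: "plan: touch src/auth.ts",
    "implement": lambda p: "diff: +1 -0 src/auth.ts",
    "validate": lambda p: "PASS",
}
result = make_pipeline(stub)("add logout endpoint")
```

The point is that the plan travels with the task: the implement and validate steps never have to rediscover context the pre-check already produced.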

Claude is better at understanding existing codebases. It picks up on naming conventions, project structure, patterns that were established months ago. When I ask it "read src/ and tell me what patterns exist for error handling," it gives me a structured, thoughtful answer that actually maps to reality.

Codex is faster at writing new code when the plan is already clear. Give it a well-scoped task with explicit file paths and a clear spec, and it'll generate working code faster than Claude does. It's more aggressive, less careful, which is exactly what you want when the careful thinking already happened in the previous step.

The config

In Zowl, each step in a pipeline can specify its own agent. Here's what a multi-agent NightLoop config looks like:

pipeline: nightloop
steps:
  - name: pre-check
    agent: claude-code
    prompt: |
      Read the codebase at {{project_root}}/src.
      For the following task, identify:
      - Existing code that should be reused
      - Files that will need modification
      - Naming conventions to follow
      - Potential conflicts or blockers
      Task: {{task.description}}
      Output a structured implementation plan.

  - name: implement
    agent: codex-cli
    prompt: |
      Using the implementation plan from pre-check:
      {{steps.pre-check.output}}
      Implement the following task: {{task.description}}
      Follow the file paths and patterns identified above.

  - name: validate
    agent: claude-code
    prompt: |
      Review the changes made in the implement step.
      Original task: {{task.description}}
      Pre-check plan: {{steps.pre-check.output}}
      Git diff: {{steps.implement.diff}}
      Check:
      - Acceptance criteria met
      - Existing patterns followed
      - Tests pass
      - No duplicate code introduced

The key field is agent:. That's it: one field per step, and you're not locked into a single agent across the pipeline.
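The {{...}} placeholders in those prompts resolve against the task and the outputs of earlier steps. Here's a rough sketch of how that interpolation could work; the dotted-path resolution rules are my assumption, not Zowl's documented templating engine:

```python
import re

def render(template: str, context: dict) -> str:
    """Replace {{dotted.path}} placeholders with values from context.
    A guess at the templating semantics, not Zowl's actual engine."""
    def lookup(match):
        value = context
        # Walk each segment of the dotted path into nested dicts.
        for key in match.group(1).strip().split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{(.+?)\}\}", lookup, template)

ctx = {
    "task": {"description": "add logout endpoint"},
    "steps": {"pre-check": {"output": "reuse useAuth hook"}},
}
prompt = render(
    "Plan: {{steps.pre-check.output}}\nTask: {{task.description}}", ctx
)
```

Either way, the important property is that {{steps.pre-check.output}} makes step outputs first-class inputs to later prompts, which is what lets you swap the agent per step without losing context.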

Why Claude for pre-check

Pre-check is the step where you need the agent to actually understand your codebase. Not just parse it, understand it. "We use this pattern for errors." "This utility already handles date formatting." "The naming convention is camelCase for functions, PascalCase for components."

Claude is better at this kind of contextual reasoning. When I point it at a directory and ask it to describe the patterns it sees, the output is reliable enough to feed directly into the next step. It catches things like "there's already a useAuth hook that handles this, don't create a new one."

I've tested Codex on this step. It's not bad. But it tends to skim rather than read. It'll list the files in a directory without really absorbing the patterns between them. For simple pre-checks on small codebases, that's fine. For a project with 200+ files and established conventions, you want the agent that actually reads carefully.

Why Codex for implementation

Once pre-check has produced a plan, the implementation step is mostly generation. The thinking is done. The plan is clear. The file paths are identified. Now you just need something that'll write the code fast.

Codex shines here. Give it a clear spec and explicit constraints, and it'll rip through the implementation. On a batch of 12 tasks last week, Codex averaged 3.2 minutes per implementation step. Claude averaged 5.8 minutes on the same types of tasks. Both produced working code at similar quality levels, because the pre-check plan was feeding them the same context.

That time difference adds up. On a 20-task overnight run, saving 2.5 minutes per task means the pipeline finishes almost an hour earlier. Which means I wake up with results at 6am instead of 7am, and I have more time to review before standup. The gain stacks with other techniques, like a pre-check step that saves tokens.

# Rough timing from last week's batch
# 12 tasks, same repo, same PRD quality

Claude (all steps):     ~8.1 min/task avg
Codex (all steps):      ~5.4 min/task avg
Mixed (Claude+Codex):   ~6.2 min/task avg

# Mixed is faster than all-Claude and more reliable than all-Codex
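Those per-step averages translate directly into overnight savings. A quick back-of-envelope using the implement-step numbers quoted earlier:

```python
# Back-of-envelope using the implement-step averages from the batch.
claude_impl = 5.8  # min per implement step, all-Claude
codex_impl = 3.2   # min per implement step, Codex
tasks = 20         # size of a typical overnight run

saved = (claude_impl - codex_impl) * tasks  # roughly 52 minutes
```

Which is where "almost an hour earlier" comes from: ~2.5 minutes saved per task, times 20 tasks.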

Why Claude for validate

Validation is the step where you need the agent to be skeptical. "Does this diff actually satisfy the acceptance criteria?" "Did the implementation follow the plan from pre-check?" "Are the tests passing for the right reasons?"

This is judgment work. Same reason Claude handles pre-check well. It's better at comparing two things (the plan vs the result) and spotting discrepancies. I've had Codex validate its own output and mark things as PASS that clearly didn't meet the criteria. It's too optimistic about its own work. Which makes sense if you think about it: you don't want the same agent grading its own homework with zero friction.

Claude as the validator catches things like "the implementation created a new utility function instead of using the existing one identified in pre-check." That's exactly the kind of issue I built validation to catch, and it only works if the validator is paying close attention.

The real unlock

Mixing agents isn't about brand loyalty or which company you like more. It's about recognizing that "write code fast" and "analyze code carefully" are different skills. Humans specialize. Teams have architects and implementers and reviewers. Your pipeline can do the same thing.

The NightLoop pattern (pre-check, implement, validate) was always designed around separating these concerns into steps. Making each step use a different agent was the natural next move. Claude thinks, Codex builds, Claude reviews.

I run about 80% of my pipelines this way now. The other 20% are all-Claude for tasks where the implementation is complex enough that I want careful reasoning at every step. Refactoring work, architecture changes, anything touching auth or data models. For those, speed doesn't matter as much as getting it right.

But for the standard nightly batch of feature work, bug fixes, and CRUD endpoints? Mixed pipeline, every time. Faster, cheaper, same quality. That's the whole pitch.