
Session history: your overnight black box recorder

Every step, every token, every output logged. When something goes wrong at 3 AM, you have the full transcript.

The morning ritual I hated

For the first two months of running nightloop.sh, my mornings looked like this: wake up, open terminal, run cat nightloop.log | less, and start scrolling. Hundreds of lines of raw output, mixed together. Agent thinking out loud. Bash stderr. Token counts buried somewhere in the middle. No timestamps on half of it. ANSI color codes making everything unreadable.

If a task failed at 3 AM, I had to scroll through everything that ran between midnight and the failure, guess which output belonged to which task, and piece together what went wrong. Some mornings this took 20 minutes. Some mornings I gave up and just re-ran the task manually.

That's when I realized: if you're running agents overnight, logging isn't a feature. It's the entire foundation. Without it, you're launching rockets with no flight recorder.

What session history actually captures

In Zowl, every pipeline run produces a session history. Not a log file. A structured record of everything that happened, broken down by step.

Here's what gets captured for each step in the pipeline:

Step: pre-check
Status: PASSED
Agent: claude-code
Started: 2026-04-28T01:14:33Z
Finished: 2026-04-28T01:17:12Z
Duration: 2m 39s
Tokens in: 4,218
Tokens out: 1,847
Total tokens: 6,065

Output:
  Identified files: src/api/users.ts, src/lib/auth.ts, src/types/user.d.ts
  Existing patterns: Express middleware chain, Zod validation on inputs
  Plan: Add email verification endpoint following existing auth pattern
  Conflicts: None identified
  Recommendation: PROCEED

Every step looks like this. Pre-check, implement, validate. Each one with its own timing, token count, status, and full output. When you open a session in the morning, you're not scrolling through a wall of text. You're looking at a timeline of discrete events.
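
If it helps to picture the shape of that data, here's a minimal sketch of one step record as a Python dataclass. The fields mirror the example above, but the names and types are my own illustration, not Zowl's actual schema:

from dataclasses import dataclass

@dataclass
class StepRecord:
    # One record per pipeline step: pre-check, implement, validate.
    name: str          # "pre-check", "implement", "validate"
    status: str        # "PASSED", "FAILED", "SKIPPED"
    agent: str         # e.g. "claude-code"
    started: str       # ISO 8601 timestamp
    finished: str
    tokens_in: int
    tokens_out: int
    output: str        # the full captured output for this step

    @property
    def total_tokens(self) -> int:
        return self.tokens_in + self.tokens_out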

When things go wrong at 3 AM

Last month I had a pipeline run with 14 tasks. Woke up and saw that tasks 8 through 14 all failed. My first instinct was panic. Then I opened the session history for task 8 and saw this:

Step: pre-check
Status: PASSED
Duration: 2m 11s

Step: implement
Status: FAILED
Duration: 8m 43s
Error: Agent exceeded token budget (max: 50,000, used: 50,000)

Output (truncated):
  ...modifying src/api/payments.ts...
  ...adding new endpoint...
  ...wait, the existing payment module uses a different ORM pattern
  than what the PRD assumes. Let me refactor the existing code first
  to match...

Step: validate
Status: SKIPPED (upstream failure)

There it is. The agent hit the token budget because it decided to refactor existing code that wasn't part of the task. Pre-check should've caught this (the PRD assumed a pattern that didn't match reality), but the pre-check prompt wasn't specific enough about verifying ORM patterns.

Tasks 9-14 failed because they had a dependency on task 8. Cascading failure, but the root cause was clear in 30 seconds.

Without session history, I would've spent the morning re-running each task manually to reproduce the failure. Instead, I fixed the PRD for task 8, tightened the pre-check prompt, and re-ran the batch. Done before standup.
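
That triage is mechanical enough to script. Here's a rough sketch, assuming each task's session is a dict with an ordered list of step records like the ones above (the key names are illustrative): in a cascade, the earliest failed step in the timeline is almost always the root cause, so you just return the first one you hit.

def find_root_cause(sessions):
    # sessions: one per task, in execution order. Everything after the
    # first failure either failed or was skipped because of it.
    for session in sessions:
        for step in session["steps"]:
            if step["status"] == "FAILED":
                return session["task"], step["name"], step.get("error")
    return None

# e.g. ("task-8", "implement", "Agent exceeded token budget ...")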

When things go right

Here's the part people don't expect. Session history is just as useful when everything works.

I had a pipeline last week where all 10 tasks passed on the first run. No failures, no retries. Instead of just moving on, I spent 15 minutes reading through the session histories. I wanted to know why they worked so well.

What I found: the pre-check steps were averaging only 1,800 tokens out. Short, focused, no rambling. The plans were tight. And the implement steps were following those plans closely, averaging 3.1 minutes each. The validate steps all passed without flagging issues.

That told me something about the PRDs. They were well-scoped, specific, and matched the actual codebase. The pre-check prompts were dialed in. I saved those PRD templates and started reusing them. The success wasn't luck. It was a pattern I could replicate, and I only knew that because I could read the full session.
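
Reading ten sessions by hand is fine; reading a hundred isn't. Here's a quick sketch of the aggregate version, under the same assumed session shape as above:

from collections import defaultdict
from statistics import mean

def step_token_averages(sessions):
    # Average tokens-out per step name across a batch: a fast way to
    # see what a healthy run looks like, and to spot drift later.
    tokens_out = defaultdict(list)
    for session in sessions:
        for step in session["steps"]:
            tokens_out[step["name"]].append(step["tokens_out"])
    return {name: mean(values) for name, values in tokens_out.items()}

# -> {"pre-check": 1800.0, "implement": ..., "validate": ...}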

Token tracking changes how you write PRDs

One of the things session history tracks is token usage per step. Over time, this becomes a feedback loop for how you write tasks.

I noticed that my PRDs for "add a new API endpoint" tasks consistently used around 18,000 tokens in the implement step. But PRDs for "modify an existing endpoint" tasks were hitting 30,000-35,000 tokens for the same complexity. Why?

The session history told me. On modification tasks, the agent was spending tokens re-reading files it had already analyzed in pre-check. The implement prompt wasn't passing along the relevant file contents from pre-check. The agent was doing double work.

I updated the pipeline to include {{steps.pre-check.files}} in the implement prompt, so the agent received the pre-check's file analysis directly. Token usage on modification tasks dropped to 20,000-22,000. A 35% savings that I only found because I was reading the token counts in session history.

# Before: implement step re-reads everything
implement tokens (modification tasks): ~32,000 avg

# After: implement step receives pre-check context
implement tokens (modification tasks): ~21,000 avg

# Savings over a 10-task batch: ~110,000 tokens

That's real money over time. And I wouldn't have found it without per-step token tracking.
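
If you're wiring this up yourself rather than using the {{steps.pre-check.files}} template, the fix is plain string interpolation at prompt-build time. A sketch, where the files and plan keys on the pre-check output are my assumption, not Zowl's format:

def build_implement_prompt(prd, steps):
    # Hand the implement agent what pre-check already learned, so it
    # doesn't re-read (and re-pay for) the same files.
    pre_check = steps["pre-check"]["output"]
    return (
        f"Task:\n{prd}\n\n"
        "Files already analyzed in pre-check (do not re-read them):\n"
        f"{pre_check['files']}\n\n"
        f"Plan from pre-check:\n{pre_check['plan']}\n"
    )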

The debugging loop

When a task fails validation, here's the workflow I use with session history:

  1. Open the failed session
  2. Read the validate step output: what specific criteria failed?
  3. Read the implement step output: did the agent follow the pre-check plan?
  4. Read the pre-check step output: was the plan correct given the codebase?
  5. Find the broken link in the chain and fix it
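
The first pass of that loop scripts nicely. Here's a sketch that prints a session's steps in reverse, so you read validate first and walk backward, assuming sessions are stored as JSON with the step fields shown earlier:

import json

def print_failure_chain(path):
    # Walk the chain backward: validate -> implement -> pre-check.
    # The first step whose output contradicts the step after it is
    # the broken link.
    with open(path) as f:
        session = json.load(f)
    for step in reversed(session["steps"]):
        print(f"== {step['name']} [{step['status']}] ==")
        print(step["output"][:500])  # the first 500 chars usually tell you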

Most failures trace back to one of three things:

  • Bad PRD: the task description was ambiguous or assumed something about the codebase that wasn't true
  • Weak pre-check: the pre-check didn't identify a relevant pattern or constraint
  • Token budget: the implement step ran out of budget mid-task

Each of these has a different fix. Bad PRD means rewrite the task. Weak pre-check means update the pre-check prompt. Token budget means split the task into smaller pieces. Session history tells you which fix to apply.
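
You can automate the first cut of that decision, though only partially. A rough sketch; the error string it matches is illustrative, and the bad-PRD-versus-weak-pre-check call still needs a human reading the pre-check output:

def suggest_fix(session):
    for step in session["steps"]:
        if step["status"] == "FAILED":
            error = (step.get("error") or "").lower()
            if "token budget" in error:
                # Reliable signal: the implement step ran out of budget.
                return "split the task into smaller pieces"
            # Ambiguous: could be a bad PRD or a weak pre-check.
            return "read the pre-check output and compare it to the PRD"
    return "no failed step"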

Without it, every failure looks the same: "the agent didn't do what I wanted." That's not actionable. "The agent didn't follow the existing error handling pattern because pre-check didn't scan src/lib/errors.ts" is actionable. That's the difference.

It's the foundation, not a feature

I built session history into Zowl before I built retry logic. Before I built failure routing. Before I built the visual pipeline editor. It was the second thing I built after the basic pipeline engine itself.

Because here's the thing: you can't improve what you can't see. Retry logic is useless if you don't know what to retry differently. Failure routing is useless if you can't tell whether a failure was the agent's fault or the PRD's fault. Every optimization I've made to Zowl pipelines started with reading a session history and noticing something.

If you're running agents overnight without structured logging, you're making decisions based on vibes. "I think that task failed because..." No. Open the session. Read the transcript. Know exactly what happened. Then fix it.

Your agents are running while you sleep. The least they can do is keep a diary. To learn more about how to build robust systems, check out Zowl, where session history is the foundation of everything.