What I'm running
I build ebatt.ai with a six-agent loop — Planner, Builder, Reviewer, Tester, Deployer, Docs — passing structured handoffs between Claude sessions. The pattern is Boris Cherny's; the implementation is mine. I call it IWO (Ivan's Workflow). Each agent runs in its own tmux pane, picks up the latest handoff for its role, does its work, and writes the next one. A daemon watches the handoff directory and dispatches the next claude -p call when a new handoff lands.
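A handoff in my setup is a small structured file on disk. The sketch below is purely illustrative; the field names, ids, and paths are stand-ins, not the real IWO schema:

```python
import json
from pathlib import Path

# Hypothetical handoff payload; the real IWO schema differs.
handoff = {
    "spec_id": "spec-0042",          # invented example id
    "from_role": "Builder",
    "to_role": "Reviewer",
    "summary": "Build complete; diff and test results attached.",
    "artifacts": ["diff.patch", "test-report.txt"],
}

# The daemon watches this directory; a new file landing here
# triggers the next claude -p dispatch for the target role.
Path("handoffs").mkdir(exist_ok=True)
Path("handoffs/spec-0042.reviewer.json").write_text(json.dumps(handoff, indent=2))
```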
For most of 2026 the daemon was a long-running Python process that kept everything in memory and persisted just enough to a .active-specs.json reconciliation file to survive a restart. It worked. It also failed in ways that wasted real money and time, and the failure modes were exactly the ones you would predict if you have ever shipped a stateful daemon: state in RAM, side effects without checkpoints, no durable wait primitive.
This is the story of replacing that daemon with DBOS and what changed for the way I develop ebatt.ai.
The four things the legacy daemon could not give me
Resuming a spec after a laptop reboot
If the daemon died mid-spec, I would come back to a partially advanced workflow with handoff files on disk that did not match the in-memory state. The .active-specs.json file was supposed to bridge the gap but it lagged the actual state by however long it had been since the last checkpoint, which was rarely the right amount. I'd hand-patch the file or just kill the spec and restart. Either way I lost work.
Waiting for a human approval that might take hours
The Reviewer agent often raises a blocker that needs my judgment — a scope question, a risk acknowledgment, a "do you actually want this dependency" call. The legacy daemon implemented this as a polling loop: every minute, check the directives directory for a file matching the spec. If the laptop slept overnight, the loop slept too, and the directive I dropped at 9 PM didn't get picked up until I unlocked the screen the next morning. Worse, if the daemon crashed during the wait, the spec was orphaned and had to be manually re-armed.
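Reconstructed from memory, the legacy wait looked roughly like this (not the actual daemon code):

```python
import time
from pathlib import Path

def wait_for_directive(spec_id: str, poll_seconds: int = 60):
    """Legacy-style polling wait. Its only state is this stack frame:
    if the process dies, the wait dies with it and the spec is orphaned."""
    while True:
        matches = sorted(Path("directives").glob(f"{spec_id}*.json"))
        if matches:
            return matches[0]
        time.sleep(poll_seconds)  # laptop asleep means this sleeps too
```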
Not paying twice for the same Claude call
claude -p calls are not free. The legacy daemon's failure window between "agent finished" and "handoff written + state checkpointed" was small, but real. A crash in that window meant the next start re-dispatched the same agent against the same input — same tokens, same dollars, same wall time, and, because the call is non-deterministic, a risk of getting a slightly different answer that the next agent would then be reasoning against.
Debugging a stuck spec without grep
When a spec got wedged, my debugging tool was grep against handoff files and Python logs. I could see what had happened. I could not easily see what the daemon thought was in flight, which agent it expected to dispatch next, or why it was waiting. Adding a metric meant editing daemon code and restarting, which restarted the very thing I was trying to debug.
Why DBOS was the right answer (and not Temporal, Inngest, or just-write-it-better)
I looked at the alternatives. The shortlist was Temporal, Inngest, and "fix the existing Python daemon harder."
Temporal is, for my context, too much machinery. It's a separate cluster I have to run, monitor, and reason about. It assumes I have a workflow team. I don't. I have a laptop and a Proxmox box.
Inngest is great if your workflows are HTTP-triggered and short-lived. Mine are long-lived (sometimes days), triggered by file-system events, and built around dispatching shell processes. Wrong shape.
Fixing the Python daemon harder meant building, in some form, exactly what DBOS already gives you: a durable execution log, idempotent step replay, and a wait primitive that survives restart. I would have ended up with a worse version of DBOS, written by me in my spare time, debugged in production.
DBOS is a library. You decorate Python functions. The state lives in Postgres tables I already understand. There is no separate service to run. There is no vendor lock-in — if I rip out DBOS tomorrow, my data is still in Postgres rows I can read. That asymmetry is the whole pitch.
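Concretely, decoration looks like this (a toy sketch, not IWO code; the function names are mine):

```python
from dbos import DBOS

@DBOS.step()
def fetch_something() -> str:
    # A side effect. DBOS checkpoints the return value in Postgres,
    # so a replay reuses it instead of re-running the function.
    return "result"

@DBOS.workflow()
def run() -> str:
    # Ordinary Python control flow; the decorator makes it resumable.
    return fetch_something()
```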
What changed for ebatt.ai development
The IWO rebuild runs in production shadow alongside the legacy daemon, processing real ebatt.ai specs. Four concrete differences I have measured.
Specs survive my laptop closing. I close the lid, get on a train, open the lid four hours later. Whatever spec was mid-flight when I closed picks up from its last completed step. No reconciliation file to hand-edit. No re-dispatch. The Postgres rows know exactly where the workflow was.
Approval prompts wait for me, not for the daemon's mood. When a Reviewer raises a blocker, the workflow calls DBOS.recv(timeout_seconds=...) and parks. I get a Telegram notification with the question. I respond when I respond. The deadline lives in Postgres. It is correct after a reboot, after a network blip, and after the daemon process is killed and restarted by systemd.
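The gate, sketched. DBOS.recv and DBOS.send are the real primitives; the topic name, message shape, and notify_telegram helper are my inventions:

```python
from dbos import DBOS

@DBOS.step()
def notify_telegram(question: str) -> None:
    ...  # hypothetical helper: post the blocker to my Telegram bot

@DBOS.workflow()
def approval_gate(spec_id: str, question: str) -> dict:
    notify_telegram(question)
    # Parks durably: the deadline and any buffered reply live in Postgres,
    # so the wait survives reboots, network blips, and systemd restarts.
    reply = DBOS.recv(topic=f"approval:{spec_id}", timeout_seconds=48 * 3600)
    return reply if reply is not None else {"approved": False, "reason": "timeout"}
```

The reply side is a DBOS.send(workflow_id, message, topic) from whatever handles the Telegram callback.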
No double LLM spend. Every claude -p dispatch is wrapped in a checkpointed step. If the workflow crashes after Claude returns but before the handoff is on disk, the replay on next startup sees the cached step output, skips the re-dispatch, and proceeds. I went from "occasionally pay for the same agent twice" to "never". Across a busy week of ebatt.ai work, that's a measurable saving.
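The wrapper is one decorator deep. A sketch, assuming an agent turn is a single claude -p invocation (the flags and helper names are illustrative):

```python
import subprocess
from pathlib import Path

from dbos import DBOS

@DBOS.step()
def dispatch_agent(handoff_path: str) -> str:
    # The step's output is checkpointed, so a crash between "Claude
    # returned" and "handoff written" replays from the cached output
    # instead of re-dispatching and re-spending the tokens.
    result = subprocess.run(
        ["claude", "-p", Path(handoff_path).read_text()],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```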
Debugging is SQL. When I want to know what's stuck, I run a SELECT against the workflow status table. I can see every workflow, its current step, the timestamp it last advanced, and the inputs it received. I have a small dashboard for this now. It took me an evening to build because the underlying data is just rows.
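A stuck-spec query might look like the one below. The state being plain Postgres rows is the point; treat the exact system table and column names here as assumptions to verify against the DBOS version you run:

```python
import psycopg

# Table and column names are assumptions about the DBOS system schema;
# check them against your installed version before relying on them.
STUCK_SPECS = """
SELECT workflow_uuid, name, status, updated_at
FROM dbos.workflow_status
WHERE status NOT IN ('SUCCESS', 'ERROR')
ORDER BY updated_at;
"""

with psycopg.connect("dbname=iwo_system") as conn:  # illustrative DSN
    for row in conn.execute(STUCK_SPECS):
        print(row)
```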
The pattern, generalised
Strip out the IWO specifics and the architecture is small; a sketch of the whole shape follows the list:
- Workflow body is deterministic glue. No I/O directly in the workflow function. Every side effect is a step.
- Steps are idempotent and checkpointed. A step writes its output to Postgres before returning. A replay sees the cached output and skips the work.
- Human gates are first-class. The wait primitive is durable — the deadline and the buffered response live in Postgres.
- State is queryable. Debugging is SELECT, not stack traces.
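Composed, the four bullets fit on one screen. Same caveats as the sketches above: every concrete name is a stand-in.

```python
from dbos import DBOS

@DBOS.step()
def run_agent(role: str, handoff: dict) -> dict:
    # Every side effect is a checkpointed step; stubbed here.
    return {"role": role, **handoff}

@DBOS.workflow()
def spec_pipeline(spec_id: str, plan: dict):
    # Deterministic glue: no I/O here, only control flow over steps.
    built = run_agent("Builder", plan)
    review = run_agent("Reviewer", built)
    if review.get("blocker"):
        # Human gate: durable wait, deadline and reply buffered in Postgres.
        decision = DBOS.recv(topic=f"gate:{spec_id}", timeout_seconds=48 * 3600)
        if not (decision and decision.get("approved")):
            return
    run_agent("Tester", built)
```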
This is not unique to multi-agent build pipelines. It applies to any long-running orchestration where the cost of a wrong replay (duplicate email, double charge, wasted LLM spend) is greater than the cost of one extra Postgres row per step. Which is most of them.
What it took to get here
The IWO rebuild has taken about three weeks of evening work, in phases, with the legacy daemon still running in production the whole time.
- Phase 0: agree the scope, draft the plan, stand up Postgres and let DBOS provision its schema automatically.
- Phase 1: port the state engine — what was a Python class became DBOS-decorated functions.
- Phase 2: port the dispatcher abstraction and the human-gate primitive. 9/9 unit tests passing. This is where DBOS earned its place — DBOS.recv replaced about 200 lines of polling loop and timeout bookkeeping with one call.
- Phase 3 (in flight): real claude -p dispatch via tmux panes, shadow-run against live ebatt.ai specs.
When the shadow-run has a week of clean production behind it, I cut over by repointing the launcher. The legacy daemon stays in the tree as a one-week rollback. After that, it's deleted.
What I would tell someone starting this today
If you're running an agentic build loop at any scale and you've felt any of the four pains above, the rebuild is worth it. Not because DBOS is magic, but because the failure modes of the in-memory daemon get worse the more you rely on it. Every spec you run is an opportunity to lose state at a checkpoint that wasn't there. The fix is not better discipline. The fix is durable execution.
If you're not yet at that scale, it's fine to stay on the simpler thing for now. The cutover is doable when you need it. What I would not do is build the durability primitives myself. The combinatorics of "what happens if this step crashes between these two writes" are nasty, and I am not smarter than the people who wrote DBOS.
The unglamorous truth
Most of what makes the IWO-DBOS rebuild useful is not the AI agents. It's the boring infrastructure underneath them — the durable wait, the idempotent step, the queryable state. The agents are the interesting part for a reader. The plumbing is the part that determines whether the system is one I can trust to run unattended on my laptop while I'm on a train.
Ebatt.ai gets shipped because the plumbing works. The agents are how it gets built; the plumbing is why I can sleep through a build cycle.