2026-06-05 · Chuan Liu

Your Process Will Die. The Task Shouldn't.

A durable execution architecture for long-running agents — and the idea behind it is older than agents

Long-running agents quietly entrust their execution state to the process — which is the least reliable thing in the whole system. The fix is older than agents.

There's a class of engineering problem that almost everyone building AI agents eventually hits, and almost nobody takes seriously before they have to.

A user kicks off a task. It isn't one question and one answer — the agent runs for minutes to tens of minutes, makes hundreds or thousands of model calls, burns real money in tokens. Then, at minute eighteen, the process dies. A deploy, an autoscale event, an OOM, a flaky host. Take your pick.

What evaporates at that moment isn't just time. It's compute you already paid for, intermediate results that were half-produced, progress the user believed was still moving. The whole task goes back to zero.

The naive setup — a synchronous request holding state in memory — falls apart the moment a task gets long. It runs straight into four problems that don't go away.

Processes die: any deploy or fault can zero out a task that's been running for twenty minutes.

Results can't be lost: anything produced mid-run that only lives in memory dies with the process.

Progress must be visible: the caller can't sit blocked for tens of minutes waiting on a result; it needs status at any moment.

Replicas split-brain: the moment you run more than one copy, two processes claim the same task and clobber each other's state.

These four are the same problem wearing four faces. Every one of them comes from a single buried assumption: that execution state belongs to the process. And the process is the least reliable thing in the entire system.

The whole architecture compresses to one sentence: an agent's execution state is a projection of the database, and the process is just disposable compute. The process can die or be swapped at any time. The task never dies.

Once you accept that premise, every design decision gets the same yardstick. Anything that exists only in process memory, and so vanishes when the process does, is a liability. Anything that matters has to land in durable storage first, with memory as nothing more than its cache.

Structurally it's three layers. A control layer owns how it runs: a model-driven main loop whose one extra job is to trigger a checkpoint every single step. A persistence layer owns not losing anything: snapshot the state, write results incrementally, and use a buffer-and-replay mechanism to survive a crash at the worst possible instant. A delivery layer owns being reliable to the outside world: events aren't POSTed the moment they're produced — they're written to a table first, then shipped asynchronously and retriably by a separate dispatcher.

Anyone can draw the diagram. What actually decides whether it survives production are a handful of less-obvious tradeoffs.

One: a snapshot stores pointers, not cargo

The easy mistake is to dump the entire execution context into the snapshot, which then balloons to tens of megabytes and becomes far too expensive to write every step. The right move is to subtract: store only what's needed to rebuild state — IDs of processed objects, counters, a progress cursor. The big business objects keep only their IDs in the snapshot and get re-fetched from the source on recovery. Do the subtraction, and writing every step finally becomes affordable.
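
Here is a minimal sketch of what "pointers, not cargo" could look like in Python. Every name in it (Snapshot, rehydrate, the field names, the SOURCE dict standing in for the real data source) is hypothetical and chosen for illustration; it is not the schema of any particular system.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Snapshot:
    """Only what is needed to rebuild state. All field names are illustrative."""
    task_id: str
    step: int = 0                                            # progress cursor
    processed_ids: list[str] = field(default_factory=list)   # IDs, never the objects
    produced_count: int = 0                                  # a counter

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Snapshot":
        return cls(**json.loads(raw))


def rehydrate(snapshot: Snapshot, fetch_by_id) -> dict:
    """On recovery, re-fetch the heavy objects from their source by ID."""
    return {oid: fetch_by_id(oid) for oid in snapshot.processed_ids}


# Stand-in for the real source (assumed to be a database or an API).
SOURCE = {
    "doc-1": {"body": "x" * 100_000},
    "doc-2": {"body": "y" * 100_000},
}

snap = Snapshot(task_id="task-42", step=2,
                processed_ids=["doc-1", "doc-2"], produced_count=2)
raw = snap.to_json()                      # small enough to write every step
restored = Snapshot.from_json(raw)        # what a replacement process would load
objects = rehydrate(restored, SOURCE.__getitem__)
```

The two documents are 100 KB each, but the serialized snapshot stays well under a kilobyte because it carries only their IDs.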

Two: the results table is the single source of truth, and you never trust memory

Every externally-visible number — progress, counts, how many items were produced — is computed live from the results table, never read from the convenient-looking variable in memory. The payoff: when the status endpoint says N items are done, the fetch endpoint returns exactly N rows. Even if some batch failed to persist, the numbers you expose stay honest — no claiming completion while the goods aren't actually there.
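
To make that concrete, here is a small sketch using SQLite from the Python standard library. The table layout and the status and fetch function names are assumptions for illustration. The point is only that both endpoints read the same table, so they cannot disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    "  task_id TEXT, item_key TEXT, payload TEXT,"
    "  PRIMARY KEY (task_id, item_key))"
)


def status(conn, task_id):
    """Progress is counted from the results table, not from a variable."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM results WHERE task_id = ?", (task_id,)
    ).fetchone()
    return {"task_id": task_id, "items_done": count}


def fetch(conn, task_id):
    """Returns exactly the rows that status() counted."""
    return conn.execute(
        "SELECT item_key, payload FROM results WHERE task_id = ? ORDER BY item_key",
        (task_id,),
    ).fetchall()


# The agent believes it produced three items, but only two were persisted.
in_memory_counter = 3
with conn:
    conn.executemany(
        "INSERT INTO results (task_id, item_key, payload) VALUES (?, ?, ?)",
        [("t1", "a", "A"), ("t1", "b", "B")],
    )

reported = status(conn, "t1")["items_done"]
rows = fetch(conn, "t1")
```

Here the in-memory counter says three, but the status endpoint reports two, which is what the fetch endpoint can actually hand over.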

Three: buffer-and-replay, so a crash still doesn't lose results

This is my favorite part, and the rule is one line: buffer first, persist second, clear only on success. Results go into an in-memory buffer, then a write is attempted; on success the buffer clears, on failure it's kept and replayed at the next step or on recovery. And that buffer is itself written into the snapshot — so even if the process dies the instant after a failed write, the process that takes over reads the pending work back out of the snapshot and replays it. Normal operation, transient failure, and a hard crash are all caught by the same mechanism.
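
The rule can be sketched in a few lines of Python. ResultBuffer and FlakyStore are hypothetical names, and the store is a deliberately unreliable stand-in for a real results table, so treat this as an illustration of the mechanism rather than a production implementation.

```python
import json


class FlakyStore:
    """Stand-in results store that fails a configurable number of writes."""

    def __init__(self, failures: int):
        self.failures = failures
        self.rows = {}  # keyed by item ID, so a replay overwrites instead of duplicating

    def write_batch(self, items):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("transient write failure")
        for item in items:
            self.rows[item["id"]] = item


class ResultBuffer:
    def __init__(self, pending=None):
        self.pending = list(pending or [])

    def add(self, item):
        self.pending.append(item)          # buffer first

    def flush(self, store) -> bool:
        if not self.pending:
            return True
        try:
            store.write_batch(self.pending)  # persist second
        except IOError:
            return False                   # keep the buffer for a later replay
        self.pending.clear()               # clear only on success
        return True

    def to_snapshot(self) -> str:
        return json.dumps({"pending": self.pending})

    @classmethod
    def from_snapshot(cls, raw: str) -> "ResultBuffer":
        return cls(json.loads(raw)["pending"])


store = FlakyStore(failures=1)
buf = ResultBuffer()
buf.add({"id": "r1", "value": 1})
snapshot = buf.to_snapshot()       # the pending buffer rides along in the snapshot
first_try = buf.flush(store)       # fails; the buffer is kept

# Simulate a crash right here: a new process restores from the snapshot.
recovered = ResultBuffer.from_snapshot(snapshot)
second_try = recovered.flush(store)
```

The first flush fails and the original process is assumed to die. The replacement reads the pending item back out of the snapshot and lands it on the second try.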

Four: persistence order is a causal constraint, not an arbitrary sequence

Snapshot must precede results: the snapshot carries that pending buffer, so writing it first makes the intent (I'm about to persist this batch) durable before the batch itself. Do it the other way, persisting results and then crashing before the snapshot is written, and the snapshot has no record of the batch. Recovery resumes from the older snapshot, doesn't know the batch landed, and you risk duplicates.

Results must precede the event: downstream comes to fetch results the instant it sees the event, so every result row has to be committed before the event fires. Otherwise downstream pulls a half-baked set, a silent under-delivery that's miserable to debug.

And every write is idempotent, so retries, replays, and corrections never create duplicate rows.
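
A sketch of that ordering, again with SQLite. The table names, the persist_step function, and the choice to commit each stage separately are my assumptions for illustration; idempotency here comes from primary keys plus INSERT OR IGNORE, which is one way to get it and not necessarily the only one.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE snapshots (task_id TEXT PRIMARY KEY, state TEXT NOT NULL);
CREATE TABLE results (task_id TEXT, item_key TEXT, payload TEXT,
                      PRIMARY KEY (task_id, item_key));
CREATE TABLE outbox (task_id TEXT, seq INTEGER, body TEXT,
                     PRIMARY KEY (task_id, seq));
""")


def save_snapshot(conn, task_id, seq, pending):
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO snapshots (task_id, state) VALUES (?, ?)",
            (task_id, json.dumps({"seq": seq, "pending": pending})),
        )


def persist_step(conn, task_id, seq, batch):
    # 1. Snapshot first: the intent to persist this batch becomes durable.
    save_snapshot(conn, task_id, seq, batch)
    # 2. Results second: idempotent, so a replay inserts nothing new.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO results (task_id, item_key, payload) VALUES (?, ?, ?)",
            [(task_id, r["key"], r["payload"]) for r in batch],
        )
    # 3. Event last: by now every row it announces is committed.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO outbox (task_id, seq, body) VALUES (?, ?, ?)",
            (task_id, seq, json.dumps({"type": "batch_ready", "count": len(batch)})),
        )
    # 4. Only after success does the snapshot drop the pending batch.
    save_snapshot(conn, task_id, seq, [])


batch = [{"key": "a", "payload": "A"}, {"key": "b", "payload": "B"}]
persist_step(conn, "t1", 1, batch)
persist_step(conn, "t1", 1, batch)   # a replay after a crash: same batch, same seq

(result_rows,) = conn.execute("SELECT COUNT(*) FROM results").fetchone()
(event_rows,) = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()
```

Running the same step twice leaves two result rows and one event, which is the property the replay path depends on.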

Five: concurrency safety comes from layered defenses, not one trick

Each replica has a globally unique identity so two of them can't be mistaken for each other. Claiming a task is an atomic compare-and-swap: only one wins, the loser exits immediately. The holding replica keeps asserting through a heartbeat that this task is still mine, and bows out the moment it notices ownership was stolen. And at the bottom, a database uniqueness constraint is the last line of defense: any duplicate that slips through is simply rejected.

Failover then becomes boring, which is the goal: one replica dies, its heartbeat stops, another scans for tasks that are still running but whose heartbeat has gone stale, claims one, rebuilds state from the latest snapshot, and keeps going. Nothing lost, nothing duplicated, and the exposed counts stay monotonic and in line with the results table.
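
The claim and the heartbeat can both be written as conditional updates. This sketch uses SQLite and passes the clock in explicitly so the behavior is easy to follow; the table layout, the 30-second staleness window, and the function names are assumptions, not details from a real deployment.

```python
import sqlite3
import uuid

STALE_AFTER = 30.0  # seconds without a heartbeat before a task is up for grabs (assumed)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, owner TEXT, heartbeat_at REAL)"
)
conn.execute("INSERT INTO tasks VALUES ('t1', 'running', NULL, NULL)")
conn.commit()


def try_claim(conn, task_id, replica_id, now):
    """Compare-and-swap: succeeds only if the task is unowned or its owner is stale."""
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET owner = ?, heartbeat_at = ? "
            "WHERE id = ? AND status = 'running' "
            "AND (owner IS NULL OR heartbeat_at < ?)",
            (replica_id, now, task_id, now - STALE_AFTER),
        )
    return cur.rowcount == 1


def heartbeat(conn, task_id, replica_id, now):
    """Returns False when ownership has moved, which tells the caller to stop."""
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET heartbeat_at = ? WHERE id = ? AND owner = ?",
            (now, task_id, replica_id),
        )
    return cur.rowcount == 1


replica_a = str(uuid.uuid4())   # globally unique identity per replica
replica_b = str(uuid.uuid4())

a_claims = try_claim(conn, "t1", replica_a, now=0.0)         # wins
b_loses = try_claim(conn, "t1", replica_b, now=10.0)         # owner is alive, so no
a_beats = heartbeat(conn, "t1", replica_a, now=20.0)         # still mine
b_takes_over = try_claim(conn, "t1", replica_b, now=60.0)    # heartbeat is 40s old
a_notices = heartbeat(conn, "t1", replica_a, now=61.0)       # ownership was stolen
```

Because the ownership check and the write happen in one UPDATE statement, two replicas racing for the same row cannot both see a row count of one.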

Six: outbound delivery uses a transactional outbox

Fire an event at the moment it's produced and a crash loses it. So the event is written to an outbox table first, and a separate dispatcher works it through a state machine: claim, deliver, mark done on success, back off and retry on failure with exponential delay, give up at the attempt ceiling or on a definitive client error. With a skip-the-locked-rows claim, multiple dispatcher replicas each take different events and never work the same one at the same time. Delivery is still at-least-once, because a dispatcher can crash after delivering but before marking done, which is why downstream has to be able to dedupe. A few details make or break it: sequence numbers strictly contiguous with no gaps, so downstream can reliably dedupe and order; backoff capped, so a brief downstream hiccup doesn't turn into a retry storm; and the signature signs the exact bytes that get sent, never letting a middle layer re-serialize and break it.
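
Three of those details are small enough to sketch as pure functions: the capped backoff, the state transition after a delivery attempt, and signing the exact bytes. The constants, the state names, the X-Signature header, and the decision to treat 408 and 429 as retryable are all assumptions for illustration. The skip-locked claim itself is left out because it needs a database that supports it.

```python
import hashlib
import hmac
import json

BASE_DELAY = 1.0     # seconds (assumed)
MAX_DELAY = 60.0     # cap on any single delay (assumed)
MAX_ATTEMPTS = 8     # attempt ceiling (assumed)


def backoff_delay(attempt: int) -> float:
    """Exponential delay for a 1-based attempt number, capped at MAX_DELAY."""
    return min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1))


def next_state(attempt: int, http_status: int) -> str:
    """Decide what happens to an outbox row after one delivery attempt."""
    if 200 <= http_status < 300:
        return "done"
    # A definitive client error will not get better with retries.
    if 400 <= http_status < 500 and http_status not in (408, 429):
        return "dead"
    if attempt >= MAX_ATTEMPTS:
        return "dead"
    return "retry"


def sign(secret: bytes, body: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def build_request(secret: bytes, event: dict):
    """Serialize once, sign those bytes, and send those same bytes."""
    body = json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return body, {"X-Signature": sign(secret, body)}


delays = [backoff_delay(n) for n in range(1, MAX_ATTEMPTS + 1)]

secret = b"demo-secret"
event = {"seq": 7, "type": "task.completed"}
body, headers = build_request(secret, event)

# What goes wrong if a middle layer re-serializes the same event differently:
reserialized = json.dumps(event).encode("utf-8")
still_valid = hmac.compare_digest(headers["X-Signature"], sign(secret, reserialized))
```

With these constants the delays run 1, 2, 4, 8, 16, 32 and then stay at 60 seconds, and the re-serialized body no longer matches the signature even though it decodes to the same event.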

Seven: distinguish nothing-was-there from the-system-broke

Without that distinction, downstream sees completed with zero results and has no way to tell whether to accept it or retry. So at the terminal-state check you look at the accumulated failure count: if the production stage failed broadly, or the input stage couldn't get data at all, the task is honestly marked failed rather than completed, and a corresponding failure event fires. Try to save it with retries first; if it can't be saved, admit it — don't paper over it with an empty result.
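
One way to express that terminal-state check, as a sketch. The Tally fields and the 50% failure threshold are hypothetical; what "failed broadly" means will depend on the task.

```python
from dataclasses import dataclass

FAILURE_RATIO_LIMIT = 0.5  # assumed threshold for "the production stage failed broadly"


@dataclass
class Tally:
    """Counters accumulated over the run. Field names are illustrative."""
    inputs_fetched: int
    input_errors: int
    items_attempted: int
    items_failed: int


def terminal_state(t: Tally) -> str:
    # The input stage could not get data at all: the system broke.
    if t.inputs_fetched == 0 and t.input_errors > 0:
        return "failed"
    # The production stage failed broadly.
    if t.items_attempted > 0 and t.items_failed / t.items_attempted >= FAILURE_RATIO_LIMIT:
        return "failed"
    # Otherwise the result is honest, even if it is empty.
    return "completed"


nothing_there = terminal_state(Tally(0, 0, 0, 0))
input_broke = terminal_state(Tally(0, 3, 0, 0))
production_broke = terminal_state(Tally(10, 0, 10, 8))
mostly_fine = terminal_state(Tally(10, 0, 10, 1))
```

An empty input with no errors completes with zero results, while an empty input caused by fetch errors is marked failed, so downstream can tell the two apart.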

Every look-at-my-architecture post should carry some honest negatives, and this one's no exception.

For short tasks this is over-engineering. Putting the whole apparatus around a one-shot Q&A bot is just making work for yourself — re-run it and move on. It bets the entire reliability story on the persistence layer; if your database isn't solid, everything above is a castle in the air. It has write amplification: a write every step is not a low write rate, and you have to capacity-plan for it. And it does not make the model smarter — it only makes the system sturdier. Those are two different things.

So the shape it fits is specific: long-running, valuable-to-produce, can't-be-lost tasks that also need to report progress reliably to the outside world. If your problem is short, stateless, or cheaply recomputable, don't reach for this architecture; re-run the task and move on.

None of this is a new invention. It just takes a few well-worn ideas from distributed systems — snapshot-and-restore, the transactional outbox, atomic claiming, idempotent writes — and recombines them around the grain of an agent.

But I'm increasingly convinced that a long-running AI agent is fundamentally a distributed-systems problem, not a prompting problem. The model handles smart. Reliable has to be earned, one engineering layer at a time. Assume the process can die at any moment, and the system comes out unexpectedly tough.

If you've hit this exact wall, I'd love to hear how you worked around it. If you think the abstraction itself is wrong, I'd love to hear that even more.