ADR-0055 · Checkpoint Persistence and Resumability
Status: Accepted · Date: 2026-05-17
Context
A run that exists only in process memory is hostile to production: a redeploy mid-run destroys work, a long-running approval requires keeping a connection open, and a crash means re-running an idempotent-but-expensive sequence from scratch. Frameworks that bolt on checkpointing as an afterthought either snapshot too little (just messages) or too much (random object graphs that don't survive a code change). Both fail on resume.
Decision
CheckpointManager is a first-class subsystem. Checkpoints are taken automatically at every
turn boundary and before any suspension. They contain the minimum sufficient state to
resume deterministically.
Checkpoint contents
{
  "runId": "...",
  "agentVersion": "sha256:...",   // hash of compiled agent definition
  "turn": 7,
  "eventLogPosition": 142,        // monotonic counter into the event log
  "workingMemory": { /* serialised */ },
  "pendingProposal": { /* if suspended */ },
  "memoryDelta": { /* diff vs run start */ },
  "metrics": { "tokensIn": 12000, "tokensOut": 3400, "toolCalls": 4 },
  "createdAt": "2026-05-17T..."
}
Differential storage
Only the delta since the previous checkpoint is stored; periodic full snapshots bound recovery cost. Payloads are gzip-compressed and content-hashed.
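As a rough illustration of the delta-plus-periodic-snapshot scheme, the sketch below stores a full snapshot every few checkpoints and a shallow key-level delta otherwise, gzip-compresses each payload, and tags it with a SHA-256 digest. The snapshot interval, the `set`/`unset` delta shape, hashing the uncompressed canonical JSON, and all function names are assumptions made for the example; the ADR does not specify them.

```python
from __future__ import annotations

import gzip
import hashlib
import json

FULL_SNAPSHOT_EVERY = 5  # assumed interval; the ADR does not give a number


def diff_state(prev: dict, curr: dict) -> dict:
    """Shallow delta: keys added or changed, plus keys removed."""
    changed = {k: v for k, v in curr.items() if k not in prev or prev[k] != v}
    removed = [k for k in prev if k not in curr]
    return {"set": changed, "unset": removed}


def apply_delta(base: dict, delta: dict) -> dict:
    out = {k: v for k, v in base.items() if k not in delta["unset"]}
    out.update(delta["set"])
    return out


def encode_payload(body: dict) -> tuple[str, bytes]:
    # Hash the canonical JSON rather than the gzip bytes, so the digest
    # does not depend on compressor settings (an assumption of this sketch).
    raw = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = "sha256:" + hashlib.sha256(raw).hexdigest()
    return digest, gzip.compress(raw)


def decode_payload(digest: str, blob: bytes) -> dict:
    raw = gzip.decompress(blob)
    if "sha256:" + hashlib.sha256(raw).hexdigest() != digest:
        raise ValueError("checkpoint payload failed content-hash check")
    return json.loads(raw)


def write_checkpoint(chain: list, prev_state: dict | None, state: dict) -> None:
    """Append a full snapshot or a delta to the chain of (digest, blob) pairs."""
    is_full = prev_state is None or len(chain) % FULL_SNAPSHOT_EVERY == 0
    if is_full:
        body = {"kind": "full", "state": state}
    else:
        body = {"kind": "delta", "delta": diff_state(prev_state, state)}
    chain.append(encode_payload(body))


def restore(chain: list) -> dict:
    """Walk back to the most recent full snapshot, then replay deltas forward."""
    pending = []
    for digest, blob in reversed(chain):
        body = decode_payload(digest, blob)
        if body["kind"] == "full":
            state = body["state"]
            break
        pending.append(body["delta"])
    else:
        raise ValueError("no full snapshot in chain")
    for delta in reversed(pending):
        state = apply_delta(state, delta)
    return state
```

With this layout, recovery never replays more than `FULL_SNAPSHOT_EVERY - 1` deltas, which is one way to read "periodic full snapshots bound recovery cost".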
Backends
- InMemoryCheckpointStore (default; tests, ephemeral runs).
- FileCheckpointStore (single-host dev).
- PostgresCheckpointStore (production; shared across processes; ships with the control plane).
- Pluggable: CheckpointStore interface.
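To make the pluggable interface concrete, here is a minimal sketch of what a CheckpointStore contract and its in-memory backend might look like. The ADR names the interface and the backends but not their methods, so `put` and `latest`, their signatures, and the keying by turn number are hypothetical.

```python
from __future__ import annotations

from typing import Optional, Protocol


class CheckpointStore(Protocol):
    """Hypothetical contract; the ADR names the interface but not its methods."""

    def put(self, run_id: str, turn: int, payload: bytes) -> None: ...

    def latest(self, run_id: str) -> Optional[tuple[int, bytes]]: ...


class InMemoryCheckpointStore:
    """Keeps checkpoints in a dict; suitable only for tests and ephemeral runs."""

    def __init__(self) -> None:
        self._runs: dict[str, dict[int, bytes]] = {}

    def put(self, run_id: str, turn: int, payload: bytes) -> None:
        self._runs.setdefault(run_id, {})[turn] = payload

    def latest(self, run_id: str) -> Optional[tuple[int, bytes]]:
        turns = self._runs.get(run_id)
        if not turns:
            return None
        turn = max(turns)  # highest turn number is the latest checkpoint
        return turn, turns[turn]
```

A file-backed or Postgres-backed store would implement the same two methods, so the resume path would not need to know which backend is in use.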
Resume algorithm
- Load the latest checkpoint for runId.
- Validate agentVersion; refuse to resume into an incompatible agent definition (offer a migration path).
- Re-hydrate working memory and memory delta.
- Re-attach event log; subsequent events resume the monotonic position.
- If pendingProposal is present, validate its content hash against the approval record, then execute or deny per resolution.
- Continue the loop.
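The resume steps above can be sketched roughly as follows, taking an already-loaded checkpoint dict in the shape shown under "Checkpoint contents". The approval record shape (`proposalHash`, `resolution`), the `ResumedRun` type, the exception name, and the choice to raise when no approval record exists are all assumptions for illustration. The sketch also assumes eventLogPosition points at the last event written, so the next event takes position + 1.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass
from typing import Optional


class IncompatibleAgentVersion(Exception):
    """Raised when the checkpoint was taken under a different agent definition."""


@dataclass
class ResumedRun:
    run_id: str
    turn: int
    next_event_position: int
    working_memory: dict
    memory_delta: dict
    proposal_action: Optional[str]  # "execute", "deny", or None if not suspended


def content_hash(obj: dict) -> str:
    raw = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(raw).hexdigest()


def resume(checkpoint: dict, current_agent_version: str,
           approval: Optional[dict] = None) -> ResumedRun:
    # Validate agentVersion before touching any state.
    if checkpoint["agentVersion"] != current_agent_version:
        raise IncompatibleAgentVersion(
            f"checkpoint {checkpoint['agentVersion']} != agent {current_agent_version}"
        )

    # Resolve a pending proposal against its approval record, if suspended.
    action = None
    proposal = checkpoint.get("pendingProposal")
    if proposal is not None:
        if approval is None:
            raise ValueError("run is suspended but no approval record was supplied")
        if approval["proposalHash"] != content_hash(proposal):
            raise ValueError("approval record does not match the pending proposal")
        action = "execute" if approval["resolution"] == "approved" else "deny"

    # Re-hydrate memory and re-attach the event log at the next position.
    return ResumedRun(
        run_id=checkpoint["runId"],
        turn=checkpoint["turn"],
        next_event_position=checkpoint["eventLogPosition"] + 1,
        working_memory=dict(checkpoint.get("workingMemory", {})),
        memory_delta=dict(checkpoint.get("memoryDelta", {})),
        proposal_action=action,
    )
```

Checking the proposal hash before acting means an approval granted for one proposal cannot be replayed against a different one after a restart.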
Invariants
- Tools are idempotent or guarded by an IdempotencyStore; resume never double-executes a side-effect.
- Event-log position is the source of truth; checkpoints can be regenerated from the log if necessary.
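One way to picture the first invariant is a guard that records each side-effect's result under a deterministic key, so a replay after resume returns the recorded result instead of running the effect again. The `run_once` method, the key format, and the in-memory dict are hypothetical; the ADR names IdempotencyStore but does not describe its API.

```python
from typing import Any, Callable, Dict


class IdempotencyStore:
    """Hypothetical in-memory stand-in; a production store would be durable."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run_once(self, key: str, effect: Callable[[], Any]) -> Any:
        # If this key has already completed, return the recorded result
        # without invoking the side-effect a second time.
        if key in self._results:
            return self._results[key]
        result = effect()
        self._results[key] = result
        return result


def idempotency_key(run_id: str, turn: int, call_index: int) -> str:
    # Assumed key scheme: stable across resumes because it derives only
    # from values that a checkpoint already records.
    return f"{run_id}:{turn}:{call_index}"
```

This simplified version does not cover a crash between running the effect and recording its result. Closing that gap usually means passing the key to the downstream system or recording the result in the same transaction as the effect.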
Consequences
Positive. Long-running, suspendable runs become routine. Redeploys and crashes don't lose work. The minimum-sufficient-state discipline keeps checkpoints small and version-tolerant.
Negative. Tool side-effects must be idempotent or guarded by an IdempotencyStore, though the Treasury layer already requires this.
Source
Internal ADR: docs/architecture/decisions/0055-checkpoint-persistence-resumability.md