Reactive State Machines: How lazily-rs Drives agent-doc's Cycle State
A finite state machine is one of the oldest tools in software. So why do so many of them end up as a scattered pile of if state.phase != Committed checks pasted across two dozen functions? This post is about a small primitive — lazily's StateMachine — and how it became the single transition authority for agent-doc's per-document cycle state, replacing implicit phase guards with one declared table that a compiler can check and a test can exhaust.
If you've read the lazily architecture tour, you know the library is a reactive graph: Cell → Slot → Signal → Effect, all owned by a Context. The state machine is built out of those same primitives — so a transition doesn't just mutate a variable, it invalidates a reactive graph that anyone can observe.
The bug: a cycle state with no spine
agent-doc drives an AI agent through a response cycle for a markdown session document. Every cycle moves through phases:
PreflightStarted → ResponseCaptured → WriteApplied → Committed
(Abandoned is a terminal)
That phase lives in a per-document JSON sidecar on disk (.agent-doc/state/cycles/<hash>.json) so a crashed or restarted process can recover. Each phase advance is a public mutator — mark_response_captured, mark_write_applied, mark_committed, plus a swarm of bookkeeping mutators that record pending ids, queue heads, and reaped items.
The problem was that the legality of a transition was implicit. It was enforced by:
- a monotonic rank guard (cycle_phase_rank) that only knew about the five phases, and
- hand-written if state.phase != X { return } checks duplicated across roughly two dozen mutators.
That left two classes of bug:
- Terminal regressions. A bookkeeping record_* call landing after mark_committed would happily write into an already-closed cycle. Each mutator had to remember to guard against it, and some paths forgot.
- Lost updates on the non-phase fields. The rank guard protected the phase, but the cycle state carries ~30 other fields (pending ids, queue heads, capture hashes). A mark_committed racing a record_pending_done_ids did load → mutate → save on the same file with no coordination, and one writer's field could silently overwrite the other's.
The atomic-rename persist (#lzsidecaratomic) already fixed torn reads — no reader ever sees a half-written file. But atomic writes don't fix lost updates between two complete load → mutate → save cycles. For that you need a transition authority: one place that decides whether an event is legal before anyone touches disk.
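To make the lost-update problem concrete, here is a minimal sketch that uses an in-memory value in place of the sidecar file. The CycleState and Sidecar types, the field names, and the id 7 are all hypothetical; only the load → mutate → save shape comes from the description above. Each save replaces the whole value, as an atomic rename would, and the interleaved run still loses the commit.

```rust
/// Hypothetical, trimmed-down cycle state: one phase plus one bookkeeping field.
#[derive(Clone, Debug, PartialEq)]
pub struct CycleState {
    pub phase: &'static str,
    pub pending_done_ids: Vec<u32>,
}

/// In-memory stand-in for the sidecar file. `save` replaces the whole value,
/// which models an atomic rename: a reader never sees a torn state.
pub struct Sidecar {
    stored: CycleState,
}

impl Sidecar {
    pub fn new() -> Self {
        Sidecar {
            stored: CycleState { phase: "WriteApplied", pending_done_ids: Vec::new() },
        }
    }
    pub fn load(&self) -> CycleState {
        self.stored.clone()
    }
    pub fn save(&mut self, state: CycleState) {
        self.stored = state;
    }
}

/// Two complete load -> mutate -> save cycles that overlap in time.
pub fn interleaved_writers() -> CycleState {
    let mut disk = Sidecar::new();
    // Both writers load the same snapshot before either one saves.
    let mut committer = disk.load();
    let mut bookkeeper = disk.load();
    committer.phase = "Committed";
    bookkeeper.pending_done_ids.push(7);
    disk.save(committer);
    // The second save carries the stale phase and overwrites the commit.
    disk.save(bookkeeper);
    disk.load()
}

/// The same two writers, run one after the other.
pub fn serialized_writers() -> CycleState {
    let mut disk = Sidecar::new();
    let mut committer = disk.load();
    committer.phase = "Committed";
    disk.save(committer);
    let mut bookkeeper = disk.load();
    bookkeeper.pending_done_ids.push(7);
    disk.save(bookkeeper);
    disk.load()
}
```

Neither run ever exposes a half-written value. The only difference is ordering, which is why the fix has to live in coordination rather than in the write path.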
The primitive: a Cell-backed state machine
lazily's StateMachine is deliberately small. It holds two things: a CellHandle<S> for the current state, and a pure transition function Fn(&S, &E) -> Option<S>.
    use lazily::{Context, StateMachine};

    // Debug is required by assert_eq! below; PartialEq backs the cell's equality guard.
    #[derive(Debug, PartialEq, Clone)]
    enum Light { Red, Green, Yellow }

    enum Tick { Advance }

    let ctx = Context::new();
    // The closure is the whole machine: (current state, event) -> next state.
    let m = StateMachine::new(&ctx, Light::Red, |s, _: &Tick| match s {
        Light::Red => Some(Light::Green),
        Light::Green => Some(Light::Yellow),
        Light::Yellow => Some(Light::Red),
    });

    m.send(&ctx, Tick::Advance);
    assert_eq!(m.state(&ctx), Light::Green);
The transition function is the entire machine. send evaluates it against the current state; Some(next) updates the cell, None rejects the event (a guard). That's the whole API surface for mutation.
Three properties fall out for free, because the state lives in a reactive Cell:
- Guards are just None. Illegal transitions are rejected by the transition function returning None. There's no separate "can I do this?" layer to keep in sync with the "do this" layer.
- No-op self-transitions are suppressed. The underlying Cell has a PartialEq guard, so a transition that returns Some(equal_state) is accepted but doesn't invalidate dependents. That makes duplicate terminal events (like a second Committed) idempotent rather than noisy.
- The state is reactive. state_handle() exposes the backing cell, so any ctx.computed, ctx.signal, or ctx.effect that reads it automatically recomputes when the machine transitions. No manual notification wiring.
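The first two properties can be sketched with nothing but the standard library. This is not lazily's implementation: MiniMachine, its invalidations counter, and the Phase and Event enums are hypothetical stand-ins. The counter models "dependents were invalidated" so the equality guard is visible without a reactive graph.

```rust
use std::cell::{Cell, RefCell};

/// Hypothetical miniature of a cell-backed machine (not lazily's real type).
pub struct MiniMachine<S, E> {
    state: RefCell<S>,
    // Counts how often dependents would have been invalidated.
    invalidations: Cell<u32>,
    transition: Box<dyn Fn(&S, &E) -> Option<S>>,
}

impl<S: PartialEq + Clone, E> MiniMachine<S, E> {
    pub fn new(initial: S, transition: impl Fn(&S, &E) -> Option<S> + 'static) -> Self {
        MiniMachine {
            state: RefCell::new(initial),
            invalidations: Cell::new(0),
            transition: Box::new(transition),
        }
    }

    /// Returns true if the event was accepted, false if the guard rejected it.
    pub fn send(&self, event: &E) -> bool {
        let next = {
            let current = self.state.borrow();
            (self.transition)(&*current, event)
        };
        match next {
            None => false, // guard: illegal transition, nothing changes
            Some(next) => {
                let changed = *self.state.borrow() != next;
                if changed {
                    // Only a real change counts as an invalidation.
                    *self.state.borrow_mut() = next;
                    self.invalidations.set(self.invalidations.get() + 1);
                }
                true
            }
        }
    }

    pub fn state(&self) -> S {
        self.state.borrow().clone()
    }

    pub fn invalidations(&self) -> u32 {
        self.invalidations.get()
    }
}

#[derive(Debug, Clone, PartialEq)]
pub enum Phase { Open, Committed }

pub enum Event { Commit, Record }

/// A two-phase machine: bookkeeping is legal only while the cycle is open.
pub fn demo_machine() -> MiniMachine<Phase, Event> {
    MiniMachine::new(Phase::Open, |s: &Phase, e: &Event| match (s, e) {
        (_, Event::Commit) => Some(Phase::Committed),
        (Phase::Open, Event::Record) => Some(Phase::Open),
        (Phase::Committed, Event::Record) => None,
    })
}
```

In this sketch a second Commit is accepted but leaves the counter untouched, while a Record after Commit is rejected outright. Those are the two behaviours the list attributes to the guard and to the PartialEq check.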
And because lazily ships the same primitive over three execution contexts, you pick the threading model by picking the type:
- StateMachine: single-threaded, RefCell-backed, the fast path.
- ThreadSafeStateMachine: Send + Sync, shares one reactive graph across OS threads.
- AsyncStateMachine: Tokio-backed; send/state stay synchronous (cells are the sync input layer), while observers like on_transition use async effects.
The fix: one transition table
agent-doc's CyclePhaseMachine wraps ThreadSafeStateMachine<CyclePhase, CycleEvent>. The entire legal lifecycle of a document cycle is now one pure function:
    pub fn transition_phase(current: &CyclePhase, event: &CycleEvent) -> Option<CyclePhase> {
        // Bring the phase variants into scope so the inner matches can name them bare.
        use CyclePhase::*;
        match event {
            // A new cycle may start from any phase.
            CycleEvent::StartPreflight => Some(CyclePhase::PreflightStarted),
            CycleEvent::ResponseCaptured => match current {
                PreflightStarted | ResponseCaptured => Some(ResponseCaptured),
                WriteApplied | Committed | Abandoned => None, // no backward edge
            },
            CycleEvent::WriteApplied => match current {
                PreflightStarted | ResponseCaptured | WriteApplied => Some(WriteApplied),
                Committed | Abandoned => None,
            },
            CycleEvent::Committed => match current {
                PreflightStarted | ResponseCaptured | WriteApplied | Committed => Some(Committed),
                Abandoned => None,
            },
            CycleEvent::Abandoned => match current {
                PreflightStarted | ResponseCaptured | WriteApplied => Some(Abandoned),
                Committed | Abandoned => None, // terminals are sticky
            },
            // ...recoverable timeouts + bookkeeping
        }
    }
Every public mutator now routes through that table first. A mark_committed that races a record_pending_done_ids no longer fights over fields: both submit a CycleEvent, and the table is the sole arbiter of what the phase becomes. A bookkeeping event on an already-Committed cycle returns None and the mutator becomes a clean no-op, so terminal regressions are rejected at the boundary instead of papered over per-call.
The crucial design move is what the state machine is not. It is not the source of truth. The durable JSON sidecar remains the crash-recovery log — every process can replay it when the controller is absent, stale, or restarting. The state machine is the pure transition authority: it answers "is this event legal, and if so what's the next phase?" The sidecar is then appended in one serialized job. Crash recovery seeds the machine from the sidecar on startup; sidecar beats stale memory.
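The decide-then-persist split can be sketched as follows. The Controller type, its journal Vec standing in for the sidecar, and the reduced three-phase table are assumptions made for illustration. agent-doc's real persistence path is a JSON file written in a serialized job, which this sketch does not attempt to model.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Phase { PreflightStarted, Committed, Abandoned }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Event { Committed, Abandoned }

/// Pure authority: answers "is this legal, and what comes next?" with no I/O.
pub fn transition(current: Phase, event: Event) -> Option<Phase> {
    match (current, event) {
        (Phase::PreflightStarted, Event::Committed) => Some(Phase::Committed),
        (Phase::PreflightStarted, Event::Abandoned) => Some(Phase::Abandoned),
        (Phase::Committed, Event::Committed) => Some(Phase::Committed),
        _ => None, // terminals are sticky
    }
}

/// Hypothetical controller: in-memory phase plus an append-only journal
/// that stands in for the durable sidecar.
pub struct Controller {
    phase: Phase,
    journal: Vec<Phase>,
}

impl Controller {
    /// Crash recovery: the journal wins over whatever memory held before.
    pub fn recover(journal: Vec<Phase>) -> Self {
        let phase = journal.last().copied().unwrap_or(Phase::PreflightStarted);
        Controller { phase, journal }
    }

    /// Decide first, persist second. Rejected events never reach the journal.
    pub fn apply(&mut self, event: Event) -> bool {
        match transition(self.phase, event) {
            None => false,
            Some(next) => {
                if next != self.phase {
                    self.journal.push(next); // only accepted, changed states are written
                    self.phase = next;
                }
                true
            }
        }
    }

    pub fn phase(&self) -> Phase {
        self.phase
    }

    pub fn journal(&self) -> &[Phase] {
        &self.journal
    }
}
```

Because apply consults the table before touching the journal, a late Abandoned on a committed cycle leaves the journal untouched, and a restart that replays the journal lands on the same phase the controller held.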
Splitting "is this legal?" from "persist this" is what makes the concurrency story tractable. The transition table is pure and trivially exhaustively testable — no filesystem, no async, no mocks. The persistence path stays simple — it only ever writes states the table already accepted.
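An exhaustive check over the table might look like the sketch below. It reproduces only the five events shown in the post; the elided timeout and bookkeeping arms are left out, not guessed. The rank helper, the PHASES and EVENTS arrays, and check_table are hypothetical names. StartPreflight is excluded from the monotonicity check because the table accepts it from every phase.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CyclePhase { PreflightStarted, ResponseCaptured, WriteApplied, Committed, Abandoned }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CycleEvent { StartPreflight, ResponseCaptured, WriteApplied, Committed, Abandoned }

pub const PHASES: [CyclePhase; 5] = [
    CyclePhase::PreflightStarted, CyclePhase::ResponseCaptured, CyclePhase::WriteApplied,
    CyclePhase::Committed, CyclePhase::Abandoned,
];
pub const EVENTS: [CycleEvent; 5] = [
    CycleEvent::StartPreflight, CycleEvent::ResponseCaptured, CycleEvent::WriteApplied,
    CycleEvent::Committed, CycleEvent::Abandoned,
];

/// The five arms shown in the post, reproduced as-is.
pub fn transition_phase(current: &CyclePhase, event: &CycleEvent) -> Option<CyclePhase> {
    use CyclePhase::*;
    match event {
        CycleEvent::StartPreflight => Some(CyclePhase::PreflightStarted),
        CycleEvent::ResponseCaptured => match current {
            PreflightStarted | ResponseCaptured => Some(ResponseCaptured),
            WriteApplied | Committed | Abandoned => None,
        },
        CycleEvent::WriteApplied => match current {
            PreflightStarted | ResponseCaptured | WriteApplied => Some(WriteApplied),
            Committed | Abandoned => None,
        },
        CycleEvent::Committed => match current {
            PreflightStarted | ResponseCaptured | WriteApplied | Committed => Some(Committed),
            Abandoned => None,
        },
        CycleEvent::Abandoned => match current {
            PreflightStarted | ResponseCaptured | WriteApplied => Some(Abandoned),
            Committed | Abandoned => None,
        },
    }
}

/// Hypothetical ordering used only by the check; both terminals share the top rank.
fn rank(phase: CyclePhase) -> u8 {
    match phase {
        CyclePhase::PreflightStarted => 0,
        CyclePhase::ResponseCaptured => 1,
        CyclePhase::WriteApplied => 2,
        CyclePhase::Committed | CyclePhase::Abandoned => 3,
    }
}

/// Walks all 25 (phase, event) pairs. Returns the number of accepted pairs and
/// every accepted edge that moves backward or leaves a terminal phase.
pub fn check_table() -> (usize, Vec<(CyclePhase, CycleEvent, CyclePhase)>) {
    let mut accepted = 0;
    let mut violations = Vec::new();
    for current in PHASES.iter().copied() {
        for event in EVENTS.iter().copied() {
            if let Some(next) = transition_phase(&current, &event) {
                accepted += 1;
                if event == CycleEvent::StartPreflight {
                    continue; // restarting a cycle is allowed from any phase
                }
                let backward = rank(next) < rank(current);
                let left_terminal = rank(current) == 3 && next != current;
                if backward || left_terminal {
                    violations.push((current, event, next));
                }
            }
        }
    }
    (accepted, violations)
}
```

The whole input space is 25 pairs, so a unit test can assert on every one of them without a filesystem or a runtime.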
Reactive observers, not callbacks
Because the phase lives in a lazily cell, observing a transition doesn't mean registering a callback on a custom event bus. It means reading the graph:
    // Eager: true exactly while the cycle is committed.
    let done = machine.state_is(&ctx, CyclePhase::Committed);

    // on-enter / on-exit from a single observer.
    machine.on_transition(&ctx, |old, new| {
        log::info!("cycle {old:?} -> {new:?}");
    });
state_is returns an eager SignalHandle<bool> — it always reflects the current phase without a manual read, so a UI or a gate checking "is this cycle still open?" never sees a stale value. on_transition is just an Effect over the same cell, fired with (old, new) so one observer can dispatch per-state enter/exit logic. Both inherit lazily's memo guard and glitch-freedom for free.
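The (old, new) dispatch idea can be shown with plain callbacks. The sketch below deliberately uses a Vec of boxed closures where lazily would use effects over a cell, so it has no reactive graph, no batching, and no glitch-freedom. Observed, set, and demo_log are hypothetical names, and state_is here is an ordinary getter rather than an eager signal.

```rust
use std::cell::RefCell;
use std::rc::Rc;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Phase { PreflightStarted, Committed }

/// Hypothetical observable phase holder (plain callbacks, not lazily effects).
pub struct Observed {
    phase: Phase,
    observers: Vec<Box<dyn Fn(Phase, Phase)>>,
}

impl Observed {
    pub fn new(phase: Phase) -> Self {
        Observed { phase, observers: Vec::new() }
    }

    pub fn on_transition(&mut self, f: impl Fn(Phase, Phase) + 'static) {
        self.observers.push(Box::new(f));
    }

    /// Observers fire only on a real change, mirroring the equality guard.
    pub fn set(&mut self, next: Phase) {
        if next == self.phase {
            return;
        }
        let old = std::mem::replace(&mut self.phase, next);
        for observer in &self.observers {
            observer(old, next);
        }
    }

    pub fn state_is(&self, target: Phase) -> bool {
        self.phase == target
    }
}

/// One observer handles both on-exit (old) and on-enter (new).
pub fn demo_log() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let sink = Rc::clone(&log);
    let mut machine = Observed::new(Phase::PreflightStarted);
    machine.on_transition(move |old, new| {
        sink.borrow_mut().push(format!("exit {old:?}"));
        sink.borrow_mut().push(format!("enter {new:?}"));
    });
    machine.set(Phase::Committed);
    machine.set(Phase::Committed); // duplicate: suppressed, no second log entry
    let entries = log.borrow().clone();
    entries
}
```

What the sketch cannot show is the part the reactive version gets for free: derived values and observers settling together in one consistent pass.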
This is the payoff of building a state machine out of reactive primitives rather than next to them: the same graph that invalidates a derived UI value when a Cell changes invalidates a derived gate when the machine transitions. One consistency model, not two.
Why not just an enum and a match?
You can absolutely write transition_phase as a free function over a bare enum and cover most of the value. The reason to reach for the primitive is everything around the table:
- Threading without rewriting. When the cycle state needed to move off one thread and into a shared controller, the path was changing StateMachine to ThreadSafeStateMachine. The transition function, the tests, and the observers didn't change: the Send + Sync bound is on the wrapper, not on your domain logic.
- Observers that stay correct under batching. A mark_committed that lands inside a ctx.batch(...) settles with every other invalidation in one consistent flush. An on_transition effect doesn't fire mid-storm on a half-updated graph.
- Idempotency from the cell guard. Terminals are sticky, and a duplicate Committed is accepted-but-suppressed because the PartialEq cell refuses to invalidate on an equal value. You don't hand-write "is this already the phase?" checks; the primitive does it.
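The "threading without rewriting" point can be illustrated with the standard library alone. SharedMachine below is a hypothetical Arc<Mutex<_>> wrapper, not ThreadSafeStateMachine, and the three-phase table is reduced for brevity. The transition function is a plain free function with no threading bounds, and only the wrapper knows about locks.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Phase { Open, Committed, Abandoned }

#[derive(Debug, Clone, Copy)]
pub enum Event { Commit, Abandon }

/// Domain logic: pure, with no knowledge of threads or locks.
pub fn transition(current: Phase, event: Event) -> Option<Phase> {
    match (current, event) {
        (Phase::Open, Event::Commit) => Some(Phase::Committed),
        (Phase::Open, Event::Abandon) => Some(Phase::Abandoned),
        (Phase::Committed, Event::Commit) => Some(Phase::Committed),
        _ => None, // terminals are sticky
    }
}

/// Hypothetical thread-safe wrapper: the lock lives here, not in `transition`.
#[derive(Clone)]
pub struct SharedMachine {
    phase: Arc<Mutex<Phase>>,
}

impl SharedMachine {
    pub fn new(initial: Phase) -> Self {
        SharedMachine { phase: Arc::new(Mutex::new(initial)) }
    }

    /// Check and update happen under one lock, so two senders cannot both
    /// decide against the same stale phase.
    pub fn send(&self, event: Event) -> bool {
        let mut guard = self.phase.lock().unwrap();
        match transition(*guard, event) {
            Some(next) => {
                *guard = next;
                true
            }
            None => false,
        }
    }

    pub fn state(&self) -> Phase {
        *self.phase.lock().unwrap()
    }
}

/// Races a Commit against an Abandon from two OS threads.
pub fn race() -> (Phase, usize) {
    let machine = SharedMachine::new(Phase::Open);
    let handles: Vec<_> = vec![Event::Commit, Event::Abandon]
        .into_iter()
        .map(|event| {
            let machine = machine.clone();
            thread::spawn(move || machine.send(event))
        })
        .collect();
    let accepted = handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .filter(|ok| *ok)
        .count();
    (machine.state(), accepted)
}
```

Whichever thread wins, exactly one event is accepted and the machine ends in a terminal phase. The loser is rejected by the same table a single-threaded caller would hit.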
Takeaway
The cycle-state bug wasn't really about a missing lock or a forgotten if. It was about a state machine that had no spine — legality was smeared across two dozen call sites, so it drifted. Giving it a spine meant:
- One pure transition table as the sole authority for what phase comes next.
- A durable journal (the sidecar) that survives crashes and seeds the machine on restart.
- Reactive observers so gates, UI, and logging read the same graph the machine writes.
And because that spine is a lazily ThreadSafeStateMachine, it's the same Cell → Slot → Signal → Effect model the rest of the system already uses. The state machine isn't a special case bolted onto the reactive library — it's what the reactive library looks like when you only need one cell and a pure function over it.
cargo add lazily
Source: lazily-rs on GitHub · docs · spec · crates.io · companion posts: lazily architecture · Slot → Cell → Signal
