For me the main issue with these systems is that durable execution is still seen as a special case of backend execution. I think the real value is in admitting that every POST/PUT should kick off a durable execution, but that doesn't seem to match the design, which treats these workflows as heavy and expensive and prices them accordingly.
What we need is an opinionated framework that doesn't let you do anything except durable workflows, so your junior devs stop doing two POSTs in a row and assuming things will be OK.
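To make the failure mode concrete, here's a minimal sketch of the pattern such a framework would forbid (the URLs and payloads are invented for illustration):

```python
import httpx

async def signup(user_id: str) -> None:
    async with httpx.AsyncClient() as client:
        await client.post("https://billing.example/customers",
                          json={"user": user_id})
        # If the process dies here, the customer exists but the
        # subscription never will, and nothing retries or repairs it.
        # A durable workflow would journal the first call and resume
        # from this exact point after a restart.
        await client.post("https://billing.example/subscriptions",
                          json={"user": user_id})
```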
Doesn't Google have a similar sort of system for stuff like this? I recall an old engineering blog post or something similar that detailed how they handled this at scale.
This would look like a handler taking an IO token that provides a memoizing `get_or_execute` function, plus utilities for calling these handlers, correct?
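Something like this, as a toy sketch. The journal is just a dict here; a real engine would persist every entry durably and key it by execution, but the shape of the token is the point:

```python
import asyncio
from typing import Any, Awaitable, Callable

class IO:
    """The token: the only capability a handler gets for doing I/O.
    Every effect is keyed; results are journaled so a replay returns
    the recorded value instead of re-executing the effect."""
    def __init__(self, journal: dict[str, Any]):
        self._journal = journal  # a real engine persists this durably

    async def get_or_execute(self, key: str,
                             effect: Callable[[], Awaitable[Any]]) -> Any:
        if key in self._journal:          # replay path: reuse the record
            return self._journal[key]
        result = await effect()           # first run: execute for real
        self._journal[key] = result       # journal before returning
        return result

# A handler written against the token.
async def handle_order(io: IO, order_id: str) -> str:
    charge_id = await io.get_or_execute(
        f"charge:{order_id}", lambda: fake_charge_card(order_id))
    await io.get_or_execute(
        f"email:{order_id}", lambda: fake_send_receipt(order_id, charge_id))
    return charge_id

async def fake_charge_card(order_id: str) -> str:
    return f"ch_{order_id}"

async def fake_send_receipt(order_id: str, charge_id: str) -> None:
    print(f"receipt for {order_id} / {charge_id}")

# Crash-and-retry: the second run replays the journal, so neither
# effect executes twice.
journal: dict[str, Any] = {}
asyncio.run(handle_order(IO(journal), "42"))
asyncio.run(handle_order(IO(journal), "42"))  # replay, no duplicate effects
```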
The "constraining functions to only be durable" idea is really interesting to me and would solve the main gotcha of the article.
It'd be an interesting experiment to take a memory snapshot after each step in a workflow, which an API like Firecracker's might support, though it would likely add even more overhead than current engines in terms of compute and storage. I think some durable execution engines have experimented with this kind of approach before, but I can't find a source now; perhaps someone has a link to one of these.
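For what it's worth, Firecracker does expose pause/snapshot/resume operations over its Unix-socket HTTP API, so the mechanics might look roughly like this (socket and snapshot paths are made up, and this ignores the cost concern entirely):

```python
import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket, which is how Firecracker serves its API."""
    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self._socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self._socket_path)

def _call(conn: UnixHTTPConnection, method: str, path: str, body: dict) -> None:
    conn.request(method, path, json.dumps(body),
                 {"Content-Type": "application/json"})
    resp = conn.getresponse()
    resp.read()
    if resp.status >= 300:
        raise RuntimeError(f"{method} {path} -> {resp.status}")

def snapshot_after_step(socket_path: str, step_id: str) -> None:
    """Pause the microVM, write a full snapshot named after the step, resume."""
    conn = UnixHTTPConnection(socket_path)
    # Firecracker requires the VM to be paused before snapshotting.
    _call(conn, "PATCH", "/vm", {"state": "Paused"})
    _call(conn, "PUT", "/snapshot/create", {
        "snapshot_type": "Full",
        "snapshot_path": f"/snapshots/{step_id}.vmstate",  # made-up paths
        "mem_file_path": f"/snapshots/{step_id}.mem",
    })
    _call(conn, "PATCH", "/vm", {"state": "Resumed"})
```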
There's also been some work, for example in the Temporal Python SDK, to replace the asyncio event loop so that regular calls like `sleep` become durable calls instead, reducing the risk to developers. I'm not sure how well this generalizes.
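Concretely, inside a Temporal Python workflow a plain `asyncio.sleep` is already durable, because the SDK runs workflow code on its own deterministic event loop (the workflow class below is just an example):

```python
import asyncio
from temporalio import workflow

@workflow.defn
class ReminderWorkflow:
    @workflow.run
    async def run(self, delay_seconds: int) -> str:
        # Temporal's custom event loop intercepts this call: it becomes
        # a durable server-side timer rather than an in-process sleep,
        # so a worker crash here doesn't lose the wait.
        await asyncio.sleep(delay_seconds)
        return "reminder fired"
```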