OK, I'm not an expert here (you most likely are), but my 2 cents on your response: I would very much argue against making this magic, e.g.:
> take memory snapshots after each step in a workflow
Don't do this. Just give people explicit boundaries for where their snapshots occur and what is snapshotted, so they have control over both durability and performance. Make it clear that everything should be in the chain of command of the snapshotting framework: e.g. no file-local or global variables. This is already how people program web services, but somehow nobody leans into it.
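To make that concrete, here's a minimal sketch of what explicit boundaries could look like. This is a hypothetical API, not any real framework; `Checkpointer`, `fulfillOrder` and the stubbed helpers are made up for illustration:

```ts
// Hypothetical API: snapshots happen only at explicit checkpoint() calls,
// and only the state object you hand the framework is captured.
// No ambient globals, no file-local variables.
declare function chargeCard(orderId: string): Promise<void>;
declare function createShipment(orderId: string): Promise<void>;

interface Checkpointer<S> {
  // Persist `state` durably; after a crash, execution resumes from here.
  checkpoint(state: S): Promise<void>;
}

interface OrderState {
  orderId: string;
  charged: boolean;
  shipped: boolean;
}

async function fulfillOrder(cp: Checkpointer<OrderState>, state: OrderState) {
  await chargeCard(state.orderId);
  state.charged = true;
  await cp.checkpoint(state); // explicit durability boundary #1

  await createShipment(state.orderId);
  state.shipped = true;
  await cp.checkpoint(state); // explicit durability boundary #2
}
```

Everything the workflow needs lives in `state`; if it's not in there, it doesn't survive a restart, and that's visible right in the signature.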
The thing is, if you want people to understand durability but you also hide it from them, the framework will actually become much harder to understand and work with.
The real golden ticket, I think, is to build readable, intuitive abstractions around durability, not to hide it behind normal-looking code.
Please steal my startup.
FWIW, I think the memory snapshotting idea isn't going to work for most stacks, for a few different reasons. But to speak more broadly on API design for durable execution systems, I agree completely. One of the issues with Temporal, and with Hatchet in its current state, is that they abstract away concepts that are essential for the developer to understand while building the system, like what it means for a workflow to be durable. So you end up discovering a bunch of weird behaviors, like the "non-determinism error", when you start testing these systems without a good grasp of the fundamentals.
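For anyone who hasn't hit one: a "non-determinism error" typically surfaces on replay, when the commands your workflow code emits no longer line up with the recorded event history. A toy sketch of where it comes from (this is not Temporal's or Hatchet's actual machinery, just the shape of the problem):

```ts
// Toy replay engine: on replay, each command the workflow emits is compared
// against the recorded history. Any divergence, e.g. code that branches on
// Math.random() or that changed between runs, is fatal.
type Command = { step: string };

class Replayer {
  private cursor = 0;
  constructor(private history: Command[]) {}

  emit(step: string): void {
    const recorded = this.history[this.cursor++];
    if (recorded && recorded.step !== step) {
      throw new Error(
        `non-determinism error: history recorded "${recorded.step}" but code emitted "${step}"`
      );
    }
  }
}

// First run recorded ["charge"]; now the code takes a random branch on replay:
const replayer = new Replayer([{ step: "charge" }]);
replayer.emit(Math.random() < 0.5 ? "charge" : "refund"); // throws ~half the time
```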
We're investing heavily in separating out the primitives that come together in a DE system but are independently useful: tasks, idempotency keys and workflow state (i.e. event history). I'm not sure exactly what this API will look like in its end state, but each of those is valuable on its own. This is only true of the durable execution side of the Hatchet platform, though; I think our other primitives (task queues, streaming, concurrency, rate limiting, retries) are more widely used than our `durableTasks` feature because of this very problem you're describing.
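As an example of "independently useful": an idempotency key is handy even with no workflow engine in sight. A rough sketch, where `runOnce` is a made-up helper (not Hatchet's API) and a `Map` stands in for a durable store:

```ts
import { createHash } from "node:crypto";

declare function chargeCard(orderId: string): Promise<string>;

const seen = new Map<string, unknown>(); // stand-in for a durable store

async function runOnce<T>(key: string, fn: () => Promise<T>): Promise<T> {
  if (seen.has(key)) return seen.get(key) as T; // retried call: reuse result
  // Note: a real store would need an atomic insert to be concurrency-safe.
  const result = await fn();
  seen.set(key, result);
  return result;
}

// Deriving the key from the request payload means client retries dedupe too.
const key = createHash("sha256").update("order-123:charge").digest("hex");
await runOnce(key, () => chargeCard("order-123"));
```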
Just to continue the idea: you wouldn't be constraining or tagging functions; you would relinquish control to a system that closely guards how you produce side effects. E.g. doing a raw HTTP request from a task is prohibited, not intercepted.
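Something like the following, where the task's only route to the outside world is a capability object the runtime injects (again, a hypothetical API, not an existing framework):

```ts
// Hypothetical API: a task never gets ambient I/O. Side effects go through
// an Effects capability the runtime injects, so the runtime can journal,
// retry and dedupe every call.
interface Effects {
  http(url: string, init?: { method?: string; body?: string }): Promise<string>;
}

type Task<I, O> = (input: I, fx: Effects) => Promise<O>;

const notify: Task<{ userId: string }, void> = async (input, fx) => {
  // OK: routed through the runtime, which records it in the event history.
  await fx.http(`https://api.example.com/notify/${input.userId}`);
  // A raw fetch() here is what "prohibited, not intercepted" means: in a
  // sandboxed runtime (isolate/WASM), fetch simply doesn't exist in here.
};
```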
> The thing is, if you want people to understand durability but you also hide it from them, the framework will actually become much harder to understand and work with.
> The real golden ticket, I think, is to build readable, intuitive abstractions around durability, not to hide it behind normal-looking code.
It's a tradeoff. People tend to want to use languages they're familiar with, even at the cost of being constrained within them. A naive DSL would not be expressive enough to give you the Turing completeness you need, so effectively you'd need a new language/runtime. It's far easier to constrain an existing language than to write a new one, of course.
Some languages/runtimes are easier to apply durable/deterministic constraints to (e.g. WASM, which is deterministic by design, and JS, whose tiny stdlib just needs a few things like time and rand replaced), but they still don't take the ideal step you mention: putting the durable primitives and their benefits/constraints clearly in front of the dev.
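On the JS point, a minimal sketch of what "replace time and rand" amounts to: swap the two ambient sources of non-determinism for seeded/controlled versions, and route everything else (I/O) through the runtime. The `makeDeterministic` helper here is made up for illustration:

```ts
// Monkey-patch the only two non-deterministic stdlib surfaces a pure JS
// computation touches: Math.random() and Date.now().
function makeDeterministic(seed: number, epochMs: number) {
  let s = seed >>> 0;
  // Mulberry32 PRNG: the same seed yields the same sequence on every replay.
  Math.random = () => {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
  // The clock only advances when the runtime says so (e.g. per replayed event).
  let now = epochMs;
  Date.now = () => now;
  return { advanceClock: (ms: number) => { now += ms; } };
}

// Replays seeded the same way observe identical random/time values.
const clock = makeDeterministic(42, 1_700_000_000_000);
```

WASM gives you this at the instruction level for free; in JS you're patching globals, which is workable precisely because the stdlib surface is so small.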