> This task is not easily idempotent; it involves writing a ton of intermediate state and queries to determine that a step should not be repeated
The problem with durable execution is that your entire workflow still needs to be idempotent. Consider that each workflow is divided into a sequence of steps that amount to: 1) do work, 2) record the fact that work was done. If 2) never happens because the worker falls over, you must repeat 1). Therefore, for each step, "doing work" happens at least once. Given that steps compose, and each executes at least once, the entire workflow executes at least once. Since "at least once" is not "exactly once", everything you write in a durable execution engine must be idempotent.
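To make the crash window concrete, here's a minimal sketch of that step loop (the names are illustrative, not any particular engine's API):

```python
# If the process dies between do_work() and recording completion, the step is
# replayed on recovery, so do_work() executes at least once.
done_steps: set[str] = set()  # stands in for the engine's persisted step records


def run_step(step_id: str, do_work) -> None:
    if step_id in done_steps:   # query: has this step already been recorded?
        return                  # yes: skip it on replay
    do_work()                   # 1) do work (side effects happen here)
    # <-- a crash here means the step runs again later, so do_work()
    #     has to tolerate a second execution
    done_steps.add(step_id)     # 2) record the fact that work was done
```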
At that point, the only thing the durable execution engine is buying you is an optimization against re-running some slow steps. That may be valuable in itself. However, it doesn't change anything about good practices for writing async worker tasks.
> that your entire workflow still needs to be idempotent
If you just mean the workflow logic: as the article mentions, it has to be deterministic, which implies idempotency, but that's fine because workflow logic doesn't have side effects. The side-effecting functions invoked from a workflow (what Temporal dubs "activities") of course _should_ be idempotent so they can be retried upon failure, as is the case for all retryable code, but it isn't a requirement: these functions can be configured at the call site to have at-most-once semantics.
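For example, with Temporal's Python SDK the retry policy is set where the activity is invoked. A minimal sketch (the `charge_card` activity and workflow are made up; `maximum_attempts=1` means the activity is never retried, i.e. at-most-once):

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def charge_card(order_id: str) -> str:
    # Non-idempotent side effect, e.g. a call to a payment API.
    return f"charged {order_id}"


@workflow.defn
class CheckoutWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # Workflow code stays deterministic; the side effect lives in the activity.
        return await workflow.execute_activity(
            charge_card,
            order_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=1),  # at-most-once: no retries
        )
```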
In addition to many other things like observability, the value of durable execution is persisting advanced control flow (loops, try/catch, concurrent async ops, sleeping, etc.) and making all of it crash-proof, i.e. it resumes from where it left off on another machine.
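A rough sketch of what that looks like, again in the style of Temporal's Python SDK (names are invented): because the loop index and the timer are part of the persisted history, if the worker dies mid-sleep another worker replays the history and continues at the same iteration rather than starting over.

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def send_reminder(user_id: str) -> None:
    # Side effect (e.g. an email); ideally idempotent so a retry is harmless.
    ...


@workflow.defn
class ReminderWorkflow:
    @workflow.run
    async def run(self, user_id: str) -> None:
        for _ in range(7):                      # ordinary loop, but its progress is durable
            try:
                await workflow.execute_activity(
                    send_reminder,
                    user_id,
                    start_to_close_timeout=timedelta(seconds=10),
                )
            except Exception:
                pass                            # try/except replays deterministically too
            await asyncio.sleep(24 * 60 * 60)   # durable timer: survives worker restarts
```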
> The problem with durable execution is that your entire workflow still needs to be idempotent.
Yes, but what that means depends on your durability framework. For example, the one that my company makes can use the same database for both durability and application data, so updates to application data can be wrapped in the same database transaction as the durability update. This means "the work" isn't done unless "recording the work" is also done. It also means they can be undone together.
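A rough sketch of that idea in plain SQL (the tables and columns are hypothetical, using psycopg against Postgres): the application update and the step-completion record commit or roll back as one transaction.

```python
import psycopg  # assumption: Postgres via psycopg 3


def run_step_once(conn: psycopg.Connection, workflow_id: str, step: int, order_id: str) -> None:
    with conn.transaction():  # one transaction for doing the work and recording it
        with conn.cursor() as cur:
            # Skip if this step was already recorded as complete.
            cur.execute(
                "SELECT 1 FROM workflow_steps WHERE workflow_id = %s AND step = %s",
                (workflow_id, step),
            )
            if cur.fetchone():
                return
            # The application-data update ...
            cur.execute("UPDATE orders SET status = 'paid' WHERE id = %s", (order_id,))
            # ... and the durability record commit (or roll back) together.
            cur.execute(
                "INSERT INTO workflow_steps (workflow_id, step) VALUES (%s, %s)",
                (workflow_id, step),
            )
```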
I think a lot of the original Temporal/Cadence authors were motivated by working on event-driven systems with retries. Those systems exhibited complex failure scenarios that they could not reasonably account for without slapping on more supervisor systems. Durable execution gives you a consistent viewpoint from which to reason about failures.
I agree that determinism/idempotency and the complexity of these systems are a tough pill to swallow. They certainly need to be suited to the task.