Sure, I'll bite. Task-level idempotency is not the problem that durable execution platforms are solving. The core problem is the complexity that arises when part of your async job becomes distributed: the two common cases are distributed runtime (compute) and distributed application state.
Let's just take the application state side. If your entire async job can be modeled as a single database transaction, you don't need a durable execution platform, you need a task queue with retries - our argument at Hatchet is that this covers many (perhaps most) async workloads, which is why the durable task queue is the primary entrypoint to Hatchet, and durable execution is only a feature for more complex workloads.
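To make that concrete, here's a minimal sketch of the single-transaction case (psycopg is an arbitrary driver choice and the schema is made up): because the whole job is one atomic commit, the queue's retry-after-crash is already safe.

```python
# Minimal sketch, not Hatchet's API: the entire job is one transaction,
# so a crash mid-task commits nothing and a retry re-runs it cleanly.
import psycopg  # arbitrary driver choice; any transactional DB works

def settle_order(conn: psycopg.Connection, order_id: str) -> None:
    with conn.transaction():  # one atomic unit of work
        conn.execute(
            "UPDATE orders SET status = 'paid' WHERE id = %s",
            (order_id,),
        )
        conn.execute(
            "INSERT INTO ledger (order_id, amount) "
            "SELECT id, total FROM orders WHERE id = %s",
            (order_id,),
        )
```

Either both writes commit or neither does, so a worker dying at any point leaves nothing to undo.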
But once you start to distribute your application state - for example, different teams building microservices which don't share the same database - you have a new set of problems. The most difficult edge case here is not the happy path with multiple successful writes, it's distributed rollbacks: a downstream step fails and you need to undo the upstream step in a different system. In these systems, you usually introduce an "orchestrator" task which catches failures and figures out how to unwind the other systems in the right order.
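Shaped as code, the orchestrator ends up looking something like the sketch below (the service clients and method names are hypothetical stand-ins for the separate systems, not any real SDK):

```python
class ServiceError(Exception):
    """Stands in for whatever a downstream call raises on failure."""

def place_order(order, payments, shipping, inventory):
    compensations = []  # undo actions for steps that already committed
    try:
        payment_id = payments.charge(order)      # write in system A
        compensations.append(lambda: payments.refund(payment_id))

        shipment_id = shipping.reserve(order)    # write in system B
        compensations.append(lambda: shipping.release(shipment_id))

        inventory.commit(order)                  # write in system C
    except ServiceError:
        # The distributed rollback: unwind upstream writes in the other
        # systems, newest first. Each undo must itself tolerate retries,
        # because this orchestrator can also crash mid-rollback.
        for undo in reversed(compensations):
            undo()
        raise
```

The hard part is everything that last comment glosses over: the orchestrator itself can die between a write and recording its compensation, or halfway through the rollback loop.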
It turns out these orchestrator functions are hard to build, because the number of failure scenarios is enormous. This is why durable execution platforms place constraints on the orchestrator function, like determinism, to reduce the failure scenarios to a set that's easy to reason about.
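To see why determinism is the constraint that gets picked, here's a toy replay loop (illustrative only - this isn't how Hatchet or any specific engine implements it): the platform records each step's result, and after a crash it re-runs the orchestrator from the top, substituting recorded results for steps that already completed.

```python
class Replayer:
    """Toy event-sourced replay: finished steps are served from the
    recorded history instead of being re-executed."""

    def __init__(self, history=None):
        self.history = history if history is not None else []
        self.cursor = 0

    def step(self, fn, *args):
        if self.cursor < len(self.history):
            result = self.history[self.cursor]   # recovery: replay, don't re-run
        else:
            result = fn(*args)                   # first run: execute and record
            self.history.append(result)
        self.cursor += 1
        return result
```

This only reconstructs the pre-crash state if the orchestrator makes the same calls in the same order on every run. If it branched on the current time or a random number, a replay could take a different path and the recorded results would be handed to the wrong steps - exactly the class of failure the determinism constraint removes.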
Distributed rollbacks aren't the only scenario that leads to durable execution - it turns out to be a useful and flexible model for program state in general. But this is a common one.