Spot on. The persistence layer is a huge part of what makes the canvas work.
For failures, we handle it at multiple levels: first, standard retries and fallbacks to alternate models/providers. If that fails, the agents look for alternate approaches to accomplish the same task (e.g. falling back to web search instead of browser use).
For completeness, you can also manually re-run or edit individual blocks if they fail (though the agents may or may not consider this depending on where they are in their flow).