I see what you are saying, and perhaps we are zeroing in on the importance of ground truths (even if the ground truth is not code but PLANs or other docs).
For what you're saying to work, the LLM must adhere consistently to that initial prompt. Different LLMs, and even the same LLM on different runs, might adhere to different degrees, and how does that evolve from there? Meaning: at playback of prompt #33, will the ground truth be the same, and will the next result match the first attempt?
If this is a local LLM and we control all the context, then we can also control the LLM's seed and thus get consistent output, as in the sketch below. So I think your idea would work well there.
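As a minimal sketch of what I mean (assuming Hugging Face transformers, with "gpt2" as a stand-in for whatever local model you run), re-seeding before each generation makes a replayed prompt reproduce its original output:

```python
# Minimal sketch, assuming Hugging Face transformers; "gpt2" is a
# stand-in for whatever local model you actually run.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def replay(prompt: str, seed: int = 42) -> str:
    # Re-seeding before every generation makes sampling repeatable
    # (on the same hardware/software stack; GPU kernels can still vary).
    set_seed(seed)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=50,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Replaying the same prompt twice yields identical text.
assert replay("prompt #33, played back") == replay("prompt #33, played back")
```

The same idea applies to other local runtimes (llama.cpp and friends expose a seed option too); the point is just that with full control of model, context, and seed, playback becomes reproducible.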
I've not started keeping thinking traces, as I'm mostly interested in how humans are using this tech. But those could get pulled into this as well, helping other LLMs understand how a project arrived at a given state.