I've recently been experimenting with a Prolog-based DSL [0] as the missing layer: start with a markdown document and "compile" it into the DSL, so that you get an "executable spec". Execution still involves LLMs, so it's not entirely deterministic, but it's probably more reliable than hoping your markdown instructions get interpreted the right way.
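
To make the idea a bit more concrete, here's a rough Python sketch. It's not the DSL from [0] and not Prolog, just an illustration of the shape: the spec becomes explicit steps, ordering and data flow are deterministic, and only the individual steps go through an LLM. `Step`, `run_spec`, `fake_llm` and `SPEC` are names I made up for this example, and the model call is stubbed out so it runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str    # key under which this step's result is stored
    prompt: str  # template; {placeholders} refer to inputs or earlier results


def run_spec(
    steps: List[Step],
    llm: Callable[[str], str],
    inputs: Dict[str, str],
) -> Dict[str, str]:
    """Run the steps in order. Control flow is fixed; only llm() is fuzzy."""
    state = dict(inputs)
    for step in steps:
        # Fill the template from what we have so far, then ask the model.
        state[step.name] = llm(step.prompt.format(**state))
    return state


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return f"<answer to: {prompt}>"


# A hand-written "compiled" spec; in practice this would be generated
# from the markdown document.
SPEC = [
    Step("summary", "Summarize: {doc}"),
    Step("title", "Write a title for: {summary}"),
]

result = run_spec(SPEC, fake_llm, {"doc": "hello"})
```

The point is that the second step always sees the output of the first, regardless of how the model behaves; with plain markdown instructions you'd be relying on the model to work that out for itself.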