You've missed the point. This isn't engineering, it's gambling.
You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time. Just like you can roll dice the exact same way on the exact same table and get two totally different results. People are doing their best to constrain that behavior by layering stuff on top, but the foundational tech is flawed (or at least ill-suited for this use case).
That's not to say that AI isn't helpful. It certainly is. But when you are basically begging your tools to please do what you want with magic incantations, we've lost the fucking plot somewhere.
> You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time
This is more of an implementation detail, done this way to get better results. A neural network with fixed weights (and deterministic floating-point operations) that returns a probability distribution, sampled with a fixed-seed pseudorandom generator and called recursively on its own output, will always return the same output for the same input.
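To make that concrete, here's a rough sketch in plain Python. Everything in it is made up for illustration: `VOCAB`, `WEIGHTS`, `next_token_probs`, and `generate` are hypothetical stand-ins for a real model and sampler, not anyone's actual API. It also assumes the arithmetic itself is deterministic, which isn't guaranteed on real inference stacks.

```python
import math
import random

# Toy vocabulary; "<eos>" ends generation. All names here are hypothetical.
VOCAB = ["a", "b", "c", "<eos>"]

# Fixed "weights": one score per (previous token, next token) pair.
# These never change between runs, like frozen model weights.
WEIGHTS = {
    prev: [math.sin(1.0 + i * 3 + j) for j in range(len(VOCAB))]
    for i, prev in enumerate(["<bos>"] + VOCAB[:-1])
}

def next_token_probs(prev):
    """Softmax over the fixed scores for the previous token."""
    scores = WEIGHTS[prev]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generate(seed, max_len=20):
    """Sample tokens one at a time, feeding each token back in as input."""
    rng = random.Random(seed)  # private, seeded PRNG: the only source of randomness
    prev, out = "<bos>", []
    for _ in range(max_len):
        tok = rng.choices(VOCAB, weights=next_token_probs(prev), k=1)[0]
        if tok == "<eos>":
            break
        out.append(tok)
        prev = tok
    return out

if __name__ == "__main__":
    # Same seed, same weights, same input: identical output on every run.
    print(generate(42) == generate(42))
```

Change the seed and you may get a different sequence; hold it fixed and the "dice roll" is fully reproducible. The variation people see in practice comes from how the sampler is configured and run, not from the network being inherently random.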
I think that's a pretty bold claim, that it'd be different every time. I'd think the output would converge on a small set of functionally equivalent designs, given sufficiently rigorous requirements.
And even a human engineer might not solve a problem the same way twice in a row, based on changes in recent inspirations or tech obsessions. What's the difference, as long as it passes review and does the job?