Thinking about what goes on inside agents will probably not be part of software development for much longer.
Personally, I already treat LLMs and agents as black boxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.
To keep an ongoing sense of which agents perform best, I keep a log and calculate an Elo score showing which agents meet my demands best. That score is what's important to me, not so much how the agent achieves it.
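For anyone wondering what scoring agents from such a log could look like, here is a minimal sketch. It is not the exact setup described above. It assumes the log is a list of pairwise (winner, loser) results, and the agent names, the starting rating of 1500 and the K-factor of 32 are all placeholder assumptions.

```python
from collections import defaultdict

K = 32          # assumed K-factor: how strongly one comparison moves a rating
START = 1500.0  # assumed starting rating for an agent not seen before


def expected_score(r_a, r_b):
    """Standard Elo expectation that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_ratings(ratings, winner, loser, k=K):
    """Shift rating points from loser to winner based on how surprising the win was."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta


def rate(log):
    """Replay a log of (winner, loser) pairs in order and return the final ratings."""
    ratings = defaultdict(lambda: START)
    for winner, loser in log:
        update_ratings(ratings, winner, loser)
    return dict(ratings)


# Hypothetical log: each entry means "the first agent's result was preferred
# over the second agent's result for the same feature request".
log = [
    ("agent_a", "agent_b"),
    ("agent_a", "agent_c"),
    ("agent_c", "agent_b"),
]

ratings = rate(log)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

When more than two agents attempt the same request, one option is to log a pair for every agent the preferred result beat. That is a design choice in this sketch, not something stated in the comment above.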
What kind of projects/code do you have them work on?
Asking because I'd guess that approach would be OK for the kinds of front-end work that don't require much security or other validation.
But it sounds like it wouldn't be suitable for work in regulated industries or anything that needs to have extreme care taken.
?
This is an absolutely crazy, wasteful thing to do considering the actual cost of all that inference, and it's nothing to be proud of.