I suspect the harness (of which AGENTS and skills and similar things) should be abstracted for better overall performance. This article doesn't really go into detail about model preferences, but some other benchmarks show that different models have differnt preferences of how to use certain tools (probably related to their post training material), and it should really be managed invisibly to me as the end user.
Also curious how well LLMs can self-reflect in a loop, in terms of, here's how the previous iteration went, here's what didn't go well, here's feedback from the human, how do I modify the docs I use in a way that I know I'll do better next time.
I know you can somewhat hillclimb via DSPy but that's hard to generalize.
Claude self-reflects and updates based on feedback pretty well these days, but seems to lean on memory more than updating CLAUDE.md. I don't know how well it adheres to memory, but it seems to work sometimes. I don't like how the memory is stored outside of the project directory though.