LLMs are extremely capable at problem solving, presumably because a lot of it can be learned autonomously. But can you somehow account for things like long-term maintainability and code quality (whatever that means), or do you always have to rely on either existing high-quality codebases (pre-training) or human-curated datasets? Since you can't really quantify these properties (as opposed to: the problem is either solved or not), does this restrict autonomous improvement in this area? Are there benchmarks that consider this? Could Claude Mythos create an ultra-high-quality version of Claude Code, or would it still produce something similar to earlier models, which are already more than sufficient at individual problem solving?
> Since you can't really quantify these properties (as opposed to: the problem is either solved or not)
I think we could quantify these properties, just not entirely.
One could take a long-term project and analyze which approaches most often ended in a refactor. In the same way, we could quantify which designs most often resulted in vulnerabilities (that we know of).
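A crude version of this is easy to sketch: classify a project's commit messages by keyword to estimate how often work ended in a refactor or a security fix. The commit messages and keyword patterns below are made up for illustration; a real study would pull messages from `git log` and use curated labels or linked issue data rather than regexes.

```python
import re
from collections import Counter

# Hypothetical sample of commit messages (in practice, pulled from `git log`).
COMMITS = [
    "refactor: split UserManager god object into services",
    "fix: SQL injection in search endpoint",
    "feat: add CSV export",
    "refactor auth module to remove circular imports",
    "fix buffer overflow in parser",
]

# Crude keyword heuristics; counts are only as good as these patterns.
PATTERNS = {
    "refactor": re.compile(r"\brefactor", re.I),
    "security_fix": re.compile(r"injection|overflow|CVE-\d+", re.I),
}

def classify(messages):
    """Count how many messages match each label (a message may match several)."""
    counts = Counter()
    for msg in messages:
        for label, pat in PATTERNS.items():
            if pat.search(msg):
                counts[label] += 1
    return counts

print(classify(COMMITS))  # Counter({'refactor': 2, 'security_fix': 2})
```

Bucketing these counts per module or per design decision would give the kind of (noisy) signal the parent is asking for.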
It wouldn't even be impossible to create artificial scenarios: projects with a growing number of requirements, where we measure how many code changes each new requirement forces and how many bugs result from that. Again, quantifiable to some extent, and probably better than datasets that lack anything like this.
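One measurable proxy for "how many code changes each requirement forces" is line churn between successive snapshots of the code. The toy snapshots below are invented to show the shape of the metric; a brittle design would show churn growing with each requirement, a maintainable one would stay flat.

```python
import difflib

# Hypothetical snapshots of one module as requirements are added:
# v1: base price, v2: + discount requirement, v3: + tax requirement.
v1 = ["def price(qty):", "    return qty * 10"]
v2 = ["def price(qty, discount=0):", "    base = qty * 10", "    return base * (1 - discount)"]
v3 = [
    "def price(qty, discount=0, tax=0.2):",
    "    base = qty * 10",
    "    net = base * (1 - discount)",
    "    return net * (1 + tax)",
]

def churn(old, new):
    """Count added plus removed lines between two versions of a file."""
    diff = difflib.ndiff(old, new)
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))

versions = [v1, v2, v3]
per_requirement = [churn(a, b) for a, b in zip(versions, versions[1:])]
print(per_requirement)  # churn caused by each added requirement
```

Pair that with a bug count per step (e.g. failed tests introduced by each change) and you get exactly the "quantifiable to some extent" signal described above.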
There probably isn't a public dataset for this, but building one wouldn't be impossible.