> Did the models from 2 years ago produce more bugs, fewer bugs or the same bugs as today's models?
Is anyone actually tracking that with a methodology that isn't vulnerable to fine-tuning? Specifically, many of the benchmarks have the problem that a model can be trained to pass the test, so a higher score isn't indicative of higher overall capability. I'm not being rhetorical here to make a point; I'm genuinely interested in whether anyone has devised a methodology that gives real confidence behind these claims.
(Aside: it's not a huge stretch to claim that they're getting better, but at this point the evidence seems mostly anecdotal, or based on methods with the problem I described above.)
I'm going by my own experience here. I occasionally test new models on kinds of problems I'm familiar with that aren't common programming challenges, such as arrow-based FRP abstractions written in C# rather than Haskell, and I've noticed considerable improvement in their ability to translate such abstractions idiomatically.
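For context, here's a rough sketch of the kind of abstraction I mean, in the Yampa style; the names and the shape of the exercise are illustrative, not taken from any benchmark. The test is whether a model can render something like this as idiomatic C# (generics and delegates) rather than a word-for-word transliteration.

```haskell
-- Illustrative only: a minimal signal-function arrow of the sort I ask
-- models to translate. Each step consumes one input sample and returns
-- an output sample plus the continuation to use for the next sample.
import Control.Arrow
import Control.Category
import Prelude hiding (id, (.))

newtype SF a b = SF { step :: a -> (b, SF a b) }

instance Category SF where
  id = SF (\a -> (a, id))
  SF g . SF f = SF $ \a ->
    let (b, f') = f a
        (c, g') = g b
    in  (c, g' . f')

instance Arrow SF where
  arr f = SF (\a -> (f a, arr f))
  first (SF f) = SF $ \(a, c) ->
    let (b, f') = f a
    in  ((b, c), first f')
```

Older models tended to transliterate this literally (tuples of tuples, no use of the host language's idioms); newer ones are noticeably better at mapping the same structure onto constructs natural to the target language.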