Yes, I think this is basically an instance of the "emergent abilities mirage." https://arxiv.org/abs/2304.15004
If you measure completion rate on a task where a single mistake causes failure, you won't see noticeable improvement on that metric until nearly all potential sources of error are eliminated. Once they are, performance jumps suddenly.
That's fine if you just want to know whether the current state is good enough on your task of choice, but if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
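To make the effect concrete, here is a minimal sketch (my own illustration, not from the paper): assume a task of n independent steps, each succeeding with probability p, where any single failure ruins the run. The all-or-nothing completion rate is p**n, so smooth per-step improvement looks like a sudden jump on the aggregate metric.

```python
# Illustrative model: all-or-nothing task success.
# Assumption: n independent error-prone steps, each with success probability p;
# the task completes only if every step succeeds.

def completion_rate(p: float, n: int) -> float:
    """Probability that all n steps succeed: p ** n."""
    return p ** n

# Steady per-step gains produce a sharp transition in the aggregate metric.
for p in [0.90, 0.95, 0.99, 0.999]:
    print(f"per-step p={p}: 50-step completion rate = {completion_rate(p, 50):.3f}")
```

Tracking p directly (the per-step error rate) shows the gradual trend; tracking only p**n shows a cliff.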
That's how the public perceives it, though.
It's useless and never gets better, until it suddenly and unexpectedly gets good enough.
This is fascinating. The LLM community is rediscovering PSP/TSP rules that were laid out more than twenty years ago.
What the LLM community misses is that in PSP/TSP, it is the individual software developer who is responsible for figuring out what they need to look after.
What I see instead is LLM users trying to harness LLMs around what they perceive as errors. It's not that the LLMs are learning; it's that their users are trying to rein them in with prompts.