I've long believed those numbers were faked by Anthropic/OpenAI to serve as a form of advertisement. The estimates are impossible to verify and their ability to do "2 days of work" in 10 minutes will presumably make the user go "Wow, I just saved SO much time!" Plus, the unnecessary text eats up the users' tokens so it helps the companies on the backend, as well.
I agree with you that labs are benefiting from those outputs but I'm skeptical that labs are purposefully training the models to produce those outputs.
Raw pre-training data includes plenty of conversations between professional builders and some of those include estimates.
I believe the outputs are a training coincidence whose consequences happen to be opportune for the labs.
All the models have broken estimates. They're trained heavily on Jira and GitHub tasks and issues, which is why their estimates are human-scale.
Nah it’s all from the pretraining data
All models do it. It's their training. They didn't have "a person does this in a week but an LLM could in a minute" in their training data yet. They also don't have a concept of elapsed time unless you ask them how long something has taken.
I mean, in general I'd rather take slightly inflated estimates than the odd sprint poker stuff where other devs and PMs negotiate hours down. Before you know it you're also stuck fixing nitpicky reviewer comments on code that is already good enough, and you have to ship a release at like 7 PM, ofc also without enough tests or even enough manual checks, because people repeatedly act against their own self-interest and try to compress timelines, thinking that's somehow good for them.
At least with AI that actually does things more quickly, there is a bit more breathing room (introducing AI is easier than changing a given environment).
Aside from that, I wonder how much variety there is in practice, anywhere between "Oh yeah, I added that new button while we were in the meeting" and "The new button feature will be ready in Q3 according to the roadmap, once we have sign-off from all the stakeholders."
> the estimates
It doesn't estimate.
It generates tokens that read like estimates associated with the context in its training material.
What would you expect the generator to output instead?
I tend to be cynical about AI companies, but I'm guessing the bad estimates mostly come from a complete lack of actual data the model could use for that, so it's more or less a hallucination.