Hacker News

simonw · last Tuesday at 9:11 PM

This is pretty recent - the survey they ran (99 respondents) was August 18 to September 23, 2025, and the field observations (watching developers for 45 minutes, then a 30-minute interview; 13 participants) were August 1 to October 3.

The models were mostly GPT-5 and Claude Sonnet 4. The study was too early to catch the 5.x Codex or Claude 4.5 models (bar one mention of Sonnet 4.5).

This is notable because a lot of academic papers take 6-12 months to come out, by which time the LLM space has often moved on by an entire model generation.


Replies

utopiah · yesterday at 7:51 AM

> academic papers take 6-12 months to come out, by which time the LLM space has often moved on by an entire model generation.

This is a recurring argument which I don't understand. Doesn't it simply mean that whatever conclusions they reached were valid at the time? The research process is about approximating a better description of a phenomenon in order to understand it; it's not about providing a definitive answer. Being "an entire model generation" behind would matter if fundamental problems had been solved in the meantime, e.g. no more hallucinations, but if the changes are incremental then most likely the conclusions remain correct. Which fundamental change (I don't think labeling newer models as "better" is sufficient) do you believe invalidates their conclusions in this specific context?

ActionHank · yesterday at 5:17 AM

For what it’s worth, I know this is likely intended to read as "the new generation of models will somehow be better than any paper will be able to gauge", but that hasn’t been my experience.

Results are getting worse and less accurate; hell, I even had Claude drop some Chinese into a response out of the blue one day.

reactordev · last Tuesday at 11:07 PM

I knew in October the game had changed. Thanks for keeping us in the know.

bbor · yesterday at 3:34 AM

I’m glad someone else noticed the time frames — turns out the lead author here has published 28 distinct preprints in the past 60 days, almost all of which are marked as being officially published already/soon.

Certainly some scientists are just absurdly efficient, and all 28 papers involved teams, but that’s still a lot.

Personally speaking, this gives me second thoughts about their dedication to truly accurately measuring something as notoriously tricky as corporate SWE performance. Any number of cut corners in a novel & empirical study like this would be hard to notice from the final product, especially for casual readers… TBH, the clickbait title doesn’t help either!

I don’t have a specific critique on why 4 months is definitely too short to do it right tho. Just vibe-reviewing, I guess ;)

joenot443 · last Tuesday at 9:16 PM

Thanks Simon - always quick on the draw.

Off your intuition, do you think the same study with Codex 5.2 and Opus 4.5 would see even better results?

dheera · last Tuesday at 9:13 PM

> academic papers take 6-12 months to come out

It takes about 6 months to figure out how to get LaTeX to position figures where you want them, and then another 6 months to fight with reviewers
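
For anyone currently in that fight, a minimal sketch of the usual workaround, assuming a standard article class and a hypothetical results-plot image file: the [htbp] specifier gives LaTeX flexible placement options, and the float package's [H] pins a figure exactly where it appears in the source.

```latex
\documentclass{article}
\usepackage{graphicx} % for \includegraphics
\usepackage{float}    % provides the [H] "put it HERE, period" specifier

\begin{document}

\begin{figure}[htbp]  % polite request: here, top, bottom, or float page
  \centering
  \includegraphics[width=0.8\textwidth]{results-plot} % hypothetical filename
  \caption{A figure with a flexible placement specifier.}
\end{figure}

\begin{figure}[H]     % hard requirement: no floating at all
  \centering
  \includegraphics[width=0.8\textwidth]{results-plot} % hypothetical filename
  \caption{A figure pinned where it appears in the source.}
\end{figure}

\end{document}
```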

trq126154 · yesterday at 12:43 AM

[flagged]