Hacker News

dghlsakjg · yesterday at 10:11 PM

GPT-4o is 23 months old.

One year ago, the models were only slightly less competent than today's. There were models writing entire apps 3 years ago. Competent function writing has basically been a given in every model since GPT-3.

Much of the progress in the past year has been around the harnesses, MCPs, and skills. The models themselves are not getting better exponentially; if anything, progress has slowed significantly since the 2023-2024 releases.


Replies

linsomniac · today at 1:07 AM

>One year ago, the models were only slightly less competent than today.

That has not been my experience. This weekend I pointed Claude Code + Opus 4.6 + effort=max at a PRD describing Docusign-like software: the exact same document I gave to Claude Code + Opus 4.5 + Ultrathink around 6 months ago.

The touch-ups I needed after it completed the implementation were around a tenth of what 4.5 required. It is a pretty startling difference.

majormajor · today at 2:20 AM

Yeah I've been able to get great Python functions out of everything since the ChatGPT 4 API in early-to-mid 2023.

It takes far less manual prompting now to get consistent output, work well with other languages, etc. But if you watch the "thinking" logs, it looks an awful lot like the "prompt engineering" you'd do by hand back then. And the output for tricky cases still sometimes goes sideways in obviously naive ways. The most telling thing in my experience is all the grepping, looping, and refining - it's not "I loaded all twenty of these files into context and have such a perfect understanding of every line's place in the big picture that I can suggest a perfect-the-first-time, maximally elegant modification." It's targeted and tactical. It's getting really good at its tactics for that stuff, though!

I can get more done now than a year ago because taking me out of the annoying part of that loop is very helpful.

But there's still a very curious gap: a tool that can quickly and easily recognize certain types of bugs if you ask it directly will also happily spit out those same bugs while writing the code. "Making up fake functions" doesn't make it to the user much anymore, but "not going to be robust in production but technically satisfies the prompt" still does, despite the model "knowing better" when you ask it about the code five seconds later.