Hacker News

lukev · yesterday at 8:30 PM

There are two possible explanations for this behavior: the model nerf is real, or there's a perceptual/psychological shift.

However, benchmarks exist, and I haven't seen any empirical evidence that a given model version's performance on benchmarks generally grows worse over time.

Therefore, some combination of two things must be true:

1. The nerf is psychological, not actual.
2. The nerf is real, but in a way that is perceptible to humans and not to benchmarks.

#1 seems more plausible to me a priori, but if you aren't inclined to believe that, you should be positively intrigued by #2, since it points towards a powerful paradigm shift in how we think about the capabilities of LLMs in general... it would mean there is an "x-factor" that we're entirely unable to capture in any benchmark to date.


Replies

davidsainez · yesterday at 9:42 PM

There are well-documented cases of performance degradation: https://www.anthropic.com/engineering/a-postmortem-of-three-....

The real issue is that the end user currently has no reliable way to detect changes in performance, other than being willing to burn the cash and run their own benchmarks regularly.

It feels to me like a perfect storm. The combination of high inference costs, extreme competition, and the statistical nature of LLMs makes it very tempting for a provider to tune their infrastructure to squeeze more volume from their hardware. I don't mean to imply bad faith: things are moving at breakneck speed and people are trying anything to see what sticks. But the problem persists: people are building on systems that are in constant flux (for better or for worse).
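To make the "run your own benchmarks" option concrete, here is a minimal sketch of a private eval harness: a fixed task set with deterministic checks, run on a schedule against the same model and settings, with the pass rate appended to a log so a quiet regression shows up as a trend rather than a feeling. The `call_model` stub and the example tasks are placeholders, not any particular provider's API.

```python
# Minimal sketch of a private eval harness for spotting model drift over time.
# `call_model` is a placeholder; wire it to your actual provider, with
# temperature and other settings pinned so runs stay comparable.
import csv
import re
from datetime import datetime, timezone


def call_model(prompt: str) -> str:
    # Replace with a real API call (hosted or local model).
    raise NotImplementedError


# Fixed tasks with deterministic pass/fail checks. A real suite would be
# larger and kept private so it can't be trained against.
TASKS = [
    ("Reply with only the result of 17 * 23.",
     lambda out: "391" in out),
    ("Reply with only a Python regex that matches an ISO date (YYYY-MM-DD).",
     lambda out: re.search(out.strip().strip("`"), "2024-05-01") is not None),
]


def run_suite(log_path: str = "eval_log.csv") -> float:
    passed = 0
    for prompt, check in TASKS:
        try:
            passed += bool(check(call_model(prompt)))
        except Exception:
            pass  # an exception (bad output, API error) counts as a failure
    rate = passed / len(TASKS)
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), rate])
    return rate


if __name__ == "__main__":
    print(f"pass rate: {run_suite():.2%}")
```

Run it on a schedule per model version and plot the CSV; a drop that persists across several runs is a much stronger signal than vibes.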

blurbleblurble · yesterday at 9:04 PM

I'm pretty sure this isn't happening with the API versions as much as with the "pro plan" (loss-leader priced) routers. I imagine that there are others like me working on hard problems for long periods with the model setting pegged to high. Why wouldn't the companies throttle us?

It could even be that they just apply simple rate limits and that this degrades the effectiveness of the feedback loop between the person and the model. If I have to wait 20 minutes for GPT-5.1-codex-max medium to look at `git diff` and give a paltry and inaccurate summary (yes, this is where things are at for me right now, all this week), it's not going to be productive.

conception · yesterday at 10:49 PM

The only time I've seen benchmark nerfing was a drop in performance between the 2.5 March preview and the release.

imiric · yesterday at 9:07 PM

Or, 2b: the nerf is real, but benchmarks are gamed and models are trained to excel at them, yet fall flat in real-world situations.

zsoltkacsandi · yesterday at 9:06 PM

> The nerf is psychological, not actual

I tested this once: I gave the same task to a model right after release and again a couple of weeks later. On the first attempt it produced well-written code that worked beautifully, and I started to worry about the jobs of software engineers. The second attempt was a nightmare, like a butcher acting as a junior developer performing surgery on a horse.

Is this empirical evidence?

And this is not only my experience.

Calling this psychological is gaslighting.
