realusername • today at 3:14 AM • 2 replies

There's also a lot of benchmark trickery going on; it's becoming harder to see how much the latest models have really improved.

The top models also seem to have inconsistent performance depending on the time of day and how far we are from the next release.

Replies

bonesss • today at 3:42 AM

I’m an LLM fan, but from an engineering perspective the idea of building atop services that palpably fluctuate in capacity, performance, and capability is nutty.

Even with minor automation I feel like I can watch OpenAI and Anthropic engineers fiddling in real time. Tuesday’s behaviour changes by Thursday; 10 AM’s production isn’t possible at 11:30 AM. Nutty.

Barbing • today at 3:44 AM

Interesting, Claude might be doing better since I last checked:

https://marginlab.ai/trackers/claude-code-historical-perform...

There were at least a couple of these degradation trackers.
