Hacker News

NitpickLawyer · yesterday at 6:29 PM · 16 replies

So this year SotA models have gotten gold at IMO, IOI, and ICPC, and beat 9/10 humans in that AtCoder contest that tested optimisation problems. Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.


Replies

tech_ken · yesterday at 6:41 PM

In 2015 SotA models blew past all expectations for engine performance in Go, but that didn't translate into LLM-based code agents for another ~7 years (and even now their performance is up for debate). I think what this shows is that humans are extremely bad at understanding what problems are "hard" for computers; or rather, we don't understand how to group tasks by difficulty in a generalizable way (success in a previously "hard" domain doesn't necessarily translate to performance in other domains of seemingly comparable difficulty). It's incredibly impressive how these models perform in these contests, and it certainly demonstrates that these tools have high potential in *specific areas*, but I think we might also need to accept that these are not necessarily good benchmarks for the tools' efficacy in less structured problem spaces.

Copying from a comment I made a few weeks ago:

> I dunno, I can see an argument that something like IMO word problems are categorically a different language space than a corpus of historiography. For one, even when expressed in English, math is still highly, highly structured. Definitions of terms are totally unambiguous, logical tautologies can be expressed using only a few tokens, etc. It's incredibly impressive that these rich structures can be learned by such a flexible model class, but it definitely seems closer (to me) to excelling at chess or another structured game, versus something as ambiguous as synthesis of historical narratives.

edit: oh small world! the cited comment was actually a response to you in that other thread :D

Ianjit · today at 7:34 AM

Historically there has been a gap between the performance of AI in test environments and its impact in the real world, and that makes people who have been through the cycle a few times cautious about extrapolating.

In 2016 Geoffrey Hinton said vision models would put radiologists out of business within 5-10 years. Ten years on, there is a shortage of radiologists in the US and AI hasn't disrupted the industry.

The DARPA Grand Challenge for autonomous vehicles was won in 2005; 20 years on, self-driving cars still have limited deployment.

The real world is more complex than computer scientists appreciate.

jug · yesterday at 7:06 PM

Even Sam Altman himself thinks we’re in a bubble, and he ought to have a good sense of the wind direction here.

I think the contradiction here can be reconciled by noting that these tests don't run under the hardware constraints the models would face doing this at scale. And herein lies a large part of the problem as far as I can tell: in late 2024, OpenAI realized they had to rethink GPT-5 since their first attempt became too costly to run. This delayed the model, and when it was finally released, it was not a revolutionary update but evolutionary at best compared to o3. Benchmarks published by OpenAI themselves indicated a 10% gain over o3, for God knows how much cash and well over a year of work. We certainly didn't have those problems in 2023 or even 2024.

DeepSeek has had to delay R2, and Mistral has had to delay Mistral 3 Large, teased back in May as weeks away. No word from either about what's going on. DeepSeek is said to be moving more to Huawei hardware, and that this is behind its delay, but I don't think it's entirely clear the delay has nothing to do with performance issues.

It would be more strange to _not_ have people speculate about stagnation or bubbles given these events and public statements.

Personally, I'm not sure if stagnation is the right word. We're seeing a lot of innovation in the toolsets and platforms surrounding LLMs, like Codex, Claude Code, etc. I think we'll see more in this regard, and that this will provide more value than core improvements to the LLMs themselves in 2026.

And as for the bubble, I think we are in one, but mostly because the market has been so incredibly hot. I see a bubble not because AI will fall apart but because there are too many products and services right now in a gold rush era. Companies will fail, not because AI suddenly starts failing us, but due to saturation.

JohnKemeny · yesterday at 6:56 PM

There is a clear difference between what OpenAI manages to do with GPT-5 and what I manage to do with GPT-5. The other day I asked it for code to generate a linear regression, and it gave back a figure of some points with a line through them.

If GPT-5, as claimed, is able to solve all the ICPC problems, please share the instructions for how I can reproduce that.

paxys · yesterday at 8:57 PM

My response, simply, is that performance in coding competitions such as ICPC is a very different skillset from what a regular software engineering job requires. GPT-5 still cannot make sense of my company's legacy codebase, even when asked to do the most basic tasks that a new grad out of college could figure out in a day or two. I recently asked it to fix a broken test (I had messed with it by changing one single assertion) and it declared "success" by deleting the entire test suite.

atleastoptimal · yesterday at 8:51 PM

People pattern-match with a very low-resolution view of the world ("web3/crypto/NFTs were a bubble because there was hype, AI is hyped, so there must be a bubble! I am very smart") and fail to reckon with the very real ways in which AI is fundamentally different.

Also, I think people do understand just how big of a deal AI is but don't want to accept it, or at least publicly admit it, because they are scared for a number of reasons, not least of them human irrelevance.

noosphr · yesterday at 11:23 PM

Two days ago I talked to someone in water management about data centers. One of the big players wanted to build a center that would consume as much water as a medium-sized town, in semi-arid bushland. A week before that, it was a substation whose transformers would take a decade to source. Before that, it was buying closed-down coal power plants.

I don't know if we're in a bubble for model capabilities, but we are definitely hitting the wall in terms of what the rest of the physical economy can provide.

You can't undo 50 years of deferred maintenance in three months.

mvieira38 · yesterday at 8:10 PM

Well, the supposedly PhD-level models are still pretty dumb by the time they get to consumers, so what gives?

nofriend · today at 3:47 AM

Where these competitions differ from real life is that evaluating a solution is much easier than generating a solution. We're at the point where AI can do a pretty good job of evaluating solutions, which is definitely an impressive step. We're also at the point where AI can generate candidate solutions to problems like these, which is also impressive. But the degree to which that translates to practical utility is questionable.
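
To make that asymmetry concrete, here's a minimal sketch of the generate-then-verify loop (all names hypothetical; `generate_candidate` stands in for an LLM sampling call and `passes_tests` for an automatic checker, e.g. running a candidate program against the contest's test cases):

    # Hedged sketch of generate-then-verify; both callables are
    # hypothetical stand-ins, not any real library's API.
    def solve(problem, generate_candidate, passes_tests, budget=16):
        for _ in range(budget):
            candidate = generate_candidate(problem)  # hard, unreliable step
            if passes_tests(problem, candidate):     # cheap, reliable step
                return candidate
        return None  # no candidate survived verification

In a contest setting, `passes_tests` is essentially free and exact; in most real-world work there is no such oracle, which is exactly where the practical-utility question comes in.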

The sibling commenter compared this to Go, but we could go back to comparing it with chess. Deep Blue didn't play chess the way a human did. It deployed massive amounts of compute to look at as many future board states as possible, in order to see which move would work out. People who said that a computer that could play chess as well as a human would be as smart as a human ended up eating crow. These modern AIs are also not playing these competitions the way a human does. Comparing their intelligence to that of a human is similarly fallacious.

sixtram · yesterday at 6:40 PM

The last time I asked an AI for a code review was last week. It added (hallucinated) some extra lines to the code and then flagged them as buggy. Yes, it beats humans at coding — great!

riku_iki · yesterday at 7:01 PM

> So this year SotA models have gotten gold at IMO, IOI, and ICPC

> Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.

This is a narrow niche with a large amount of training data (they all buy training data from LeetCode), and these results don't necessarily generalize to broader industrial tasks.

apwell23 · yesterday at 8:26 PM

This comment makes me think: what did previous winners of these competitions go on to do in their lives? Anything spectacular?

m3kw9 · today at 2:04 AM

The wall is how we need to throw trillions in hardware at "breakthroughs"; LLMs use the same algorithms from the last few years. We need a new algorithmic breakthrough, because otherwise buying hardware to increase intelligence isn't scalable.

chpatrick · yesterday at 9:14 PM

Don't worry, they're just stochastic parrots copying answers from Stack Overflow. ;)

reducesuffering · yesterday at 9:50 PM

People are having a tough time coping with what the near future holds for them. It is quite hard for a typical person to imagine how disruptive and exponential coming world events can be, as Covid showed.

KallDrexx · yesterday at 7:07 PM

It's important to look closely at the details of how these models actually do these things.

If you look at the details of how Google got gold at IMO, you'll see that AlphaGeometry relies on LLMs for only a very specific part of the whole system, and the LLM wasn't the core problem-solving component in play.

Most of AlphaGeometry is standard algorithms at work, solving geometry problems using known constraints. When the algorithmic system gets stuck, it reaches out to LLMs that were fine-tuned specifically for creating new geometric constraints. The LLM proposes new constraints and passes them back to the algorithmic parts to get them unstuck, and the cycle repeats.
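
To make that loop concrete, here's a minimal, hedged sketch (not DeepMind's actual code; `engine` and `propose_construction` are hypothetical names standing in for the symbolic deduction engine and the fine-tuned LLM):

    # Hedged sketch of the neuro-symbolic loop described above.
    # `engine` is a hypothetical symbolic deduction engine;
    # `propose_construction` stands in for the fine-tuned LLM that
    # suggests auxiliary constructions when deduction stalls.
    def prove(problem, engine, propose_construction, max_rounds=10):
        state = engine.initialize(problem)
        for _ in range(max_rounds):
            state = engine.deduce(state)   # exhaust known deduction rules
            if engine.is_solved(state):
                return engine.proof(state)
            # Stuck: ask the LLM for a new auxiliary construction
            # (e.g. a point or line), add it, and deduce again.
            state = engine.add_constraint(state, propose_construction(state))
        return None

The heavy lifting stays in the symbolic engine; the LLM only contributes candidate constructions when deduction stalls.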

Without more details, it's not clear whether this win came from the GPT-5 and Gemini models we use, or from specially fine-tuned models integrated with other non-LLM, non-ML systems.

Not being solved purely by an LLM isn't a knock on it, but in the current conversation around LLMs these results are heavily marketed as "LLMs did this all by themselves", which doesn't match a lot of the evidence I've personally seen.
