tech_ken • yesterday at 6:41 PM

In 2015 SotA models blew past all expectations for engine performance in Go, but that didn't translate into LLM-based code agents for another ~7 years (and even now their performance is up for debate). I think what this shows is that humans are extremely bad at understanding what problems are "hard" for computers; or rather, we don't understand how to group tasks by difficulty in a generalizable way (success in a previously "hard" domain doesn't necessarily translate to performance in other domains of seemingly comparable difficulty). It's incredibly impressive how these models perform in these contests, and it certainly demonstrates that these tools have high potential in *specific areas*, but I think we might also need to accept that these contests are not necessarily good benchmarks for the tools' efficacy in less structured problem spaces.

Copying from a comment I made a few weeks ago:

> I dunno, I can see an argument that something like IMO word problems are categorically a different language space than a corpus of historiography. For one, even when expressed in English, math is still highly, highly structured. Definitions of terms are totally unambiguous, logical tautologies can be expressed using only a few tokens, etc. It's incredibly impressive that these rich structures can be learned by such a flexible model class, but it definitely seems closer (to me) to excelling at chess or another structured game than to something as ambiguous as the synthesis of historical narratives.

edit: oh small world! the cited comment was actually a response to you in that other thread :D

Replies

NitpickLawyer • yesterday at 6:52 PM

> edit: oh small world the cited comment was actually a response to you in that other thread :D

That's hilarious, we must have the same interests since we keep cross posting :D

The thing with the Go comparison is that AlphaGo was meant to solve Go and nothing else. It couldn't do chess with the same weights.

The current SotA LLMs are "unreasonably good" at a LOT of tasks, while being trained with a very "simple" objective: next-token prediction (NTP). That's the key difference here. We have these "stochastic parrots" + RL + compute that basically solve top-tier competitions in math, coding, and who knows what else... I think it's insanely good for what it is.
