Hacker News

woeirua · 01/16/2026 · 5 replies

It's just amazing to me how fast the goalposts are moving. Four years ago, if you had told someone that an LLM would be able to one-shot either of those first two tasks, they would've said you're crazy. The tech is moving so fast. I slept on Opus 4.5 because GPT-5 was kind of an airball, and only started using it in the past few weeks. It's so good, way better than almost anything that's come before it. It can one-shot tasks that we never would've considered possible before.


Replies

skue · 01/16/2026

> Four years ago, if you had told someone that an LLM would be able to one-shot either of those first two tasks, they would've said you're crazy.

Four years ago, they would more likely have asked, "What in the world is an LLM?" ChatGPT is barely three years old.

enraged_camel · 01/16/2026

It literally saved my small startup six figures and months of work. I've written about it extensively and posted it (it's in my submissions).

ranyume · 01/16/2026

There are certain LLM phenomena that haven't changed since their introduction.

Madmallard · 01/16/2026

Idk, I was using ChatGPT 3.5 to do stuff and it was pretty helpful then.

utopiah · 01/16/2026

> The tech is moving so fast.

Well, that's exactly the problem: how can one say that?

The entire process of evaluating what "it" actually does has been a problem from the start. Input text, output text... OK, but what if the training data includes the evaluation? This seemed a ridiculous concern a few years ago, but then the scale went from some curated text datasets to most of the Web as text, to most of the Web as text including transcriptions of videos, to most of the Web plus some non-public databases, to all of that PLUS (and that's just cheating) the very tests that were supposed to be designed NOT to be present anywhere else.
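To make the contamination worry concrete, here is a minimal, hypothetical sketch of the kind of n-gram overlap heuristic used in dataset decontamination: if a benchmark item shares long verbatim word n-grams with the training corpus, a model may be reciting rather than solving. The function names, threshold, and toy corpus are all illustrative, not any lab's actual pipeline.

```python
# Sketch of n-gram-overlap contamination detection. A high score means the
# benchmark item appears (nearly) verbatim in the training data, so passing
# it says little about problem-solving ability. Names and n=8 are illustrative.

def ngrams(text, n=8):
    """Set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(benchmark_item, training_docs, n=8):
    """Fraction of the item's n-grams that occur verbatim in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)

# Toy example: one "training" document, one leaked item, one fresh item.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
leaked = "the quick brown fox jumps over the lazy dog near the river bank"
fresh = "compute the determinant of a random four by four integer matrix"
print(contamination_score(leaked, corpus))  # 1.0: every 8-gram is in the corpus
print(contamination_score(fresh, corpus))   # 0.0: no verbatim overlap
```

The catch the comment points at: this only detects *verbatim* leakage; paraphrased or translated copies of a test slip straight past it, which is one reason "it passed the benchmark" and "it solved the problem" keep coming apart.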

So again, that's the crux of the problem: WHAT does it actually do? Is it "just" search? Is it semantic search with search-and-replace, or that plus some evaluation that it runs?

Sure, the scaffolding gets bigger, the available dataset gets larger, and the available compute keeps increasing, but that STILL does not answer the fundamental question, namely what is being done. The assumption here is that because the output text does solve the question asked, "it" works, it "solved" the problem. The problem is that by design the entire setup has been made to look as plausible as possible. So it's not luck that the output initially appears realistic. It's not luck that it can thus pass some dedicated benchmark, but it is also NOT solving the problem.

So yes, sure, the "tech" is moving "so fast", but we still can't agree on what it does, we still have no good benchmarks, and we keep running into that jagged frontier https://www.hbs.edu/faculty/Pages/item.aspx?num=64700 that makes it so challenging to make any statement more meaningful than "moving so fast", which sounds like a marketing claim.
