Hacker News

orthoxerox · yesterday at 10:34 PM · 6 replies

> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.

I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.
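For context on why Brainfuck in particular is brutal by design: the entire language is eight single-character ops over a tape of byte cells. A minimal, unoptimized interpreter sketch in Python (my own illustration, not from the benchmark) shows how little surface area there is:

```python
def brainfuck(code: str, inp: str = "") -> str:
    """Interpret Brainfuck's eight ops: > < + - . , [ ]."""
    # Precompute matching bracket positions for the two loop ops.
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape = [0] * 30000          # classic 30k-cell byte tape
    ptr = pc = read = 0
    out = []
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(inp[read]) if read < len(inp) else 0
            read += 1
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]      # skip past the matching ]
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]      # jump back to the matching [
        pc += 1
    return "".join(out)

print(brainfuck("++++++[>++++++++++<-]>+++++."))  # prints "A" (6*10+5 = 65)
```

Every "easy" task has to be hand-compiled into those eight ops, which is arguably a different skill from everyday programming reasoning.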


Replies

onoesworkacct · yesterday at 11:37 PM

Unlike AI, you aren't able to regurgitate entire programs and patterns you've seen before.

AI's capacity for memorisation is unrivalled. I find it mind-blowing that you can download a tiny ~4 GB model and it will have vastly more general knowledge than an average human (the human is more likely to be wrong if you ask them trivia about e.g. the Spanish Civil War).

But the average human still has actual reasoning capabilities, which is still (I think?) a debated point with AI.

astrange · today at 12:23 AM

> I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

It doesn't even prove the models do that. The RLVR environments being mostly Python isn't "training data memorization". That's just the kind of dumb thing people say to sound savvy.

IsTom · yesterday at 11:19 PM

Just look at what kind of problems are in the easy task set (hello world, echo a line, count vowels, etc.). With the best model scoring ~10% of the total in Brainfuck, that's 10 tasks out of 20. You can google more solutions to these problems than that.
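For scale, "count vowels" in a mainstream language is a googleable near-one-liner (a sketch; the function name is my own):

```python
def count_vowels(s: str) -> int:
    # Count vowels (a, e, i, o, u), case-insensitive.
    return sum(1 for ch in s.lower() if ch in "aeiou")

print(count_vowels("Hello World"))  # 3
```

Solving the same task in an esolang means re-deriving this logic from that language's primitives, with no memorized snippet to fall back on.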

andai · yesterday at 11:21 PM

Yeah there seem to be two axes here.

Esolang vs mainstream paradigm.

Popular vs scarce training data.

So you'd want to control for training data (e.g. Brainfuck vs. Odin?).

And ideally you'd control by getting it down to 0, i.e. inventing new programming languages with various properties and testing the LLMs on those.

I think that would be a useful benchmark for other reasons. It would measure the LLMs' ability to "learn" on the spot. From what I understand, this remains an underdeveloped area of their intelligence. (And may not be solvable with current architectures.)
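A sketch of what that zero-training-data test could look like: define a made-up mini-language (everything below is hypothetical, invented for illustration), hand the LLM the interpreter as the spec plus a few examples, and ask it to write new programs in it.

```python
def run(program: str) -> list[int]:
    """Interpret a tiny invented postfix language: integer tokens push
    onto a stack; '+', '-', '*' pop two values and push the result."""
    stack: list[int] = []
    for tok in program.split():
        if tok in ("+", "-", "*"):
            b, a = stack.pop(), stack.pop()
            if tok == "+":
                stack.append(a + b)
            elif tok == "-":
                stack.append(a - b)
            else:
                stack.append(a * b)
        else:
            stack.append(int(tok))
    return stack

print(run("2 3 + 4 *"))  # [20]
```

Since the language is freshly invented, any success has to come from in-context learning rather than recall, which is exactly the ability such a benchmark would isolate.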

wavemode · yesterday at 10:46 PM

> I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

Setting aside whether this benchmark is meaningful or not, the argument you're making is faulty. There are indeed humans who can write complete programs in Brainfuck and these other esolangs. The fact that you personally can't is not logically relevant.

iloveoof · yesterday at 10:44 PM

Try MUMPS: widely used, but with little training data online. Probably less than some esolangs.
