Hacker News

EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

54 points by matt_d yesterday at 9:01 PM | 24 comments

Comments

orthoxerox yesterday at 10:34 PM

> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.

I would probably score about the same. Does this prove I also rely on training data memorization rather than genuine programming reasoning?

Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.

monster_truck today at 12:14 AM

I have encountered the opposite of this. All of the latest pro tier models are still fighting for their lives to use PowerShell correctly: really basic things like quotes, escaping, heredocs. It doesn't matter what I put in agents.md or instruct it to do. You just have to accept the token tax of it stomping on rakes until it figures it out itself and then keep using that session.

It's bad enough that I've considered writing some sort of cursed bash->posh translation layer

Yet it has no issues at all implementing and then writing slopjective-c 3.0

bwestergard yesterday at 9:28 PM

I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually any of the problems in Unlambda.

Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.

But the model that did the best, Qwen-235B, got virtually every problem wrong.
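For readers who have not met combinatory logic, here is a rough sketch of the core that Unlambda builds on, written in Python. It covers only the `s`, `k`, and `i` combinators and backtick application. The helper name `evaluate` is made up for illustration, and real Unlambda has more builtins (output, `d`, `c`, and so on) that are not modeled here.

```python
# Curried combinators: S f g x = f x (g x), K x y = x, I x = x.
S = lambda f: lambda g: lambda x: f(x)(g(x))
K = lambda x: lambda y: x
I = lambda x: x

BUILTINS = {"s": S, "k": K, "i": I}


def evaluate(src, pos=0):
    """Evaluate one prefix expression starting at src[pos].

    A backtick means "apply the next expression to the one after it".
    Returns (value, index of the first unread character).
    """
    ch = src[pos]
    if ch == "`":
        func, pos = evaluate(src, pos + 1)  # operator
        arg, pos = evaluate(src, pos)       # operand
        return func(arg), pos               # eager application
    return BUILTINS[ch], pos + 1


# ``skk behaves like the identity combinator.
identity, _ = evaluate("``skk")
print(identity(42))
```

Even this three-combinator core is enough to express any computable function, which is part of why it feels approachable to someone coming from Scheme, yet programs written in it become unreadable very quickly.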

groar yesterday at 11:52 PM

I guess if you tell codex to build a transpiler from a subset of python to brainfuck, then solve in that subset of python, it would work much better. Would that be cheating?
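To make the idea concrete, here is a minimal sketch of such a transpiler in Python, assuming the smallest imaginable "subset": programs made only of `print("literal")` statements with ASCII/Latin-1 text. The names `transpile` and `run_bf` are hypothetical, and the tiny interpreter (no `,` input support) is there only to check the generated brainfuck. Nothing here is taken from the benchmark itself.

```python
import ast


def transpile(source: str) -> str:
    """Translate print("literal") statements into brainfuck.

    Uses a single tape cell and emits +/- deltas between characters.
    """
    code, current = [], 0
    for stmt in ast.parse(source).body:
        call = stmt.value if isinstance(stmt, ast.Expr) else None
        ok = (
            isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id == "print"
            and len(call.args) == 1
            and not call.keywords
            and isinstance(call.args[0], ast.Constant)
            and isinstance(call.args[0].value, str)
        )
        if not ok:
            raise NotImplementedError('only print("literal") is supported')
        for ch in call.args[0].value + "\n":  # print appends a newline
            target = ord(ch)
            if target > 255:
                raise ValueError("character does not fit in one 8-bit cell")
            delta = target - current
            code.append(("+" if delta > 0 else "-") * abs(delta) + ".")
            current = target
    return "".join(code)


def run_bf(program: str, max_steps: int = 1_000_000) -> str:
    """Small brainfuck interpreter with 8-bit wrapping cells and no input."""
    stack, jump = [], {}
    for i, c in enumerate(program):  # pre-match brackets
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i
    tape, ptr, pc, out, steps = [0] * 30000, 0, 0, [], 0
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        c = program[pc]
        if c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == "[" and tape[ptr] == 0:
            pc = jump[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = jump[pc]
        pc += 1
    return "".join(out)


print(run_bf(transpile('print("Hi")')), end="")
```

Supporting variables, loops, and arithmetic is where the real work would be, but the shape stays the same: the model reasons in Python and a mechanical step emits the esoteric code.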

gverrilla today at 12:40 AM

"Genuine Reasoning"

__alexs yesterday at 9:28 PM

I had hoped we might finally be ushering in a bold new era of programming in Malbolge, but apparently that was too optimistic.

simianwords yesterday at 9:38 PM

I bet I can do better by allowing this: the LLM can pull documentation of the language from the web to understand how it works.

If the LLM has “skills” for that language, it will definitely increase accuracy.

rubyn00bie yesterday at 11:43 PM

I am not surprised by this, and am glad to see a test like this. One thing that keeps popping up for me when using LLMs is the lack of actual understanding. I write Elixir primarily and I can say without a doubt that none of the frontier models understand concurrency in OTP/BEAM. They look like they do, but they’ll often resort to weird code that doesn’t reflect how “actors” work. It’s an imitation of understanding that averages all the concurrency code the model has seen in training. The end result is a huge amount of noise when those averages aren’t enough: guarding against things that won’t happen, because they can’t… or actively introducing race conditions because they don’t understand how message passing works.

Current frontier models are really good at generating boilerplate, and really good at summarizing, but really lack the ability to actually comprehend and reason about what’s going on. I think this sort of test really highlights that, and it is a nice reminder that LLMs are only as good as their training data.

When an LLM or some other kind of model does start to score well on tests like this, I’d expect to see them get better at discovering new results, solutions, and approaches to questions/problems, compared to how they work now, where they generally only seem to uncover answers that have been obfuscated but are present.

deklesen yesterday at 9:23 PM

Mhh... my hunch is that part of this is that all Python keywords are, I assume, 1 token each. For those very weird languages, tokenization might make it harder to reason over the resulting tokens.

Would love to see how the benchmark results change if the esoteric languages are changed a bit so that they have 1-token keywords only.
