mnky9800n • yesterday at 3:30 PM • 4 replies

This is why I made Zork bench. Zork, the text adventure game, is in the training data for LLMs. It’s also deterministic. Therefore it should be easy for an LLM to play and complete. Yet they don’t complete it. Understanding why is the goal of Zork bench.

https://github.com/mnky9800n/zork-bench

Replies

kqr • yesterday at 3:38 PM

I have worked on similar problems. See e.g. [1].

The LLMs I have tested have terrible world models and intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year-old with an amazing vocabulary.

[1]: https://entropicthoughts.com/updated-llm-benchmark

(more descriptions available in earlier evaluations referenced from there)

WarmWash • yesterday at 3:44 PM

The open models only give the SOTA models a run for their money on gameable benchmarks. On the semi-private ARC-AGI 2 sets they do absolutely awfully (<10%, while SOTA is at ~80%).

It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.

CamperBob2 • yesterday at 5:13 PM

Actually the Zorks weren't deterministic, especially Zork II. The Wizard could F you over pretty badly if he appeared at an inopportune time.
