meisel • yesterday at 9:43 PM • 1 reply

Looks like it beats everything in the large text compression benchmark for enwik8, but loses to several programs for enwik9. I wonder why that is.

Replies

AnotherGoodName • yesterday at 10:37 PM

It's actually not the best at enwik8 or 9.

The results at https://www.mattmahoney.net/dc/text.html explicitly add the size of the compressor itself to the result. Note the "enwik9+prog" column. That's what it's ranked on.

The reason for this is that it's trivial to create a compressor that 'compresses' a file to 0 bytes: just ship an executable with a copy of enwik9 built in that writes it out given any input. So what we always measure is effectively the Kolmogorov complexity: the size of the data plus the program that together produce the result we want.
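To make the 0-byte trick concrete, here is a minimal sketch. The names (`CORPUS`, `cheat_compress`, `cheat_decompress`, `total_size`) are made up for illustration, and a short repeated string stands in for enwik9:

```python
# Hypothetical stand-in for enwik9; the real file is about 1 GB.
CORPUS = b"the quick brown fox jumps over the lazy dog " * 1000


def cheat_compress(data: bytes) -> bytes:
    """'Compress' by ignoring the input: the output is always empty."""
    return b""


def cheat_decompress(blob: bytes) -> bytes:
    """Reproduce the corpus from the copy embedded in the program."""
    return CORPUS


def total_size(compressed: bytes, program_size: int) -> int:
    """The quantity that matters: compressed output plus decompressor."""
    return len(compressed) + program_size


compressed = cheat_compress(CORPUS)
# The program has to carry the corpus, so it is at least len(CORPUS) bytes.
program_size = len(CORPUS)

print("compressed bytes:", len(compressed))
print("compressed + program bytes:", total_size(compressed, program_size))
```

The compressed output is empty, but the total is no smaller than the original, which is why counting data plus program closes the loophole.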

So those results add in the compressor size. The programs there generally have no built-in dictionary or, in the case of LLM-based compressors, no pre-trained data. They effectively build the model as they process the data, not compressing much at all at the start and slowly compressing better and better as they go. This is why these programs do better and better with larger data sets. They start with zero knowledge. After a GB or so they have very good knowledge of the corpus of human language.
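A toy version of "start with zero knowledge and learn as you go", assuming a simple adaptive order-0 byte model (far cruder than anything on that list, and not how any particular entry works). It reports the ideal code length per byte, which is what an arithmetic coder driven by the model would approach:

```python
import math


def adaptive_cost_bits(data: bytes) -> list[float]:
    """Ideal cost in bits of each byte under a model that learns as it goes."""
    counts = [1] * 256  # no prior knowledge: every byte value equally likely
    total = 256
    costs = []
    for b in data:
        p = counts[b] / total          # model's prediction before seeing b
        costs.append(-math.log2(p))    # cost of coding b with that prediction
        counts[b] += 1                 # update the model after coding
        total += 1
    return costs


# Hypothetical sample text standing in for a real corpus.
sample = b"the quick brown fox jumps over the lazy dog " * 200
costs = adaptive_cost_bits(sample)

chunk = len(sample) // 4
first = sum(costs[:chunk]) / chunk
last = sum(costs[-chunk:]) / chunk
print(f"first quarter: {first:.2f} bits/byte")
print(f"last quarter:  {last:.2f} bits/byte")
```

The very first byte costs a full 8 bits because the model knows nothing yet; the per-byte cost falls as the counts fill in. A decompressor can rebuild the same counts from the bytes it has already decoded, so no model needs to be shipped.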

This program, however, is pre-trained and ships with a model. It's 150MB in size! This means it has 150MB of extra starting knowledge over the models in that list. The top models in that list are the better compressors; they'd quickly outlearn and overtake this compressor, but they just don't have that head start.

Of course, to measure fairly, this should be listed with that 150MB program size added to its results when doing a comparison.
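A rough sketch of that comparison. Only the 150MB program size comes from the above; the compressed sizes and the other program size are invented purely to show the arithmetic and are not real benchmark results:

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    compressed_bytes: int  # size of the compressed output
    program_bytes: int     # size of the decompressor, including any model

    @property
    def total_bytes(self) -> int:
        return self.compressed_bytes + self.program_bytes


# Hypothetical entries: all sizes except the 150MB model are made up.
entries = [
    Entry("pretrained (hypothetical)", 100_000_000, 150_000_000),
    Entry("from-scratch (hypothetical)", 110_000_000, 500_000),
]

# Rank on data + program, as the "enwik9+prog" column does.
ranked = sorted(entries, key=lambda e: e.total_bytes)
for e in ranked:
    print(f"{e.name}: {e.total_bytes:,} bytes")
```

With these made-up numbers the pre-trained entry has the smaller compressed output but the larger total once its model is counted.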
