Reminded me of the pi filesystem (https://github.com/philipl/pifs): with enough digits of pi precalculated you might be able to build a decent compression program. The trick is how many digits you'd realistically need, and whether storing them ends up smaller or bigger than the trained LLM.
Looks like it beats everything in the Large Text Compression Benchmark for enwik8, but loses to several programs for enwik9. I wonder why that is.
Compression and intelligence reminded me of the Hutter Prize: https://www.hutter1.net/prize
I encountered it >10 years ago, and it felt novel back then that compression is related to intelligence and even AGI.
>> The ts_zip utility can compress (and hopefully decompress) text files
Hopefully :-)
When Jeff Dean gets stuck, he asks Bellard for help...
So he did finally beat his own leading program from 2019, nncp.
This is something I have been curious about: how an LLM achieves compression.
I would like to know what deviations show up in the output, as this almost feels like a game of telephone where each re-compression results in a loss of data that is then incorrectly reconstructed. Sort of like misremembering a story: as you tell it over time, the details change slightly.
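For what it's worth, my mental model of how the LLM does the compressing (I'm assuming ts_zip follows the usual model-plus-arithmetic-coding recipe; the predict() below is a hypothetical stand-in for the model, with made-up probabilities): each token costs roughly -log2 of the probability the model assigned to it, and an arithmetic coder can get within a fraction of a bit of that total. As long as the decompressor evaluates the exact same model deterministically, decoding reproduces the tokens bit for bit, so in principle there is no telephone-style drift.

    import math

    # Hypothetical stand-in for an LLM: predict(context) returns a next-token
    # distribution. ts_zip would query a real (deterministically evaluated) model here.
    def predict(context):
        if context and context[-1] == "the":
            return {"cat": 0.6, "dog": 0.3, "the": 0.1}
        return {"the": 0.5, "cat": 0.3, "dog": 0.2}

    text = ["the", "cat", "the", "dog"]
    total_bits = 0.0
    for i, tok in enumerate(text):
        p = predict(text[:i])[tok]
        total_bits += -math.log2(p)   # Shannon surprisal: the cost of this token in bits
    # An arithmetic coder can emit the whole sequence in roughly total_bits bits
    # (plus a small constant), and decoding with the same model reproduces it exactly.
    print(f"~{total_bits:.1f} bits for {len(text)} tokens")

The better the model predicts, the smaller the total; that is the whole trick.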
I love this because it gets to the heart of information theory. Shannon's foundational insight was that information is surprise. A random sequence is incompressible by definition. But what counts as surprise depends on context, and for text, we know a large amount of it is predictable slop. I suspect there's a lot of room left to push this style of compression. For example, maybe you could store an upfront summary that makes prediction more accurate. Or perhaps you could encode larger sequences, or use some kind of hierarchical encoding. But this is great.
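To make "surprise depends on context" concrete, a tiny back-of-the-envelope example (the probabilities are rough assumed figures, not measured): the same letter can cost five bits or almost nothing depending on what came before, and an LLM supplies exactly that kind of context for whole tokens.

    import math

    # Rough assumed figures for English text, just to illustrate the point.
    p_u_overall = 0.03   # unconditional frequency of "u"
    p_u_after_q = 0.98   # "u" is almost certain right after "q"
    print(-math.log2(p_u_overall), "bits for 'u' without context")   # ~5.1 bits
    print(-math.log2(p_u_after_q), "bits for 'u' after seeing 'q'")  # ~0.03 bits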
Another fun application of combining LLMs with arithmetic coding is steganography. Here's a project I worked on a while back which effectively uses the opposite technique of what's being done here, to construct a steganographic transformation: https://github.com/shawnz/textcoder
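In case it helps anyone picture "the opposite technique": below is a much-simplified toy sketch of the idea, not textcoder's actual code. It uses exact fractions and a fixed four-token distribution in place of an LLM. The secret bits pick a point in [0, 1), arithmetic-decoding that point against the model emits cover tokens, and recomposing the same token intervals recovers the bits.

    from fractions import Fraction

    # Toy fixed next-token distribution standing in for an LLM (made-up values).
    MODEL = [("the", Fraction(4, 10)), ("cat", Fraction(3, 10)),
             ("sat", Fraction(2, 10)), ("still", Fraction(1, 10))]

    def token_intervals():
        # Partition [0, 1) into one sub-interval per token, sized by its probability.
        lo = Fraction(0)
        for tok, p in MODEL:
            yield tok, lo, lo + p
            lo += p

    def hide(bits):
        # Treat the secret bits as a point in [0, 1) and arithmetic-decode it into tokens.
        n, m = len(bits), int(bits, 2)
        v = Fraction(2 * m + 1, 2 ** (n + 1))   # midpoint of the bits' dyadic interval
        lo, hi, out = Fraction(0), Fraction(1), []
        # Keep emitting tokens until the composed interval pins down all n bits.
        while not (Fraction(m, 2 ** n) <= lo and hi <= Fraction(m + 1, 2 ** n)):
            u = (v - lo) / (hi - lo)
            for tok, a, b in token_intervals():
                if a <= u < b:
                    lo, hi = lo + (hi - lo) * a, lo + (hi - lo) * b
                    out.append(tok)
                    break
        return " ".join(out)

    def reveal(text, n):
        # Re-compose the same intervals from the cover text to recover the n bits.
        lo, hi = Fraction(0), Fraction(1)
        for tok in text.split():
            a, b = next((a, b) for t, a, b in token_intervals() if t == tok)
            lo, hi = lo + (hi - lo) * a, lo + (hi - lo) * b
        return format(int(lo * 2 ** n), "0%db" % n)

    secret = "101101"
    cover = hide(secret)
    assert reveal(cover, len(secret)) == secret
    print(cover)

A real system additionally has to cope with finite precision, tokenization quirks, and making the output read like natural language, which is where the LLM earns its keep.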
"compressed size" does not seem to include the size of the model and the code to run it. According to the rules of Large Text Compression Benchmark, total size of those must be counted, otherwise a 0-byte "compressed" file with a decompressor containing the plaintext would win.
I propose the name tokables for the compressed data produced by this. A play on tokens and how wild it is.
So, barely 2 or 3 times better than xz.
Not really worth it.
Bellard finally working with his true colleague.
Current leader of the Large Text Compression Benchmark is NNCP (compression using neural networks), also by Fabrice Bellard:
https://bellard.org/nncp/
Also, nncp-2024-06-05.tar.gz is just 1180969 bytes, unlike ts_zip-2024-03-02.tar.gz (159228453 bytes, which is bigger than uncompressed enwik8).