Reminded me of the pi filesystem (https://github.com/philipl/pifs): with enough digits of pi precalculated you might be able to build a decent compression program. The trick is how many digits you'd realistically need, and whether storing them ends up smaller or bigger than the trained LLM.
Looks like it beats everything in the Large Text Compression Benchmark for enwik8, but loses to several programs for enwik9. I wonder why that is.
Compression and intelligence reminded me of the Hutter Prize: https://www.hutter1.net/prize
I encountered it >10 years ago, and it felt novel back then that compression is related to intelligence and even AGI.
>> The ts_zip utility can compress (and hopefully decompress) text files
Hopefully :-)
When Jeff Dean gets stuck, he asks Bellard for help...
So he did finally beat his own leading program from 2019, nncp.
This is something I have been curious about: how an LLM achieves compression.
I would like to know what deviations show up in the output, as this almost feels like a game of telephone where each re-compression results in a loss of data that is then incorrectly reconstructed. Sort of like misremembering a story: as you tell it over time, the details change slightly.
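For what it's worth, my mental model of how the LLM does the compressing (I'm assuming ts_zip follows the usual model-plus-arithmetic-coding recipe; the predict() below is a hypothetical stand-in for the model, with made-up probabilities): each token costs roughly -log2 of the probability the model assigned to it, and an arithmetic coder can get within a fraction of a bit of that total. As long as the decompressor evaluates the exact same model deterministically, decoding reproduces the tokens bit for bit, so in principle there is no telephone-style drift.

    import math

    # Hypothetical stand-in for an LLM: predict(context) returns a next-token
    # distribution. ts_zip would query a real (deterministically evaluated) model here.
    def predict(context):
        if context and context[-1] == "the":
            return {"cat": 0.6, "dog": 0.3, "the": 0.1}
        return {"the": 0.5, "cat": 0.3, "dog": 0.2}

    text = ["the", "cat", "the", "dog"]
    total_bits = 0.0
    for i, tok in enumerate(text):
        p = predict(text[:i])[tok]
        total_bits += -math.log2(p)   # Shannon surprisal: the cost of this token in bits
    # An arithmetic coder can emit the whole sequence in roughly total_bits bits
    # (plus a small constant), and decoding with the same model reproduces it exactly.
    print(f"~{total_bits:.1f} bits for {len(text)} tokens")

The better the model predicts, the smaller the total; that is the whole trick.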
I love this because it gets to the heart of information theory. Shannon's foundational insight was that information is surprise. A random sequence is incompressible by definition. But what counts as surprise depends on context, and for text, we know a large amount of it is predictable slop. I suspect there's a lot of room left to push this style of compression. For example, maybe you could store an upfront summary that makes prediction more accurate. Or perhaps you could encode larger sequences, or use some kind of hierarchical encoding. But this is great.
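To make "surprise depends on context" concrete, a tiny back-of-the-envelope example (the probabilities are rough assumed figures, not measured): the same letter can cost five bits or almost nothing depending on what came before, and an LLM supplies exactly that kind of context for whole tokens.

    import math

    # Rough assumed figures for English text, just to illustrate the point.
    p_u_overall = 0.03   # unconditional frequency of "u"
    p_u_after_q = 0.98   # "u" is almost certain right after "q"
    print(-math.log2(p_u_overall), "bits for 'u' without context")   # ~5.1 bits
    print(-math.log2(p_u_after_q), "bits for 'u' after seeing 'q'")  # ~0.03 bits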
Another fun application of combining LLMs with arithmetic coding is steganography. Here's a project I worked on a while back which effectively uses the opposite technique of what's being done here, to construct a steganographic transformation: https://github.com/shawnz/textcoder
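In case it helps anyone picture "the opposite technique": below is a much-simplified toy sketch of the idea, not textcoder's actual code. It uses exact fractions and a fixed four-token distribution in place of an LLM. The secret bits pick a point in [0, 1), arithmetic-decoding that point against the model emits cover tokens, and recomposing the same token intervals recovers the bits.

    from fractions import Fraction

    # Toy fixed next-token distribution standing in for an LLM (made-up values).
    MODEL = [("the", Fraction(4, 10)), ("cat", Fraction(3, 10)),
             ("sat", Fraction(2, 10)), ("still", Fraction(1, 10))]

    def token_intervals():
        # Partition [0, 1) into one sub-interval per token, sized by its probability.
        lo = Fraction(0)
        for tok, p in MODEL:
            yield tok, lo, lo + p
            lo += p

    def hide(bits):
        # Treat the secret bits as a point in [0, 1) and arithmetic-decode it into tokens.
        n, m = len(bits), int(bits, 2)
        v = Fraction(2 * m + 1, 2 ** (n + 1))   # midpoint of the bits' dyadic interval
        lo, hi, out = Fraction(0), Fraction(1), []
        # Keep emitting tokens until the composed interval pins down all n bits.
        while not (Fraction(m, 2 ** n) <= lo and hi <= Fraction(m + 1, 2 ** n)):
            u = (v - lo) / (hi - lo)
            for tok, a, b in token_intervals():
                if a <= u < b:
                    lo, hi = lo + (hi - lo) * a, lo + (hi - lo) * b
                    out.append(tok)
                    break
        return " ".join(out)

    def reveal(text, n):
        # Re-compose the same intervals from the cover text to recover the n bits.
        lo, hi = Fraction(0), Fraction(1)
        for tok in text.split():
            a, b = next((a, b) for t, a, b in token_intervals() if t == tok)
            lo, hi = lo + (hi - lo) * a, lo + (hi - lo) * b
        return format(int(lo * 2 ** n), "0%db" % n)

    secret = "101101"
    cover = hide(secret)
    assert reveal(cover, len(secret)) == secret
    print(cover)

A real system additionally has to cope with finite precision, tokenization quirks, and making the output read like natural language, which is where the LLM earns its keep.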
"compressed size" does not seem to include the size of the model and the code to run it. According to the rules of Large Text Compression Benchmark, total size of those must be counted, otherwise a 0-byte "compressed" file with a decompressor containing the plaintext would win.
I propose the name tokables for the compressed data produced by this. A play on tokens and how wild it is.
So, barely 2 or 3 times better than xz.
Not really worth it.
Bellard finally working with his true colleague.
Current leader of the Large Text Compression Benchmark is NNCP (compression using neural networks), also by Fabrice Bellard:
https://bellard.org/nncp/
Also, nncp-2024-06-05.tar.gz is just 1180969 bytes, unlike ts_zip-2024-03-02.tar.gz (159228453 bytes, which is bigger than uncompressed enwik8).