Hacker News

Fabrice Bellard's TS Zip (2024)

86 points by everlier yesterday at 8:26 PM | 28 comments

Comments

omoikane yesterday at 10:39 PM

Current leader of the Large Text Compression Benchmark is NNCP (compression using neural networks), also by Fabrice Bellard:

https://bellard.org/nncp/

Also, nncp-2024-06-05.tar.gz is just 1180969 bytes, unlike ts_zip-2024-03-02.tar.gz (159228453 bytes, which is bigger than uncompressed enwik8).

gmuslera yesterday at 10:42 PM

Reminded me of the pi filesystem (https://github.com/philipl/pifs): with enough digits of pi precalculated, you might be able to build a decent compression program. The trick is how many digits you would reasonably need for that, and whether storing them takes more or less space than the trained LLM.

meisel yesterday at 9:43 PM

Looks like it beats everything in the large text compression benchmark for enwik8, but loses to several programs for enwik9. I wonder why that is.

oxag3n yesterday at 10:28 PM

Compression and intelligence reminded me of the Hutter Prize: https://www.hutter1.net/prize

I encountered it more than 10 years ago, and it felt novel that compression is related to intelligence and even AGI.

wewewedxfgdf yesterday at 9:19 PM

>> The ts_zip utility can compress (and hopefully decompress) text files

Hopefully :-)

egl2020 yesterday at 10:26 PM

When Jeff Dean gets stuck, he asks Bellard for help...

rurban yesterday at 9:52 PM

So he finally did beat his own leading program from 2019, nncp.

MisterTea yesterday at 9:18 PM

This is something I have been curious about: how an LLM achieves compression.

I would like to know what deviations appear in the output, as this almost feels like a game of telephone, where each re-compression loses data that is then incorrectly reconstructed. Sort of like misremembering a story: as you retell it over time, the details change slightly.
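
For the curious: this style of compression is lossless. The model's next-token probabilities feed an entropy coder (arithmetic coding, in this family of compressors), and since the encoder and decoder query the same deterministic model, the input is reconstructed exactly. A minimal sketch of the idea, with a hypothetical fixed bigram table standing in for the LLM:

    # Toy model-based lossless coder in the spirit of ts_zip, with a
    # hypothetical fixed bigram table standing in for the LLM. Encoder and
    # decoder query the identical, deterministic model, so the round trip is
    # exact; no "telephone" drift, as long as the model is bit-reproducible.
    from fractions import Fraction

    ALPHABET = "ab_"

    def model(prev):
        # Hypothetical stand-in for an LLM's next-token distribution.
        table = {
            "a": {"a": Fraction(1, 10), "b": Fraction(8, 10), "_": Fraction(1, 10)},
            "b": {"a": Fraction(8, 10), "b": Fraction(1, 10), "_": Fraction(1, 10)},
            "_": {"a": Fraction(4, 10), "b": Fraction(4, 10), "_": Fraction(2, 10)},
        }
        return table[prev]

    def encode(text):
        lo, hi = Fraction(0), Fraction(1)
        prev = "_"
        for ch in text:
            probs, cum = model(prev), Fraction(0)
            for sym in ALPHABET:            # narrow [lo, hi) to ch's sub-interval
                if sym == ch:
                    lo, hi = lo + (hi - lo) * cum, lo + (hi - lo) * (cum + probs[sym])
                    break
                cum += probs[sym]
            prev = ch
        return (lo + hi) / 2                # any point in the interval names the text

    def decode(x, n):
        out, prev = [], "_"
        lo, hi = Fraction(0), Fraction(1)
        for _ in range(n):
            probs, cum = model(prev), Fraction(0)
            for sym in ALPHABET:            # find the sub-interval containing x
                nlo = lo + (hi - lo) * cum
                nhi = lo + (hi - lo) * (cum + probs[sym])
                if nlo <= x < nhi:
                    out.append(sym)
                    lo, hi, prev = nlo, nhi, sym
                    break
                cum += probs[sym]
        return "".join(out)

    msg = "abab_ab"
    assert decode(encode(msg), len(msg)) == msg   # round-trips exactly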

SnowProblem yesterday at 10:26 PM

I love this because it gets to the heart of information theory. Shannon's foundational insight was that information is surprise. A random sequence is incompressible by definition. But what counts as surprise depends on context, and for text, we know a large amount of it is predictable slop. I suspect there's a lot of room to grow in this style of compression. For example, maybe you could store an upfront summary that makes prediction more accurate. Or perhaps you could encode larger sequences, or use some kind of hierarchical encoding. But this is great.
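
Concretely, a symbol's ideal code length is its surprisal, -log2(p), which an arithmetic coder approaches. A tiny worked example with hypothetical per-token probabilities:

    # Shannon's point in miniature: a symbol's ideal code length is its
    # surprisal, -log2(p). Well-predicted tokens are nearly free; the one
    # rare, surprising token dominates the bill. The probabilities below
    # are hypothetical values an LM might assign to five tokens.
    import math

    predicted = [0.9, 0.85, 0.6, 0.95, 0.02]
    bits = [-math.log2(p) for p in predicted]
    print([round(b, 2) for b in bits])            # [0.15, 0.23, 0.74, 0.07, 5.64]
    print(round(sum(bits), 2), "bits in total")   # about 6.84 bits for 5 tokens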

shawnz yesterday at 9:34 PM

Another fun application of combining LLMs with arithmetic coding is steganography. Here's a project I worked on a while back that effectively applies the technique here in reverse to construct a steganographic transformation: https://github.com/shawnz/textcoder
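
A much-simplified sketch of that reversal, using a hypothetical rank-based scheme rather than textcoder's actual arithmetic coding: the secret bits steer which plausible next token is emitted, and the receiver reruns the same model to read them back.

    # Toy LM steganography. The bigram "model" is a hypothetical stand-in
    # for an LLM: for each context it lists symbols from most to least
    # likely. To embed, each secret bit picks rank 0 or rank 1; once the
    # bits run out, we always emit the most likely symbol. To extract,
    # the receiver reruns the model and reads each symbol's rank.
    MODEL = {
        "a": "ba_", "b": "ab_", "_": "ab_",
    }

    def embed(bits, n):
        out, prev = [], "_"
        it = iter(bits)
        for _ in range(n):
            ranked = MODEL[prev]
            b = next(it, None)
            sym = ranked[int(b)] if b is not None else ranked[0]
            out.append(sym)
            prev = sym
        return "".join(out)

    def extract(stego, nbits):
        bits, prev = [], "_"
        for sym in stego:
            if len(bits) < nbits:
                bits.append(str(MODEL[prev].index(sym)))
            prev = sym
        return "".join(bits)

    secret = "1011"
    cover = embed(secret, 8)                   # innocuous-looking symbol stream
    assert extract(cover, len(secret)) == secret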

dmitrygr yesterday at 9:09 PM

"compressed size" does not seem to include the size of the model and the code to run it. According to the rules of Large Text Compression Benchmark, total size of those must be counted, otherwise a 0-byte "compressed" file with a decompressor containing the plaintext would win.

benatkin yesterday at 9:21 PM

I propose the name tokables for the compressed data produced by this. A play on tokens and how wild it is.

jokoon yesterday at 11:01 PM

So, barely 2 or 3 times better than xz.

Not really worth it.

publicdebates yesterday at 9:07 PM

Bellard finally working with his true colleague.