Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

84 points • by spidy__ • last Tuesday at 1:11 PM • 48 comments • view on HN

I built an experiment that uses an overfitted transformer and arithmetic coding to compress individual files.

Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.

On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).

It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).

Checkout the repo - https://github.com/samyak112/pym-particles

Comments

userbinator • today at 3:29 AM

Fabrice Bellard may have been the first to do this, 7 years ago: https://news.ycombinator.com/item?id=27244004

➕ show 1 reply

SubiculumCode • today at 3:44 AM

What do those compress to with conventional approaches? For comparison.

I am curious. A classic machine learning ensemble approach is to overfit a collection of small models then bag them (e.g. voting) allowing the models to generalize.

I'm sure someone's tried to overfit a bunch of transformers for compression like this, then bag them to see how well it does?

➕ show 1 reply

whacked_new • today at 5:38 AM

Somewhat related is stavros's method to compress 500KB to something like 50 bytes https://www.stavros.io/posts/compressing-images-with-stable-...

main drawback is that it's not lossless ;-)

but this is great. I hope this actually becomes a format that wraps the weights and transformer module (maybe this can also be NAS-optimized too?). Maybe it would even work for video?

It's like calling gzip but instead of compression level you choose kolmogorov complexity level

➕ show 2 replies

7373737373 • last Tuesday at 3:27 PM

What does it compress the full 1GB file to? http://prize.hutter1.net/

➕ show 3 replies

wildstrawberry • today at 4:04 AM

Three questions:

1. How much was AI used to generate documentation for this project?

2. The 100MB CSV data sources are not provided in the repo so it doesn't seem possible to reproduce your results. The enwik9 dataset says it is a "slice" of the larger data set, and there are many NYC taxi trip record datasets that exist. Can you provide the datasets used to generate your results?

3. I am surprised to see performance comparisons only between your transformer and WinZIP. What were your results when comparing your transformer to more modern approaches like LZMA2 (level 9), BZIP2 and ZPAQ (max effort)?

➕ show 1 reply

jmspring • today at 4:59 AM

The model is the important part, a huffman code or adaptive huffman or other sorts of encoders would be much better on a dataset based on the model. You need the model to also decode. And on a dataset of sufficient size, embedding the model and the benefit of it's memorization of the file can be offset.

A non-general compression algorithm (model - I don't mean a distinct llm, but "modeling data") targeted at a specific dataset will always do better than a general algorithm.

The reason I mentioned the "encoder" doesn't matter - arithmetic coding, for the data it is presented, will beat huffman/adaptive huffman every day, but it's the model that is where the real "compression" comes into play.

I've implemented enough "coders" over the years, including arithmetic for both commercial and research purposes (was a student of Glen Langdon).

purple-leafy • today at 4:59 AM

Dumb question: can you train a model to predict the next byte of ANOTHER MODEL

So apply this same logic to compressing a bigger model within a smaller model

I know this is absolutely regarded, but humour me please

➕ show 2 replies

jxmorris12 • today at 4:57 AM

Lo and behold, a nice arithmetic coding implementation that wasn't written by an LLM! A sight for sore eyes – a treat, even. Looks like it was written by someone else though.

Check it out: https://github.com/samyak112/pym-particles/blob/main/arithme...

➕ show 1 reply

rtpg • today at 4:28 AM

I've had this idea of building a codec that would similarly overfit to specific images. But the codec itself would not be a fixed size transformer... instead you could just mess around with the sizing to get better quality/smaller size.

So the codec would be something like: <header describing image size + transformer layer shape> <transformer data itself>

I've seen experiments where people have a "fixed" pipeline but I think having something more dynamic would work quite well.

➕ show 1 reply

tae0086 • yesterday at 2:11 AM

Neat approach. Since the 900KB model ships with the compressed file, is there a file size below which the model overhead just eats the gains? Curious where the crossover is.

➕ show 1 reply

test1072 • today at 5:26 AM

So has anyone tried to you know for example keep constant weights base model and just transmit the data, might be better compression

➕ show 1 reply

totetsu • today at 5:33 AM

if first bit is 1 then decompress to a picture of my cat, else its just Huffman

purple-leafy • last Wednesday at 9:43 AM

That’s so awesome! I want to try something similar. I’ve been going crazy with compression work. I reckon I can beat that prize link

➕ show 1 reply

IncreasePosts • today at 4:56 AM

Isnt this what auto encoders are for?

spidy__ • today at 6:22 AM