Hacker News

wildstrawberry today at 4:04 AM

Three questions:

1. How much was AI used to generate documentation for this project?

2. The 100MB CSV data sources are not provided in the repo so it doesn't seem possible to reproduce your results. The enwik9 dataset says it is a "slice" of the larger data set, and there are many NYC taxi trip record datasets that exist. Can you provide the datasets used to generate your results?

3. I am surprised to see performance comparisons only between your transformer and WinZIP. What were your results when comparing your transformer to more modern approaches like LZMA2 (level 9), BZIP2 and ZPAQ (max effort)?


Replies

spidy__ today at 6:07 AM

1. I wrote the content I wanted to cover in the documentation myself and only used AI to polish it so that it's easy to understand. Is the documentation hard to understand right now?

2. I have added the links for downloading both the enwik9 slice and the NYC dataset. Apologies, I forgot to add them.

You can get it from here - https://github.com/samyak112/pym-particles/blob/main/README....

3. Other than ZIP, I tested it with zstd (level 19), and now that you mention LZMA2 and BZIP2, I got these results on the 100 MB enwik9 slice:

zstd - 28 MB
bzip2 - 30 MB
lzma2 - 26 MB

I will mention these, along with results from ZPAQ, in the README for both files. Thanks for pointing them out!

That said, this neural compression approach isn't practical right now: it takes hours to compress and decompress a 100 MB file, so it's not really usable and is more of a fun project.
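
For anyone who wants to sanity-check baseline numbers like the ones in point 3, here is a minimal sketch using only Python's standard library. It is not the script used for the results above; the function names `compressed_sizes` and `report` are made up for illustration, and zstd and ZPAQ are left out because they are not assumed to be available in the standard library. Note that `lzma.compress` writes the `.xz` container, which uses the LZMA2 filter, so its output size may differ slightly from other LZMA2 tools.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict[str, int]:
    """Return the compressed size in bytes for each codec at its highest level."""
    return {
        "zlib (level 9)": len(zlib.compress(data, level=9)),
        "bzip2 (level 9)": len(bz2.compress(data, compresslevel=9)),
        "xz/LZMA2 (preset 9)": len(lzma.compress(data, preset=9)),
    }


def report(path: str) -> None:
    """Print compressed sizes for the (non-empty) file at `path`."""
    # Reads the whole file into memory, which is fine for a 100 MB slice.
    with open(path, "rb") as f:
        data = f.read()
    for name, size in compressed_sizes(data).items():
        print(f"{name}: {size / 1e6:.1f} MB ({size / len(data):.1%} of original)")
```

Calling `report("enwik9_slice")` (a hypothetical local path to the 100 MB slice) would print one line per codec. Exact sizes depend on which slice of enwik9 is used, which is why having the download link matters for reproducing the comparison.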
