"compressed size" does not seem to include the size of the model and the code to run it. According to the rules of Large Text Compression Benchmark, total size of those must be counted, otherwise a 0-byte "compressed" file with a decompressor containing the plaintext would win.
True for competitions, but if your compression algorithm is general purpose then this matters less (within reason - no one wants to lug around a 1TB compression program).
Yeah, but the size of the xz binary is also not counted in those bytes... Here the "program" is the LLM, much like your brain remembers things by encoding them in a compressed form and then reconstructing them. It is a different type of compression: compression by "understanding", which requires some representation of the whole space of possible inputs. The comparison is not fair to classical algorithms, yet that's how you can compress a lot more (for a particular language): by having a model of it.
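To make that concrete, here is a minimal sketch of "compression by having a model" (a toy character-bigram model standing in for the LLM; nothing here comes from the thread): an ideal entropy coder driven by the model spends roughly -log2 p bits per symbol, so text the model predicts well shrinks a lot, while the model itself sits outside the per-message byte count.

```python
import math
from collections import defaultdict

def train_bigram(corpus):
    """Toy 'model of the language': character-bigram counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, ch in zip(corpus, corpus[1:]):
        counts[prev][ch] += 1
    return counts

def message_bits(model, text, alphabet=256):
    """Shannon bound: an ideal coder driven by the model spends about
    -log2 p(next char | prev char) bits per character (Laplace smoothing)."""
    bits = 8.0  # first character sent raw
    for prev, ch in zip(text, text[1:]):
        row = model.get(prev, {})
        p = (row.get(ch, 0) + 1) / (sum(row.values()) + alphabet)
        bits += -math.log2(p)
    return bits

corpus = "the cat sat on the mat. the dog sat on the log. " * 500  # stand-in corpus
model = train_bigram(corpus)
msg = "the dog sat on the mat."
print(f"model-coded: ~{message_bits(model, msg) / 8:.1f} bytes vs {len(msg)} raw bytes")
# The counts table (the analogue of the LLM weights) is not charged to the
# message, which is exactly the accounting question in this thread.
```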
Technically correct, but a better benchmark would be a known compressor evaluated on an unknown set of inputs drawn from a real-world population, e.g. coherent English text.
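A sketch of what that benchmark could look like, assuming Python's lzma as the fixed, known compressor and a few made-up sentences standing in for inputs sampled from the real-world population:

```python
import lzma

def avg_compressed_size(compress, samples):
    """Score a fixed compressor on inputs it has never seen."""
    sizes = [len(compress(s.encode("utf-8"))) for s in samples]
    return sum(sizes) / len(sizes)

held_out = [
    "The committee will meet again on Thursday to review the proposal.",
    "Rainfall across the region was well below the seasonal average.",
    "She finished the marathon despite a sprained ankle.",
]  # stand-in for sentences drawn from a real corpus

xz_avg = avg_compressed_size(lambda b: lzma.compress(b, preset=9), held_out)
print(f"xz: {xz_avg:.1f} bytes/sample")
```

A model-based compressor could be scored by the same harness, and the model (or decompressor) size would then be amortized over the whole population of inputs rather than charged against a single file.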