I've figured out the issue. Use `wc -c` instead of `du`.
I can repro this on my Mac with either `zstd` or `gzip`:
$ rm -f ksh.zst
$ zstd < /bin/ksh > ksh.zst
$ du -h ksh.zst
1.2M ksh.zst
$ wc -c ksh.zst
1240701 ksh.zst
$ zstd < /bin/ksh > ksh.zst
$ du -h ksh.zst
2.0M ksh.zst
$ wc -c ksh.zst
1240701 ksh.zst
$ rm -f ksh.gz
$ gzip < /bin/ksh > ksh.gz
$ du -h ksh.gz
1.2M ksh.gz
$ wc -c ksh.gz
1246815 ksh.gz
$ gzip < /bin/ksh > ksh.gz
$ du -h ksh.gz
2.1M ksh.gz
$ wc -c ksh.gz
1246815 ksh.gz
When a file is overwritten, the on-disk size `du` reports gets bigger, even though the byte count is unchanged. I don't know why. But you must have run zstd's benchmark twice, and every other compressor's benchmark once.
I'm a zstd developer, so I have a vested interest in accurate benchmarks, and in finding & fixing issues :)
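Tangent, but if you want to see both numbers side by side, `stat` can print the byte count (what `wc -c` reports) and the allocated block count (what `du` derives its figure from, st_blocks in 512-byte units). A quick sketch against the ksh.zst file from the transcript above; the BSD and GNU variants take different flags:
$ stat -f '%z bytes, %b blocks' ksh.zst   # macOS / BSD stat
$ stat -c '%s bytes, %b blocks' ksh.zst   # Linux / GNU coreutils stat
After the overwrite, the byte count stays at 1240701 while the block count is what grows.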
Interesting!
It doesn't seem to be only about overwriting: I can start in a directory without any .zst files, run the command to compress 55 files in parallel, and it's still 45M according to `du -h`. But you're right, `wc -c` shows 38809999 bytes regardless of whether `du -h` shows 45M after a parallel compression or 38M after a sequential one.
My mental model of `du` was basically that it gives a size accurate to the nearest 4k block, which is usually accurate enough. Seems I have to reconsider. Too bad there's no standard alternative that has the interface of `du` but with byte-accurate file sizes...
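Though GNU `du` comes close: it has `--apparent-size`, which reports file sizes (byte counts) rather than disk usage, and `-b` as shorthand for `--apparent-size --block-size=1`. It's not POSIX, but on a Mac Homebrew's coreutils installs it as `gdu`. A sketch, reusing the ksh.zst file from above as the example:
$ du -b ksh.zst                    # GNU du: exact byte count, matches wc -c
$ gdu --apparent-size -h ksh.zst   # via Homebrew coreutils on macOS
With `-h` it still rounds for display, but it rounds the byte count rather than the allocated-block count, so sequential and parallel runs report the same size.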