Or with the 20 Newsgroups dataset:
curl http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz | tar -xzf -
cd 20_newsgroups
for f in *; do zstd --train "$f"/* -o "../$f.dict"; done
cd ..
for d in *.dict; do
cat 20_newsgroups/misc.forsale/74150 | zstd -D "$d" | wc -c | tr -d '\n'; echo " $d";
done | sort -n | head -n 3
Output:
422 misc.forsale.dict
462 rec.autos.dict
463 comp.sys.mac.hardware.dict
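If you want to reuse the idea, the loop above can be wrapped into a small function. This is just a rough sketch: `pick_smallest` and `classify` are names I made up, and it assumes the per-group `*.dict` files from the training step sit in the directory you pass as the second argument.

```shell
# Read "SIZE LABEL" lines on stdin and print the label with the smallest size.
pick_smallest() {
  sort -n | head -n 1 | awk '{ print $2 }'
}

# Hypothetical helper: guess the newsgroup of file $1 by compressing it with
# every dictionary in directory $2 (default: current directory) and keeping
# the one that yields the smallest output.
classify() {
  file=$1
  dict_dir=${2:-.}
  for d in "$dict_dir"/*.dict; do
    # Compressed size in bytes; tr strips the padding some wc versions add.
    size=$(zstd -c -D "$d" < "$file" | wc -c | tr -d ' ')
    echo "$size $(basename "$d" .dict)"
  done | pick_smallest
}
```

Usage would be something like `classify 20_newsgroups/misc.forsale/74150 .`, which prints only the best-matching group name.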
Pretty neat IMO.