notpushkin, today at 5:41 AM

Or with the 20 Newsgroups dataset:

  curl http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz | tar -xzf -
  cd 20_newsgroups
  for f in *; do zstd --train "$f"/* -o "../$f.dict"; done
  cd ..
  for d in *.dict; do
    cat 20_newsgroups/misc.forsale/74150 | zstd -D "$d" | wc -c | tr -d '\n'; echo " $d";
  done | sort -n | head -n 3

Output:

     422 misc.forsale.dict
     462 rec.autos.dict
     463 comp.sys.mac.hardware.dict
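
Pretty neat IMO: the dictionary trained on misc.forsale gives the smallest output for a misc.forsale post, so compressed size works as a rough topic classifier.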
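
The loop above could be wrapped into a small helper. This is only a rough sketch: `classify` is a made-up function name (not part of zstd), and it assumes the `.dict` files were already trained as shown and that `zstd` is on the PATH.

```shell
# classify FILE DICT...
# Hypothetical helper: compress FILE once per dictionary and print the
# dictionary that yields the smallest compressed output.
classify() {
  file=$1
  shift
  for d in "$@"; do
    # zstd -c writes to stdout, -D selects the dictionary;
    # tr strips the padding some wc implementations add.
    printf '%s %s\n' "$(zstd -c -D "$d" < "$file" | wc -c | tr -d ' ')" "$d"
  done | sort -n | head -n 1 | cut -d' ' -f2-
}
```

Going by the output above, `classify 20_newsgroups/misc.forsale/74150 *.dict` should print `misc.forsale.dict`. Note the test post is also part of the training set here, so this flatters the result a bit; holding out some files before training would be a fairer check.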