This looks like a nice rundown of how to do this with Python's zstd module.
But, I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are very closely related, but practical compressors and practical classifiers have different goals and different practical constraints).
Back in 2023, I wrote two blog-posts [0,1] that refused the results in the 2023 paper referenced here (bad implementation and bad data).
Good on you for attempting to reproduce the results & writing it up, and reporting the issue to the authors.
> It turns out that the classification method used in their code looked at the test label as part of the decision method and thus led to an unfair comparison to the baseline results
Concur. Zstandard is a good compressor, but it's not magical; comparing the compressed size of Zstd(A+B) to the common size of Zstd(A) + Zstd(B) is effectively just a complicated way of measuring how many words and phrases the two documents have in common. Which isn't entirely ineffective at judging whether they're about the same topic, but it's an unnecessarily complex and easily confused way of doing so.