Concur. Zstandard is a good compressor, but it's not magical; comparing the compressed size of Zstd(A+B) to the common size of Zstd(A) + Zstd(B) is effectively just a complicated way of measuring how many words and phrases the two documents have in common. Which isn't entirely ineffective at judging whether they're about the same topic, but it's an unnecessarily complex and easily confused way of doing so.
Yup. Data compression ≠ semantic compression.
I do not know inner details of Zstandard, but I would expect that it to least do suffix/prefix stats or word fragment stats, not just words and phrases.