It's surprisingly difficult, and the "obvious" techniques (just do embeddings) don't really work. I wrote about it and ran benchmarks here: https://joecooper.me/blog/redundancy/