logoalt Hacker News

jonatronyesterday at 3:42 PM1 replyview on HN

You could first calculate the distance of the first n bits (eg: 64, one popcountll) as a first pass, then calculate the full distance for candidates over a threshold from the first pass. It makes it approximate, but depending on the application it can be worth it.


Replies

mbreeseyesterday at 4:13 PM

I was thinking of something similar — instead of just two passes, couldn’t you also store different quantized values? If you have thousands of documents, you could narrow it down to a handful with a few bit-wise Hamming comparisons before doing a full cosine similarity on just the rest. If you hand more than one bitmap stored, you’d have fewer comparisons at each step too.

Would this work?