logoalt Hacker News

skybrianyesterday at 11:14 PM4 repliesview on HN

Are these sort of similarity searches useful for classifying text?


Replies

CuriouslyCyesterday at 11:44 PM

Embeddings are good at partitioning document stores at a coarse grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, and they often don't even fully pull weight for their cost in hybrid setups, so it's worth doing evals for your specific use case.

neilellistoday at 12:15 AM

Yes, also for semantic indexes, I use one for person/role/org matches. So that CEO == chief executive ~= managing director good when you have grey data and multiple look up data sources that use different terms.

esafakyesterday at 11:54 PM

You could assign the cluster based on what the k nearest neighbors are, if there is a clear majority. The quality will depend on the suitability of your embeddings.

OutOfHereyesterday at 11:32 PM

It altogether depends on the quality and suitability of the provided embedding vector that you provide. Even with a long embedding vector using a recent model, my estimation is that the classification will be better than random but not too accurate. You would typically do better by asking a large model directly for a classification. The good thing is that it is often easy to create a small human labeled dataset and estimate the error confusion matrix via each approach.