thornton • last Saturday at 11:54 PM • 1 reply

We’ve done similar work. Use case was identifying pages in an old website that now 404 and where they should be redirected to.

Basically doc2vec and cosine similarity. The matching outputs were totally nonsensical, to the point that matching on title-tag vectors or a précis worked better, so now I’m curious if we just did something wrong…
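For readers following along, here is a minimal sketch of the matching step described above: score each dead (404) page against every live page by cosine similarity and keep the best match. It is not the poster's actual code. The function names, the `min_score` cutoff, the URLs, and the toy 3-dimensional vectors are all hypothetical; real vectors would come from whatever embedding model is in use.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_redirects(dead_pages, live_pages, min_score=0.5):
    """Map each dead URL to (live_url, score), or None if nothing clears min_score.

    min_score is an arbitrary illustrative cutoff, not a recommended value.
    """
    redirects = {}
    for dead_url, dead_vec in dead_pages.items():
        scored = [(cosine(dead_vec, vec), url) for url, vec in live_pages.items()]
        if not scored:
            redirects[dead_url] = None
            continue
        score, url = max(scored)  # highest similarity wins
        redirects[dead_url] = (url, score) if score >= min_score else None
    return redirects

# Toy document vectors (hypothetical); in practice these come from the model.
dead = {
    "/old/pricing": [0.9, 0.1, 0.0],
    "/old/team": [0.0, 0.2, 0.9],
    "/old/misc": [0.0, 1.0, 0.0],
}
live = {
    "/plans": [1.0, 0.0, 0.1],
    "/about": [0.1, 0.1, 1.0],
}

redirects = best_redirects(dead, live)
for dead_url, match in redirects.items():
    print(dead_url, "->", match)
```

Leaving low-scoring pages unmatched (here `/old/misc`) makes it easier to spot when the vectors themselves are the problem rather than the matching step.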

Replies

gojomo • yesterday at 12:33 AM

If by 'doc2vec' you mean the word2vec-like 'Paragraph Vectors' technique: even though that's a far simpler approach than the transformer embeddings, it usually works pretty well for coarse document similarity. Even the famous word2vec vector-addition operations kinda worked, as illustrated by some examples in the followup 'Paragraph Vector' paper in 2015: https://arxiv.org/abs/1507.07998

So if for you the resulting doc-to-doc similarities seemed nonsensical, there was likely some process error in model training or application.
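One common way to look for that kind of process error is a self-similarity check: re-infer a vector for each training document and see whether its nearest neighbour among the trained vectors is the document itself. The sketch below assumes both sets of vectors are already available as plain lists; `self_hit_rate` and the toy 2-dimensional data are hypothetical, and no particular library is implied.

```python
import math

def unit(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def self_hit_rate(trained, reinferred):
    """Fraction of documents whose re-inferred vector is closest to its own trained vector."""
    trained_units = {doc_id: unit(vec) for doc_id, vec in trained.items()}
    hits = 0
    for doc_id, vec in reinferred.items():
        query = unit(vec)
        # On unit vectors, the dot product equals cosine similarity.
        nearest = max(
            trained_units,
            key=lambda other: sum(q * t for q, t in zip(query, trained_units[other])),
        )
        hits += nearest == doc_id
    return hits / len(reinferred)

# Toy data (hypothetical): "c" is re-inferred closer to "b" than to itself.
trained = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
reinferred = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.2, 0.8]}

rate = self_hit_rate(trained, reinferred)
print(f"self-hit rate: {rate:.2f}")
```

A rate far below 1.0 on real data would point at training or inference settings rather than at the technique itself.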
