
montebicyclelo · yesterday at 9:41 PM

> king-man+woman=queen

Is the famous example everyone uses when talking about word vectors, but is it actually just very cherry picked?

I.e. are there a great number of other "meaningful" examples like this, or do you actually, the majority of the time, end up with some kind of vaguely, tangentially related word when adding and subtracting word vectors?

(Which seems to be what this tool is helping to illustrate, having briefly played with it, and looked at the other comments here.)

(Btw, not saying wordvecs / embeddings aren't extremely useful, just talking about this simplistic arithmetic)
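For context, these analogies are usually computed as a nearest-neighbour lookup after the vector arithmetic. A minimal sketch for trying it yourself, assuming gensim and its downloadable pretrained GloVe vectors are available (the model name is one choice among several; the printed scores are illustrative):

    import gensim.downloader as api

    # Pretrained 100-d GloVe vectors via gensim-data (assumed available; a sizeable download).
    model = api.load("glove-wiki-gigaword-100")

    # king - man + woman: gensim adds the positive vectors, subtracts the negatives,
    # and returns the nearest vocabulary words by cosine similarity,
    # excluding the input words themselves.
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # e.g. [('queen', 0.77), ('monarch', 0.69), ...] -- exact scores depend on the model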


Replies

jbjbjbjb · yesterday at 11:52 PM

Well, when it works out it is quite satisfying:

India - Asia + Europe = Italy

Japan - Asia + Europe = Netherlands

China - Asia + Europe = Soviet-Union

Russia - Asia + Europe = European Russia

calculation + machine = computer

Retr0id · yesterday at 10:42 PM

I think it's slightly uncommon for the vectors to "line up" just right, but here are a few I tried:

actor - man + woman = actress

garden + person = gardener

rat - sewer + tree = squirrel

toe - leg + arm = digit

gregschlom · yesterday at 10:05 PM

Also, as I just learned the other day, the result was never exactly equal to "queen", just close to it in the vector space.
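A small sketch of what "close" means here, reusing the model object from the sketch above (an assumption): do the arithmetic on the raw vectors and compare them by cosine similarity.

    import numpy as np

    # Raw arithmetic on the word vectors (model from the earlier sketch).
    v = model["king"] - model["man"] + model["woman"]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(v, model["queen"]))  # high, but well short of 1.0
    print(cosine(v, model["king"]))   # often even higher, which is why analogy tools
                                      # exclude the input words when ranking neighbours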

bee_rider · today at 12:37 AM

Hmm, well I got

    cherry - picker = blackwood
if that helps.
raddan · yesterday at 9:51 PM

> is it actually just very cherry picked?

100%

groby_b · yesterday at 10:45 PM

I think it's worth keeping in mind that word2vec was specifically trained on semantic similarity. Most embedding APIs don't really give a lick about the semantic space.

And, worse, most latent spaces are decidedly non-linear, and so arithmetic loses a lot of its meaning. (IIRC word2vec mostly avoided nonlinearity except for the loss function.) Yes, the distance metric sort of survives, but addition/multiplication are meaningless.

(This is also the reason choosing your embedding model is a hard-to-reverse technical decision - you can't just transform existing embeddings into a different latent space. A change means "reembed all".)
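A quick way to see the "reembed all" point, assuming the same gensim downloader models used in the sketches above: vectors from two different models live in unrelated spaces, often with different dimensions, so they can't be mixed or compared.

    import gensim.downloader as api

    # Two different embedding models (both assumed available via gensim-data).
    m_small = api.load("glove-wiki-gigaword-50")
    m_big = api.load("glove-wiki-gigaword-100")

    print(m_small["queen"].shape, m_big["queen"].shape)  # (50,) vs (100,)
    # Even with matching dimensions there is no shared coordinate system,
    # so similarities across models are meaningless; switching models means
    # re-embedding every stored vector.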