I've been playing with embeddings and wanted to see what results the embedding layer produces from just word-by-word input and addition / subtraction, beyond what many videos / papers mention (like the obvious king - man + woman = queen). So I built something that doesn't just give the first answer, but ranks the matches by distance / cosine similarity. I polished it a bit so that others can try it out, too.
For now, I only have nouns (and some proper nouns) in the dataset, and pick the most common interpretation among the homographs. Also, it's case-sensitive.
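To give a rough idea of the kind of arithmetic involved, here's a minimal sketch using gensim and a generic word2vec-format model (not my actual implementation, and not the model behind the site): do the word arithmetic, then rank every other word by cosine similarity to the result.

```python
# Minimal sketch: word arithmetic plus ranked cosine-similarity matches.
# Assumes any word2vec-format vectors file; path is a placeholder.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def ranked_arithmetic(positive, negative=(), topn=10):
    """Add/subtract word vectors, then rank the vocabulary by cosine similarity."""
    query = np.sum([wv[w] for w in positive], axis=0)
    if negative:
        query = query - np.sum([wv[w] for w in negative], axis=0)
    query = query / np.linalg.norm(query)
    sims = wv.vectors @ query / np.linalg.norm(wv.vectors, axis=1)
    exclude = set(positive) | set(negative)   # don't echo the query words back
    results = []
    for idx in np.argsort(-sims):
        word = wv.index_to_key[idx]
        if word not in exclude:
            results.append((word, float(sims[idx])))
        if len(results) == topn:
            break
    return results

print(ranked_arithmetic(["king", "woman"], ["man"]))
```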
> king-man+woman=queen
Is the famous example everyone uses when talking about word vectors, but is it actually just very cherry-picked?
I.e., are there a great number of other "meaningful" examples like this, or do you actually end up, the majority of the time, with some kind of vaguely, tangentially related word when adding and subtracting word vectors?
(Which seems to be what this tool is helping to illustrate, having briefly played with it, and looked at the other comments here.)
(Btw, not saying wordvecs / embeddings aren't extremely useful, just talking about this simplistic arithmetic)
Some of these make more sense than others (and bookshop is hilarious even if it's only the best answer by a small margin; no shade to bookshop owners).
map - legend = Mercator projection
noodle - wheat = egg noodle
noodle - gluten = tagliatelle
architecture - calculus = architectural style
answer - question = comment
shop - income = bookshop
curry - curry powder = cuisine
rice - grain = chicken and rice
rice + chicken = poultry
milk + cereal = grain
blue - yellow = Fiji
blue - Fiji = orange
blue - Arkansas + Bahamas + Florida - Pluto = Grenada
First off, this interface is very nice and a pleasure to use, congrats!
Are you using word2vec for these, or embeddings from another model?
I also wanted to add some flavor, since it looks like many folks in this thread haven't seen something like this - it's been known since 2013 that we can do this (but it's great to remind folks, especially with all the "modern" interest in NLP).
It's also known (in some circles!) that a lot of these vector arithmetic things need some tricks to really shine. For example, excluding the words already present in the query[1]. Others in this thread seem surprised at some of the biases present - there's also a long history of work on that [2,3].
[1] https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935...
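To make that exclusion trick concrete, here's a small gensim sketch (generic word2vec vectors, not the model this site uses): the naive nearest neighbour to king - man + woman is usually "king" itself, because the query vector stays close to its largest term, while gensim's `most_similar` filters the query words out so "queen" can surface.

```python
# Illustration of the "exclude the query words" trick; paths/model are placeholders.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Naive nearest neighbour to the raw arithmetic result.
query = wv["king"] - wv["man"] + wv["woman"]
naive = wv.similar_by_vector(query, topn=3)

# most_similar normalizes the inputs and drops the query words from the results.
filtered = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)

print("naive:   ", naive)
print("filtered:", filtered)
```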
I don't get it but I'm not sure I'm supposed to.
life + death = mortality
life - death = lifestyle
drug + time = occasion
drug - time = narcotic
art + artist + money = creativity
art + artist - money = muse
happiness + politics = contentment
happiness + art = gladness
happiness + money = joy
happiness + love = joy
Here's a challenge: find something to subtract from "hammer" which does not result in a word that has "gun" as a substring. I've been unsuccessful so far.
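If you want to attack this programmatically, here's a brute-force sketch with gensim (different model and vocabulary than the site, so it won't match its answers exactly): try each vocabulary word as the subtrahend and keep the ones whose top match contains no "gun".

```python
# Brute-force search for "hammer - X" results that don't contain "gun".
# Assumes a generic word2vec-format model; path is a placeholder.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

hits = []
for word in wv.index_to_key[:20000]:      # limit candidates for speed
    if word == "hammer":
        continue
    best, score = wv.most_similar(
        positive=["hammer"], negative=[word], topn=1, restrict_vocab=50000
    )[0]
    if "gun" not in best.lower():
        hits.append((word, best, score))
    if len(hits) >= 10:
        break

for word, best, score in hits:
    print(f"hammer - {word} = {best} ({score:.2f})")
```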
"man-intelligence=woman" is a particularly interesting result.
This is super neat.
I built a game[0] along similar lines, inspired by Infinite Craft[1].
The idea is that you combine (or subtract) “elements” until you find the goal element.
I’ve had a lot of fun with it, but it often hits the same generated element. Maybe I should update it to use the second (third, etc.) choice, similar to your tool.
These are pretty good results. I messed around with a dumber and more naive version of this a few years ago[1], and it wasn't easy to get sensible output most of the time.
As you might expect from a system with knowledge of word relations but without understanding or a model of the world, this generates gibberish which occasionally sounds interesting.
This might be helpful: I haven't surfaced it in the UI, but in the API response you can see the word definitions for both the input and the output. If the output has homographs, the likelihood is split per definition, but the UI only shows the best one.
Also, in case it gets buried in the comments: proper nouns need to be capitalized (Paris - France + Germany).
I am planning on patching up the UI based on your feedback.
I've always wondered if there's a way to find which vectors are most important in a model like this. The gender vector man-woman or woman-man is the one always used in examples, since English has many gendered terms, but I wonder if it's possible to generate these pairs given the data. Maybe by listing the differences of all pairs of vectors and seeing if there are any clusters. I imagine some grammatical features would show up, like the plurality vector people-person, or the past tense vector walked-walk, but maybe there would be some that are surprisingly common but don't seem to map cleanly to an obvious concept.
Or maybe they would all be completely inscrutable and man-woman would be like the 50th strongest result.
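One way to probe this would be to sample word pairs, normalize their difference vectors, cluster them, and eyeball which clusters correspond to recognizable relations. A rough sketch with gensim and scikit-learn (uniform random pairs are crude; a serious attempt would sample more cleverly):

```python
# Cluster difference vectors of random word pairs and inspect the clusters.
# Assumes a generic word2vec-format model; path and sizes are placeholders.
import random
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

vocab = wv.index_to_key[:5000]                      # most frequent words only
pairs = [tuple(random.sample(vocab, 2)) for _ in range(20000)]
diffs = np.stack([wv[a] - wv[b] for a, b in pairs])
diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)

km = KMeans(n_clusters=50, n_init=10).fit(diffs)

# Print a few pairs per cluster to see whether any axis (gender, plurality,
# tense, ...) shows up as a coherent direction.
for c in range(5):
    members = [pairs[i] for i in np.where(km.labels_ == c)[0][:5]]
    print(c, members)
```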
man - courage = husband
London-England+France=Maupassant
This is super fun. Offering the ranked matches makes it significantly more engaging than just showing the final result.
Interesting: parent + male = female (83%)
Can't personally find the connection here; I was expecting father or something.
What about starting with the result and finding a set of words that, when summed together, gives that result?
That could be seen as trying to find the true "meaning" of a word.
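A greedy version of that inverse problem is easy to sketch: repeatedly pick the word whose addition moves the running sum closest (by cosine) to the target. This is just a toy with a generic gensim model, and the `decompose` helper is hypothetical, not part of the tool's API:

```python
# Greedy decomposition of a target word into a sum of other word vectors.
# Assumes a generic word2vec-format model; path is a placeholder.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def decompose(target, n_terms=3, candidates=5000):
    goal = wv[target]
    vocab = [w for w in wv.index_to_key[:candidates] if w != target]
    vectors = np.stack([wv[w] for w in vocab])
    chosen, total, used = [], np.zeros_like(goal), set()
    for _ in range(n_terms):
        sums = total + vectors                       # every candidate running sum
        sims = (sums @ goal) / (np.linalg.norm(sums, axis=1) * np.linalg.norm(goal))
        for i in used:                               # don't pick the same word twice
            sims[i] = -np.inf
        best = int(np.argmax(sims))
        used.add(best)
        chosen.append(vocab[best])
        total = sums[best]
    return chosen, float(sims[best])

words, score = decompose("queen")
print(" + ".join(words), f"~ queen ({score:.2f})")
```

The first pick will usually be a near-synonym of the target, so the interesting part is whether the later terms add anything "meaningful" or just mop up residual noise.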
There was a site like this a few years ago (before all the LLM stuff kicked off) that had this and other NLP functionality. Styling was grey and basic. That’s all I remember.
I’ve been unable to find it since. Does anyone know which site I’m thinking of?
Just use an LLM API to generate results; it will be far better and more accurate than a weird home-cooked algorithm.
I tried:
-red
and:
red-red-red
But it did not work, and I did not get any response. Maybe I am stupid, but should this not work?
What does it mean when it surrounds a word in red? Is this signalling an error?
Cool, but not enough data to be useful yet, I guess. Most of mine either didn't have the words or were a few % off the answer: vehicle - road + ocean gave me hydrosphere, but the other options below were boat, ship, etc. Klimt almost made it from Mozart - music + painting. doctor - hospital + school = teacher, nailed it.
Getting to cornbread elegantly has been challenging.
shows how bad embeddings are in a practical way
dog - cat = paleolith
paleolith + cat = Paleolithic Age
paleolith + dog = Paleolithic Age
paleolith - cat = neolith
paleolith - dog = hand ax
cat - dog = meow
Wonder if some of the math is off or I am not using this properly
dog+woman = man
That's weird.
fluid + liquid = solid (85%) -- didn't expect that
blue + red = yellow (87%) -- rgb, neat
black + {red,blue,yellow,green} = white 83% -- weird
horse + man = male horse (78%), horseman (72%)
mathematics - Santa Claus = applied mathematics
hacker - code = professional golf
Really?!
man - brain = woman
woman - brain = businesswoman
wine - alcohol = grape juice (32%)
Accurate.
man - intelligence = woman (36%)
woman + intelligence = man (77%)
Oof.
uncle + aunt = great-uncle (91%)
great idea, but I find the results unamusing
doctor - man + woman = medical practitioner
Good to understand this bias before blindly applying these models (yes, doctor is gender-neutral - even women can be doctors!!)
goshawk-cocaine = gyrfalcon , which is funny if you know anything about goshawks and gyrfalcons
(Goshawks are very intense, gyrs tend to be leisurely in flight.)
dog - fur = Aegean civilization (22%)
huh
King-man+woman=Navratilova, who is apparently a Czech tennis player. It seems it's very case-sensitive. Cool idea!
Woman + president = man
male + age = female
female + age = male
rice + fish = fish meat
rice + fish + raw = meat
hahaha... I JUST WANT SUSHI!
man + woman = adult female body
it doesn't know the word human
I'm getting Navratilova instead of queen. And I can't get other words to work; I get red circles or no answer at all.
The app produces nonsense ... such as quantum - superposition = quantum theory !!!
garden + sin = gardening
hmm...