Hacker News

gojomo · yesterday at 12:51 AM · 1 reply

Not sure you can judge whether these modern models do well on the 'arithmetic analogization' task from absolute similarity values alone – & especially not from L2 distances.

That it ever worked was simply because, among the universe of candidate answers, the right answer was closer to the arithmetic-result point than the other candidates – not necessarily close on any absolute scale. Especially in higher dimensions, everything gets angularly far from everything else – the "curse of dimensionality".

But the relative differences may still be just as useful/effective. So the real evaluation of effectiveness can't be done with the raw value diff(king-man+woman, queen) alone. It needs to check if that value is less than that for every other alternative to 'queen'.
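A minimal sketch of that relative evaluation – rank the true answer against every other candidate, rather than eyeballing one raw distance. The toy vocabulary and random vectors here are assumptions for illustration, not any real model's embeddings:

```python
import numpy as np

# Hypothetical vocabulary with random unit vectors, stand-ins for real embeddings.
rng = np.random.default_rng(0)
words = ["king", "man", "woman", "queen", "apple", "river"]
vocab = {w: rng.normal(size=50) for w in words}
vocab = {w: v / np.linalg.norm(v) for w, v in vocab.items()}

def analogy_rank(vocab, a, b, c, answer, exclude_inputs=True):
    """Rank of `answer` among candidates by L2 distance to a - b + c (1 = best).

    Conventionally the input words a, b, c are excluded as candidates,
    since the raw arithmetic point often lands nearest one of them.
    """
    target = vocab[a] - vocab[b] + vocab[c]
    dists = {
        w: np.linalg.norm(target - v)
        for w, v in vocab.items()
        if not (exclude_inputs and w in (a, b, c))
    }
    ranked = sorted(dists, key=dists.get)
    return ranked.index(answer) + 1

rank = analogy_rank(vocab, "king", "man", "woman", "queen")
```

With real embeddings, the model "gets it right" iff this rank is 1 – regardless of how large the winning distance is in absolute terms.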

(Also: canonically these exercises were done with cosine similarities, not Euclidean/L2 distances. Rank orders will be roughly the same if all vectors are normalized to the unit sphere before the arithmetic & comparisons, but if you didn't do that, the raw 'distance' values are even less meaningful for evaluating this particular effect. The L2 distance can be arbitrarily high for two vectors whose cosine similarity is identical!)


Replies

jdthedisciple · yesterday at 9:59 AM

> It needs to check if that value is less than that for every other alternative to 'queen'.

There you go: the closest 3 words (by L2 distance) to the output vector for the following models, drawn from the 2265 most common spoken English words (which include "queen"):

    voyage-3-large:             king (0.46), woman (0.47), young (0.52), ... queen (0.56)
    ollama-qwen3-embedding:4b:  king (0.68), queen (0.71), woman (0.81)
    text-embedding-3-large:     king (0.93), woman (1.08), queen (1.13)
All embeddings are normalized to unit length, so the L2 distances are bounded and comparable across models (and rank-equivalent to cosine similarity).
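Since the vectors are unit-normalized, the distances above can be read back as cosine similarities via cos = 1 − d²/2 (from ||u − v||² = 2(1 − cos)). A quick conversion, using the voyage-3-large numbers from the table as inputs:

```python
def l2_to_cosine(d):
    """For unit vectors, ||u - v||^2 = 2(1 - cos), so cos = 1 - d^2 / 2."""
    return 1 - d**2 / 2

# L2 distances reported above for voyage-3-large
cosines = {w: l2_to_cosine(d) for w, d in
           [("king", 0.46), ("woman", 0.47), ("queen", 0.56)]}
```

So an L2 distance of 0.56 for "queen" corresponds to a cosine similarity of about 0.84 – the monotone mapping means the rank order is identical either way.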