Nice work! Here is a similar idea: https://wordassociations.net/en
In French, there is a game to build relations with words (they provide a word, and you have to type the most related words): https://www.jeuxdemots.org They reached 677 million relations in 2024!
This is very cool. In puzzlehunts, we often use tools to assist with solving and writing puzzles (the classic example is https://nutrimatic.org ).
Years ago, I wrote a puzzlehunt puzzle that involved navigating through words where an edge existed if the two words formed a common 2-gram (that is, they often appeared one after another in a text dump of Wikipedia).
For example, a fragment of the graph from the puzzle is: mit -> press -> office <- post <- blog.
This work is obviously much more advanced, and it's very cool to see that they managed to make it work with semantic connections. I was able to get away with a much simpler approach since I only cared about 2-grams over a set of about 1000 words (I literally used a grep command over the entire text of the English wikipedia; it took about a day to run).
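The 2-gram edge idea above can be sketched in a few lines: count adjacent word pairs in a corpus and keep the frequent pairs as graph edges. This is a toy illustration, not the author's actual pipeline; the threshold and the corpus text here are placeholders.

```python
# Sketch: build "common 2-gram" edges by counting adjacent word pairs.
from collections import Counter

# Toy stand-in for a text dump of Wikipedia.
text = "mit press release and press office near the post office blog post office"
words = text.lower().split()

# Count every adjacent word pair.
bigrams = Counter(zip(words, words[1:]))

# Keep pairs that occur often enough to count as "common" (arbitrary cutoff).
THRESHOLD = 2
edges = {pair for pair, n in bigrams.items() if n >= THRESHOLD}
print(edges)  # {('post', 'office')} with this toy text
```

Over a real corpus you'd want a much higher threshold and some normalization (lowercasing, stripping punctuation), but the core of the approach is just this pair count.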
But the core idea is shared: 1) wanting to build a graph representation of word connections for a puzzle, 2) it being way too much work to do that manually, 3) you would miss a bunch of edges if you did do it manually, so 4) use programming tools to construct a dataset, and then 5) the end result is surprisingly fun for the user because the dataset is comprehensive and it feels really natural.
If anyone is curious, the puzzlehunt puzzle is here: https://dhashe.com/files/puzzles/word-wide-web.pdf
And the solution is here: https://dhashe.com/files/puzzles/word-wide-web-sol.pdf
And a fair warning to anyone unfamiliar with puzzlehunt puzzles: they do not come with instructions and it is very common to get stuck when solving them, especially when solving them alone. You have not completely solved a puzzlehunt puzzle until you extract an answer word or phrase from the puzzle. This one has an extra layer after filling in the words in the graph. Peeking at the solution is encouraged if you get stuck.
I remember in college I got all stoned in the library and determined that you could find a semantic pathway using synonyms to relate completely opposite terms with only a few nodes. Completely blew my mind and I still think about it sometimes.
Such an amazing data set with the amount of curation you’ve done and the care with which it’s been put together.
It’d be highly valuable as a thesaurus API.
I was looking for a similar app for my upcoming book! At times it’s very hard to get the word that we are looking for and hope this solves it!
I know this is not related to the app but still wanted to appreciate the thought
My pedantry made me write this, and it is by no means a criticism of the overall article, but of the examples given in the first animation: "Batman" does not need 4 jumps to reach "inspect" (e.g. Batman -> detective -> inspect), and "emerald" doesn't really connect with "foliage" enough for me; I'd suggest it needs a "green" in between to really make sense.
I really liked this article, and these types of analyses always capture me. I just had to try out the game afterward. I nailed the link to "moon" from "rise" on my first try. Then I was a bit let down by my first real task: getting to "chill" starting from "chain." I went first to conglomerate, then corporation, then management, thinking I would at some point encounter "cold" and then "chill." Unfortunately not. Then I tried from chain to something like (my memory is imperfect here) necklace, jewelry, brilliance, glow, tranquil, calm, and on a couple of other tries appease, mollify, relax, but could never get to "chill." I was eventually able to win by appealing to temperature, which led me to chill.
Is there anything the user could do to modify the next steps, other than picking a word? Perhaps selecting some sort of valence related to metaphor or meaning? "I want to pick 'pacify', but in the sense of calming down, not to utterly destroy."
How is a largely text-based app 3.47 GB? Is the dictionary/semantic DB just that large or is there other stuff going on?
I don't find many of these transitions very appealing. Sweet to harmony? Seems a stretch. Nightjar to chirring to bombylious? Might as well be gobbledygook.
I wanted to try out your app, but I cancelled the download after noticing that it's 3.5 gigabytes. How?! That's by far the biggest iOS app I've ever seen.
Any plans to launch on Android, or perhaps a browser-based web version?
Something about this website makes scrolling lag even with Javascript disabled. Firefox 128 on Linux.
Very interesting topic though.
Dissociating English terms from their context and focusing on the ease of relationship is a hilariously bad habit that people actively are trained AWAY from using. The nuance of English is absolutely going to break AI because even the example of “strong” relationships are suspect in utility.
Seriously, when is the last time a casual speaker, writer, or translator used “domicile” in place of “house” in your world? It’s an archaic term appropriated into legal jargon. Flattening out language and drawing lines between terms is funny to me.
The only issue is normalizing “Thesaurus bashing” type mentalities - like this - to degrade the value of coherent, purposeful, meaningful use of English. It’s an amalgamation language with extremely difficult fluency. It’s rife with idioms and contradictory emotional context.
Oh well, I can grasp that I tend to yell at clouds when it comes to this sort of thing. It doesn’t change my opinion this is a harmful exercise and probably should not exist. There are few instances where playing a game will actually make one more stupid, but here we are.
We built a 1.5M word semantic network where any two words connect in ~6.43 hops (76% connect in ≤7). The hard part wasn't the graph theory, it was getting rich, non-obvious associations. GPT-4's associations were painfully generic: "coffee → beverage, caffeine, morning." But we discovered LLMs excel at validation, not generation.

Our solution: Mine Library of Congress classifications (648k of them, representing 125 years of human categorization). "Coffee" appears in 2,542 different book classifications, from "Coffee trade—Labor—Guatemala" to "Coffee rust disease—Hawaii." Each classification became a focused prompt for generating domain-specific associations.

Then we inverted the index: which classifications contain both "algorithm" and "fractals"? Turns out: "Mathematics in art" and "Algorithmic composition." This revealed connections like algorithm→Fibonacci→golden ratio that pure co-occurrence or word vectors miss.

The "Montreal Effect" nearly tanked the project: geographic contamination where "bagels" spuriously linked to "Expo 67" because Montreal is famous for bagels. We used LLMs to filter true semantic relationships from geographic coincidence.

Technical details: 80M API calls, superconnector deprecation (inverse document frequency variant), morphological deduplication. Built for a word game but the dataset has broader applications.
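The inverted-index step and the IDF-style superconnector downweighting can be sketched with toy data. The classification names below are taken from the comment; the word sets attached to them are made up for illustration, and this is in no way the project's actual code.

```python
# Sketch: invert classification -> words into word -> classifications,
# then intersect to find candidate semantic bridges between two words.
import math
from collections import defaultdict

# Toy stand-in for 648k Library of Congress classifications.
classifications = {
    "Mathematics in art": {"algorithm", "fractals", "golden ratio"},
    "Algorithmic composition": {"algorithm", "fractals", "music"},
    "Coffee trade--Labor--Guatemala": {"coffee", "labor", "guatemala"},
}

# Invert the index: word -> set of classifications mentioning it.
index = defaultdict(set)
for cls, words in classifications.items():
    for w in words:
        index[w].add(cls)

def shared_classifications(a, b):
    """Classifications containing both words: candidate bridges."""
    return index[a] & index[b]

def idf(word):
    """Downweight superconnector words that appear almost everywhere."""
    total = len(classifications)
    return math.log(total / (1 + len(index[word])))

print(sorted(shared_classifications("algorithm", "fractals")))
# ['Algorithmic composition', 'Mathematics in art']
```

A word like "history" would sit in a huge fraction of classifications, so its IDF score approaches zero and its edges can be pruned first, which is one plausible reading of "superconnector deprecation."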
What other word games do people enjoy? My favorites on iOS
Alpha Omega
Sticky Terms (I struggle with this)
Typeshift
Blackbar (old, not maintained, but we can still play. Not a game in strict sense, very enjoyable)
I really enjoyed the article, reading it more from the perspective of what 21st-century lexicography could be than as a customer of a word game, however thoughtfully designed. As a Wiktionary editor (and an Android user who has grown out of bare word-relationship puzzle games), though, it's sad that there seems to be no way to just use the end-product network as a reference, which I would love to do; but I suppose they did spend a million bucks on it.
I'll also use this post to wish that more people would edit Wiktionary. It has such a good mission (information on all words) and yet there are only like 80 people editing on any given day or whatever. In some languages, it's even the best or most updated dictionary available. The barriers to entry and bureaucracy are really not high for HN audience types.