Hacker News

michaeld123 · last Tuesday at 3:14 PM · 3 replies

We built a 1.5M word semantic network where any two words connect in ~6.43 hops (76% connect in ≤7). The hard part wasn't the graph theory—it was getting rich, non-obvious associations. GPT-4's associations were painfully generic: "coffee → beverage, caffeine, morning." But we discovered LLMs excel at validation, not generation.

Our solution: mine Library of Congress classifications (648k of them, representing 125 years of human categorization). "Coffee" appears in 2,542 different book classifications—from "Coffee trade—Labor—Guatemala" to "Coffee rust disease—Hawaii." Each classification became a focused prompt for generating domain-specific associations.

Then we inverted the index: which classifications contain both "algorithm" and "fractals"? Turns out: "Mathematics in art" and "Algorithmic composition." This revealed connections like algorithm → Fibonacci → golden ratio that pure co-occurrence or word vectors miss.

The "Montreal Effect" nearly tanked the project—geographic contamination where "bagels" spuriously linked to "Expo 67" because Montreal is famous for bagels. We used LLMs to filter true semantic relationships from geographic coincidence.

Technical details: 80M API calls, superconnector deprecation (an inverse document frequency variant), morphological deduplication. Built for a word game, but the dataset has broader applications.
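
A minimal sketch of how the ~6.43-hop figure can be estimated: sample random word pairs and run a BFS over the association graph stored as an adjacency dict. The storage format and sample size are illustrative, not the actual pipeline.

```python
# Estimate mean hops and the share of pairs connected within 7 hops by
# sampling random word pairs and running BFS on an adjacency-dict graph.
import random
from collections import deque

def hops(graph, source, target):
    """Unweighted shortest-path length via BFS; None if the pair is unreachable."""
    if source == target:
        return 0
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

def sample_hop_stats(graph, samples=10_000):
    """Return (mean hops over connected pairs, fraction of sampled pairs within 7 hops)."""
    words = list(graph)
    lengths = [hops(graph, *random.sample(words, 2)) for _ in range(samples)]
    connected = [h for h in lengths if h is not None]
    within_7 = sum(1 for h in connected if h <= 7) / samples
    return sum(connected) / len(connected), within_7
```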
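
The inverted-index step, as a minimal sketch with toy data (the real input would be the 648k classification strings and their generated associations):

```python
from collections import defaultdict

def invert(classification_to_words):
    """Turn classification -> words into word -> classifications."""
    index = defaultdict(set)
    for classification, words in classification_to_words.items():
        for word in words:
            index[word].add(classification)
    return index

def shared_classifications(index, a, b):
    """Classifications that contain both words."""
    return index[a] & index[b]

# Toy data echoing the examples in the post.
classification_to_words = {
    "Mathematics in art": {"algorithm", "fractals", "golden ratio"},
    "Algorithmic composition": {"algorithm", "fractals", "music"},
    "Coffee trade--Labor--Guatemala": {"coffee", "labor", "Guatemala"},
}
index = invert(classification_to_words)
print(shared_classifications(index, "algorithm", "fractals"))
# {'Mathematics in art', 'Algorithmic composition'}
```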
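
A hedged sketch of the validation-not-generation idea behind the Montreal Effect filter. The actual prompts and model aren't shown here; this assumes the OpenAI Python client, and the model name and wording are illustrative.

```python
# Validation, not generation: ask a narrow yes/no question about a candidate
# pair instead of asking the model to free-associate.
from openai import OpenAI

client = OpenAI()

def is_semantic_association(a: str, b: str) -> bool:
    prompt = (
        f"Are '{a}' and '{b}' related by meaning or a shared concept, rather "
        f"than only by appearing in the same place or event (for example, two "
        f"things that are merely both famous in the same city)? Answer YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Expected behaviour on the post's example:
#   is_semantic_association("bagel", "Expo 67")   -> False (geographic coincidence)
#   is_semantic_association("coffee", "caffeine") -> True
```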
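
One plausible reading of "superconnector deprecation (inverse document frequency variant)": down-weight edges that touch words appearing in a huge share of classifications, so generic hubs don't shortcut every path. The exact weighting isn't specified, so this is only a sketch.

```python
# IDF-style weighting: words that occur in many classifications (e.g. "history",
# "science") get a low weight, and an edge is penalized if either endpoint is
# such a hub. Illustrative only.
import math

def idf_weight(word, word_to_classifications, total_classifications):
    """Low for superconnectors, high for specific words."""
    df = len(word_to_classifications.get(word, ()))
    return math.log((1 + total_classifications) / (1 + df))

def edge_weight(a, b, word_to_classifications, total_classifications=648_000):
    """Edges between two generic hubs are strongly deprecated."""
    return (idf_weight(a, word_to_classifications, total_classifications)
            * idf_weight(b, word_to_classifications, total_classifications))
```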
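
Morphological deduplication could look something like the following, using NLTK's WordNet lemmatizer as an assumed stand-in for whatever normalization was actually applied:

```python
# Collapse inflected forms ("run", "running", "runs") onto a single node
# before building edges. Requires: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def canonical(word: str) -> str:
    """Map a surface form to a canonical node key (assumed normalization)."""
    w = word.lower()
    noun = lemmatizer.lemmatize(w, pos="n")
    verb = lemmatizer.lemmatize(w, pos="v")
    # Prefer the shorter lemma; ties go to the noun reading.
    return noun if len(noun) <= len(verb) else verb

print(canonical("Running"))  # "run"
print(canonical("bagels"))   # "bagel"
```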


Replies

gagzilla · last Tuesday at 4:18 PM

Very cool and fascinating. I wonder if there are other insights that can be drawn from what you've built. Like which two words (or such pairs) have the longest sequence of hops to connect? Or what are the top "superconnectors"? Or if there is a plausible correlation between how well a word is connected to how old it is?
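
One way those questions could be probed, assuming the graph can be exported as an edge list and loaded into networkx (the file name and loading step are hypothetical):

```python
# Probe the graph for top superconnectors and a longest-hop word pair.
import random
import networkx as nx

G = nx.read_edgelist("word_graph.edgelist")  # hypothetical dump of association pairs

# Top "superconnectors": the highest-degree words.
top_hubs = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:20]
print(top_hubs)

# Longest chain of hops: the exact diameter of a 1.5M-node graph is expensive,
# but a double-BFS sweep gives a cheap lower bound plus a concrete word pair.
start = random.choice(list(G.nodes))
far1 = max(nx.single_source_shortest_path_length(G, start).items(), key=lambda kv: kv[1])[0]
dists = nx.single_source_shortest_path_length(G, far1)
far2, lower_bound = max(dists.items(), key=lambda kv: kv[1])
print(f"diameter lower bound: {lower_bound} hops, between {far1!r} and {far2!r}")
```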

marviel · last Tuesday at 3:50 PM

Thanks for sharing!

Which embedding types did you try? I'm surprised that embeddings weren't able to take you further with this.
