Surprised that no comment mentioned that there is a standard term (not a word :P) for the set of words that denominates a particular concept: nominal syntagm. Such as "boiling water" and also "that green parrot we saw yesterday over the left branch".
Also the slider examples are abysmal. "I love you", "Go home" and "How are you" are not words by any stretch of imagination. For someone who makes word games, I don't see a particularly deep love of words here.
Edit: Obligatory reference to Borges's Tlön: https://en.wikipedia.org/wiki/Tl%C3%B6n,_Uqbar,_Orbis_Tertiu...
If the first example was "monkey wrench" instead of "boiling water", we'd never have seen the article.
A compound word isn't just a phrase. The latter is a group of words that indicate a single concept. The former is a new word that has a distinct meaning from the subwords that compose it. "I love you" is an example of a clausal phrase. The meaning is entirely evident from the words that compose it. In contrast, a "hot dog" is not a particularly warm canine, and has its own OED entry [0] as a compound word.
And some of the entries on this list are wrong. "Good night" exists in OED as "goodnight" [1] because there are multiple ways it's used. One is the clausal phrase "I hope you have a good night", which can be modified by changing the adjective, e.g. "great night" or "terrible night". "Goodnight" the bedtime ritual can't be modified the same way, so OED chooses to write it as a compound word without spaces.
There are nearly half a million compound phrases that aren’t in any dictionary—simply because they contain spaces. “Boiling water.” “Saturday night.” “Help me.”
I would hope that none of those examples were taking up space in a dictionary.
As far as my limited knowledge of linguistics goes, the technical term is actually "collocations."
To me, any discussion of this topic that doesn't mention collocations signals an amateurish approach.
I also disagree with the premise that "this was not possible before LLM." That's nonsense. Linguists created many dictionaries of collocations for different languages, so that work is precisely what they did!
(Before any LLM zealots attack me, yes, it is now possible to have a more exhaustive list of collocations thanks to LLMs. This doesn't contradict my point.)
Examples of collocation dictionaries:
The name for these is "collocations".
Collocation dictionaries are lists of collocations. The reason they're absent from single-word dictionaries is that there are about 25x more collocations than single words.
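For what it's worth, one classic pre-LLM way to surface collocation candidates is pointwise mutual information (PMI) over adjacent word pairs. Below is a minimal sketch on a made-up toy corpus. The function name `bigram_pmi`, the `min_count` threshold, and the sample sentences are all my own illustrative choices, not anything from the article or from a real collocation dictionary.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ); a high score means the
    pair co-occurs more often than its parts' frequencies would suggest.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:  # rare pairs give unreliable PMI estimates
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

# Toy corpus, invented purely for illustration.
corpus = (
    "the boiling water spilled . she poured boiling water on the tea . "
    "the water was cold . the kettle was boiling water again ."
).split()

scores = bigram_pmi(corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])
```

On a real corpus you would want a much higher `min_count` and probably a different association measure as well, since PMI is known to favor rare pairs.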
While 'this analysis would not have been possible without LLM', I am not sure the LLM analysis was well reviewed after it was done. From the obscure/familiar word list, some of the n-grams, e.g. "is resource", "seq size", "db xref", surely occur in the wild (as we well know), but I doubt we can argue they are missing from the dictionary. Knowing the realm, I would argue none of them are words, not even collocations. If "is resource" is, why not "has resource"? So while the path is surely interesting, this analysis does lack scrutiny, which you would expect from a high-level LLM analysis.
Examples of "obscure" compound words include "list uids", "beg pos", "sync binlog", "gfp mask", "av fetch", "str idx", "seq ptr", "ai family", "fmt vuln", "ai socktype", "curr tok", "nbits set", "ini get", "s1 s2", "in addr", "num get", "res init", "sess ref", and "ai addrlen".
Well I can't even.
"book steaks" is in the list, but I don't think it's real. Maybe it was supposed to be "stack".
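Entries like "sync binlog", "gfp mask" and "ai family" look a lot like snake_case identifiers from source code with the underscores turned into spaces. That is only a guess about where they came from, not a description of the article's actual pipeline. A small sketch of the transformation, where `identifier_to_phrase` is a hypothetical helper:

```python
import re

def identifier_to_phrase(identifier):
    """Split a snake_case identifier on underscores and join with spaces."""
    parts = [p for p in re.split(r"_+", identifier.lower()) if p]
    return " ".join(parts)

# Identifiers that exist in real codebases (MySQL, Linux kernel, sockets API).
identifiers = ["sync_binlog", "gfp_mask", "ai_family", "in_addr", "res_init"]
phrases = [identifier_to_phrase(name) for name in identifiers]
print(phrases)
```

If that is what happened, filtering out candidates that also occur as underscore-joined identifiers in the source corpus would probably remove most of them.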
I don't think 'Words with spaces' is a thing.
I think maybe the word the author is looking for is 'phrase'
I imagine that languages like German that create composites of nouns have less of a problem with this:
English: cream of mushroom soup
Spanish: sopa cremosa de champiñones
German: Champignoncremesuppe
This boils down to an "is Pluto a planet" debate.
We act as if some languages have "compound words" that can encompass entire sentences (subject & object attaching to the verb as prefixes or suffixes) while others don't form compounds, and most are somewhere in between. But these are all statements about lexicographic conventions and say nothing about the languages. In reality all languages are muddles sprawling across a multidimensional continuum, and they abso-frigging-lutely don't sit neatly in such pigeonholes.
Hah, I wonder how thick a German, Dutch or Afrikaans dictionary would be if it included all possible spaceless compound words. Literally any concept can be compounded together to make a new word.
Rooivleisslaghuisinspekteur =
Rooi = red
Vleis = meat
Slag = butcher
Huis = house
Inspekteur = inspector
"Inspector who controls the quality of red meat in butcheries"
>Spanish carves up time with precision English lacks: madrugada for the pre-dawn hours, atardecer for late afternoon waning into evening. The mid-day nap was so compelling we adopted the siesta into English.
"I used to smoke marijuana. But I’ll tell you something: I would only smoke it in the late evening. Oh, occasionally the early evening, but usually the late evening -- or the mid evening. Just the early evening, mid evening and late evening. Occasionally, early afternoon, early midafternoon, or perhaps the late-midafternoon. Oh, sometimes the early-mid-late-early morning... But never at dusk." -Steve Martin
“Hospital bills”. That’s very country specific. Also, that’s two words.
Dictionaries are a mixed bag at best. If you apply David Kaplan’s character/content distinction from Demonstratives, you have to ask: should pure indexicals, which are essentially 'contentless' pointers, be treated the same way as standard words? Let alone the thousands of rigid designators in this dataset that map directly to specific objects in the real world. At a certain point, is there no room left for encyclopedias?
I got into solving the NYT crossword during Covid. I couldn’t solve a Monday when I started; now I do Mondays downs-only and look forward to Saturdays. Along the way, I developed a sixth sense for when an answer will be more than one word. I’ve thought a lot about it and can’t really describe how I do it. (Some other puzzles clarify if an answer spans multiple words, but I find the ambiguity adds to the fun.)
> Got a word: frozen water → ice
> Didn’t: boiling water
Freezing water doesn’t have a word. Boiled water does have a word.
"to be" is a very weird example because that's just the full infinitive of "be", which is definitely in dictionaries: https://www.merriam-webster.com/dictionary/be
These are under-respected for non-native English speakers.
Clearly those Irish monks are to blame.
Very cool project! Reminds me of Chiang's great short story 'The Truth of Fact, the Truth of Feeling':
> “If you speak slowly, you pause very briefly after each word. Thatʼs why we leave a space in those places when we write. Like this: How. Many. Years. Old. Are. You?” He wrote on his paper as he spoke, leaving a space every time he paused: Anyom a ou kuma a me?
> “But you speak slowly because youʼre a foreigner. Iʼm Tiv, so I donʼt pause when I speak. Shouldnʼt my writing be the same?”
Fascinating! I’d add “word nerd” to the list to describe the authors.
With Twain in mind, might I suggest we adopt the simple expedient of snake casing such terms.
On another note, I always wished "never mind" was spelled "nevermind"
"Opaque MWE"? Does no one know the word "idiom"?
> “Boiling water” isn’t “water that happens to be boiling.” It’s a hazard, a cooking stage, a state of matter
I guess we'll have to disagree then, because "boiling water" is "water that's boiling" to me. It's not a different state of matter to "water", that would be "steam". It being a hazard doesn't mean it's a singular concept, same as "wet floor"