Hacker News

Half million 'Words with Spaces' missing from dictionaries

76 points by gligierko last Monday at 5:15 PM | 126 comments

Comments

voidUpdate today at 9:46 AM

> “Boiling water” isn’t “water that happens to be boiling.” It’s a hazard, a cooking stage, a state of matter

I guess we'll have to disagree then, because "boiling water" is "water that's boiling" to me. It's not a different state of matter to "water", that would be "steam". It being a hazard doesn't mean it's a singular concept, same as "wet floor"

harperlee today at 11:46 AM

Surprised that no comment mentioned that there is a standard term (not a word :P) for the set of words that denominates a particular concept: nominal syntagm. Such as "boiling water" and also "that green parrot we saw yesterday over the left branch".

Also the slider examples are abysmal. "I love you", "Go home" and "How are you" are not words by any stretch of imagination. For someone who makes word games, I don't see a particularly deep love of words here.

Edit: Obligatory reference to Borges's Tlön: https://en.wikipedia.org/wiki/Tl%C3%B6n,_Uqbar,_Orbis_Tertiu...

oh_my_goodness today at 12:16 PM

If the first example was "monkey wrench" instead of "boiling water", we'd never have seen the article.

AlotOfReading last Monday at 6:17 PM

A compound word isn't just a phrase. The latter is a group of words that indicate a single concept. The former is a new word that has a distinct meaning from the subwords that compose it. "I love you" is an example of a clausal phrase. The meaning is entirely evident from the words that compose it. In contrast, a "hot dog" is not a particularly warm canine, and has its own OED entry [0] as a compound word.

And some of the entries on this list are wrong. "Good night" exists in OED as "goodnight" [1] because there are multiple ways it's used. One is the clausal phrase "I hope you have a good night", which can be modified by changing the adjective, e.g. "great night" or "terrible night". "Goodnight" the bedtime ritual can't be modified the same way, so OED chooses to write it as a compound word without spaces.

[0] https://www.oed.com/dictionary/hot-dog_n

[1] https://www.oed.com/dictionary/goodnight_n

dec0dedab0de last Monday at 5:50 PM

> There are nearly half a million compound phrases that aren’t in any dictionary—simply because they contain spaces. “Boiling water.” “Saturday night.” “Help me.”

I would hope that none of those examples were taking up space in a dictionary.

Shorel today at 10:48 AM

As far as my limited knowledge of linguistics goes, the technical term is actually "collocations."

To me, any discussion of this topic that doesn't mention collocations signals an amateurish approach.

I also disagree with the premise that "this was not possible before LLM." That's nonsense. Linguists created many dictionaries of collocations for different languages, so that work is precisely what they did!

(Before any LLM zealots attack me, yes, it is now possible to have a more exhaustive list of collocations thanks to LLMs. This doesn't contradict my point.)

Examples of collocation dictionaries:

https://www.freecollocation.com/

https://ozdic.com/

kelseyfrog last Monday at 6:06 PM

The name for these is "collocations".

Collocation dictionaries are lists of collocations. The reason they're absent from single-word dictionaries is that there are about 25x more collocations than single words.

thmpp last Monday at 6:13 PM

While 'this analysis would not have been possible without LLM', I am not sure the LLM analysis was well reviewed after it was done. From the obscure/familiar word list, some of the n-grams, e.g. "is resource", "seq size", "db xref", surely happen in the wild (as we well know), but I doubt we can argue they are missing from the dictionary. Knowing the realm, I would argue none of them are words, or even collocations. If "is resource" is, why not "has resource"? So while the path is surely interesting, this analysis lacks scrutiny, which you would expect from a high-level LLM analysis.

WesolyKubeczek today at 9:52 AM

Examples of "obscure" compound words include "list uids", "beg pos", "sync binlog", "gfp mask", "av fetch", "str idx", "seq ptr", "ai family", "fmt vuln", "ai socktype", "curr tok", "nbits set", "ini get", "s1 s2", "in addr", "num get", "res init", "sess ref", and "ai addrlen".

Well I can't even.

riffraff today at 12:02 PM

"book steaks" is in the list, but I don't think it's real. Maybe it was supposed to be "stack".

danesparza last Monday at 6:11 PM

I don't think 'Words with spaces' is a thing.

I think maybe the word the author is looking for is 'phrase'

ndr42 last Monday at 5:43 PM

I imagine that languages like German that create composites of nouns have less of a problem with this:

English: cream of mushroom soup

Spanish: sopa cremosa de champiñones

German: Champignoncremesuppe

MarkusQ last Monday at 6:26 PM

This boils down to an "is Pluto a planet" debate.

We act as if some languages have "compound words" that can encompass entire sentences (subject & object attaching to the verb as prefixes or suffixes) while others don't form compounds, and most are somewhere in between. But these are all statements about lexicographic conventions and say nothing about the languages. In reality all languages are muddles sprawling across a multidimensional continuum, and they abso-frigging-lutely don't sit neatly in such pigeonholes.

beAbU last Monday at 9:50 PM

Hah, I wonder how thick a German, Dutch or Afrikaans dictionary would be if it included all possible spaceless compound words. Literally any concept can be compounded together to make a new word.

Rooivleisslaghuisinspekteur =

Rooi = red

Vleis = meat

Slag = butcher

Huis = house

Inspekteur = inspector

"Inspector who controls the quality of red meat in butcheries"

DonHopkins today at 10:24 AM

>Spanish carves up time with precision English lacks: madrugada for the pre-dawn hours, atardecer for late afternoon waning into evening. The mid-day nap was so compelling we adopted the siesta into English.

"I used to smoke marijuana. But I’ll tell you something: I would only smoke it in the late evening. Oh, occasionally the early evening, but usually the late evening -- or the mid evening. Just the early evening, mid evening and late evening. Occasionally, early afternoon, early midafternoon, or perhaps the late-midafternoon. Oh, sometimes the early-mid-late-early morning... But never at dusk." -Steve Martin

below43 last Monday at 6:04 PM

“Hospital bills”. That’s very country specific. Also, that’s two words.

speak_plainly last Monday at 6:21 PM

Dictionaries are a mixed bag at best. If you apply David Kaplan’s character/content distinction from Demonstratives, you have to ask: should pure indexicals, which are essentially 'contentless' pointers, be treated the same way as standard words? Let alone the thousands of rigid designators in this dataset that map directly to specific objects in the real world. At a certain point, is there no room left for encyclopedias?

johnhamlin last Monday at 6:19 PM

I got into solving the NYT crossword during Covid. I couldn’t solve a Monday when I started; now I do Mondays downs-only and look forward to Saturdays. Along the way, I developed a sixth sense for when an answer will be more than one word. I’ve thought a lot about it and can’t really describe how I do it. (Some other puzzles clarify if an answer spans multiple words, but I find the ambiguity adds to the fun.)

kgwgk last Monday at 5:50 PM

    > Got a word           Didn’t
    > frozen water → ice   boiling water
Freezing water doesn’t have a word. Boiled water does have a word.

alecbz last Monday at 6:23 PM

"to be" is a very weird example because that's just the full infinitive of "be" which is definitely in dictionaries: https://www.merriam-webster.com/dictionary/be

happycat5000 last Monday at 5:18 PM

These are under-respected for non-native English speakers.

anotherhue last Monday at 5:46 PM

Clearly those Irish monks are to blame.

grantpitt last Monday at 5:45 PM

Very cool project! Reminds me of Chiang's great short story 'The Truth of Fact, the Truth of Feeling':

> “If you speak slowly, you pause very briefly after each word. Thatʼs why we leave a space in those places when we write. Like this: How. Many. Years. Old. Are. You?” He wrote on his paper as he spoke, leaving a space every time he paused: Anyom a ou kuma a me?

> “But you speak slowly because youʼre a foreigner. Iʼm Tiv, so I donʼt pause when I speak. Shouldnʼt my writing be the same?”

johnhamlin last Monday at 6:22 PM

Fascinating! I’d add “word nerd” to the list to describe the authors.

aaroninsf last Monday at 6:03 PM

With Twain in mind, might I suggest we adopt the simple expedient of snake casing such terms.

hmokiguess last Monday at 6:15 PM

On another note, I always wished "never mind" was spelled "nevermind"

JackFr last Monday at 6:15 PM

"Opaque MWE"? Does no one know the word "idiom"?
