According to the CEO of Medium, the reason is that their founder, Ev Williams, was a fan of typography and asked that their software automatically convert two hyphens (--) into a single em-dash. Since Medium was then used as a source of high-quality writing for training, he believes AI models picked up a preference for em-dashes from that corpus.
It’s a real pity to me that em-dashes are becoming so disliked for their association with AI. I have long had a personal soft spot for them because I just like them aesthetically and functionally. I prided myself on searching for and correctly using em, en, and regular dashes, had a Google Docs shortcut for turning `- - -` into `—`, and more recently created an Obsidian auto-replacement shortcut that turns `-em` into `—`. Guess I’ll just have to use them sparingly and keep my prose otherwise human.
I would think the most obvious explanation is that they are used as part of the watermark to help OpenAI identify text - i.e. the model isn't doing it at all, but a final-pass process is adding statistical patterns on top of what the model actually generates (along with words like 'delve' and other famous GPT signatures).
I don't have evidence that that's true, but it's what I assume and I'm surprised it's not even mentioned as a possibility.
When I studied author profiling, I built models that could identify specific authors just by how often they used very boring words like 'of' and 'and', given enough text. So I assume OpenAI plays around with some variables like that, which would be much harder for humans to spot, but probably uses several layers of watermarking to make it harder to strip, which results in some 'obvious' ones too.
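To make that concrete, here is a minimal sketch of function-word stylometry, the kind of approach described above. It is not the commenter's actual model: the word list, author names, and texts are placeholder assumptions, and a real setup would use many more features and far more text per author.

```python
# Toy sketch of author attribution from function-word frequencies alone,
# using a nearest-centroid classifier. Authors and texts are made up.
from collections import Counter

FUNCTION_WORDS = ["of", "and", "the", "to", "in", "that", "it", "as", "but", "is"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def centroid(profiles: list[list[float]]) -> list[float]:
    """Average profile across an author's known texts."""
    return [sum(col) / len(col) for col in zip(*profiles)]

def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def attribute(unknown: str, corpora: dict[str, list[str]]) -> str:
    """Pick the author whose average function-word profile is closest."""
    centroids = {author: centroid([profile(t) for t in texts])
                 for author, texts in corpora.items()}
    target = profile(unknown)
    return min(centroids, key=lambda a: distance(centroids[a], target))

# Hypothetical usage: with enough text, these boring-word ratios alone
# separate writers surprisingly well.
corpora = {
    "author_a": ["the cat sat on the mat and it was happy to be in the sun"],
    "author_b": ["of all the things that matter it is that one and only that one"],
}
print(attribute("and so it was that the matter ended as it began", corpora))
```

The same idea motivates the watermarking guess: a vendor could nudge frequencies of innocuous words in ways a simple statistical test would detect but a casual reader never would.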
I am no grammarian, but I feel like em-dashes are an easy way to tie together two different concepts without rewriting the entire sentence to flow more elegantly. (Not to say that em-dashes are inelegant; I like them a lot myself.)
And so AI models are prone to using them because they require less computation than rewriting a sentence.
My first thought was watermarking. Same for its affinity for using emojis in bullet lists.
The "book scanning" hypothesis doesn't sound so bad — but couldn't it simply be OCR bias? I imagine it's pretty easy for OCR software to misrecognize hyphens or other kinds of dashes as em-dashes if the only distinction is some subtle differences in line length.
What we also learned after GPT-3.5 is that, to circumvent the need for new training data, we could simply resort to existing LLMs to generate new, synthetic data. I would not be surprised if the em dash is the product of synthetically generated data (perhaps forced to be present in this data) used for the training of newer models.
Another factor that I think contributes to it, at least partially, is that other languages use em-dashes. Most people use LLMs in English, but that's not the only language they know, and many other languages have pretty specific rules and uses for em-dashes. For example, I see em-dashes regularly in local European newspapers, and I would expect those to be written by a human for the most part, simply because LLM output is not good enough in smaller languages.
I’m now reading Pride and Prejudice (first edition released in 1813) and indeed there are many em dashes. It also includes language patterns the models didn’t pick up (vocabulary, 'to morrow' instead of 'tomorrow').
Historically I would see far more em-dashes in capital-"L" literature than I would in more casual contexts. LLMs assign more weight to literature than to things like Reddit comments or Daily Mail articles.
Are people surprised that training data biases the model toward a distinct style? I'd think that's kind of expected.
The conclusion is really a guess, unfortunately.
I always figured it was because of training on Wikipedia. I used to hate the style zealots (MOStafarians in humorous wiki-jargon) who obsessively enforced typographic conventions like that. Well I still hate them, but I'm sort of thankful that they inadvertently created an AI-detection marker. I've been expecting the AI slop generators to catch on and revert to hyphens though.
I've been using em-dashes in my own writing for years and it's annoying when I get accused of using AI in my posts. I've since switched to using commas, even though it's not the same.
I wonder what happens to all that 18th-century book-scanning data. I imagine it stays proprietary, and I've heard a lot of the books they scan are destroyed afterwards.
My pet theory is similar to the training-set hypothesis: em-dashes appear often in prestige publications such as The Atlantic, The New Yorker, The Economist, and a few others that are considered good writing. Being magazines, there are a lot of articles over time, reinforcing the style. They're also the sort of thing an RLHF rater will think is good, not because of the em-dash but because the general style is polished.
One thing I wondered is whether high-prestige writing is explicitly encoded into the models, but it doesn't seem far-fetched that there are various linkages inside the data saying "this kind of thing should be weighted highly."