According to the CEO of Medium, the reason is that their founder, Ev Williams, was a fan of typography and asked that their software automatically convert two hyphens (--) into a single em-dash. Since Medium was then used as a source of high-quality writing for training, he believes AI models picked up a preference for em-dashes from that corpus.
It’s a real pity to me that em-dashes are becoming so disliked for their association with AI. I have long had a personal soft spot for them because I just like them aesthetically and functionally. I prided myself on searching for and correctly using em, en, and regular dashes, had a Google Docs shortcut for turning `- - -` into `—`, and more recently created an Obsidian auto-replacement shortcut that turns `-em` into `—`. Guess I’ll just have to use them sparingly and keep my prose otherwise human.
I would think the most obvious explanation is that they are used as part of the watermark to help OpenAI identify text - i.e. the model isn't doing it at all, but a final-pass process is adding statistical patterns on top of what the model actually generates (along with words like 'delve' and other famous GPT signatures).
I don't have evidence that that's true, but it's what I assume and I'm surprised it's not even mentioned as a possibility.
When I studied author profiling, I built models that could identify specific authors just by how often they used very boring words like 'of' and 'and', given enough text. So I assume OpenAI plays around with some variables like that, which would be much harder for humans to spot, but probably uses several layers of watermarking to make it harder to strip, which results in some 'obvious' ones too.
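To make that concrete, here is a minimal sketch of function-word stylometry, the kind of approach described above. It is not the commenter's actual model: the word list, author names, and texts are placeholder assumptions, and a real setup would use many more features and far more text per author.

```python
# Toy sketch of author attribution from function-word frequencies alone,
# using a nearest-centroid classifier. Authors and texts are made up.
from collections import Counter

FUNCTION_WORDS = ["of", "and", "the", "to", "in", "that", "it", "as", "but", "is"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def centroid(profiles: list[list[float]]) -> list[float]:
    """Average profile across an author's known texts."""
    return [sum(col) / len(col) for col in zip(*profiles)]

def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def attribute(unknown: str, corpora: dict[str, list[str]]) -> str:
    """Pick the author whose average function-word profile is closest."""
    centroids = {author: centroid([profile(t) for t in texts])
                 for author, texts in corpora.items()}
    target = profile(unknown)
    return min(centroids, key=lambda a: distance(centroids[a], target))

# Hypothetical usage: with enough text, these boring-word ratios alone
# separate writers surprisingly well.
corpora = {
    "author_a": ["the cat sat on the mat and it was happy to be in the sun"],
    "author_b": ["of all the things that matter it is that one and only that one"],
}
print(attribute("and so it was that the matter ended as it began", corpora))
```

The same idea motivates the watermarking guess: a vendor could nudge frequencies of innocuous words in ways a simple statistical test would detect but a casual reader never would.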
I am no grammarian, but I feel like em-dashes are an easy way to tie together two different concepts without rewriting the entire sentence to flow more elegantly. (Not to say that em-dashes are inelegant; I like them a lot myself.)
And so AI models are prone to using them because they require less computation than rewriting a sentence.
My first thought was watermarking. Same for its affinity for using emojis in bullet lists.
The "book scanning" hypothesis doesn't sound so bad — but couldn't it simply be OCR bias? I imagine it's pretty easy for OCR software to misrecognize hyphens or other kinds of dashes as em-dashes if the only distinction is some subtle differences in line length.
What we also learned after GPT-3.5 is that, to circumvent the need for new training data, we could simply resort to existing LLMs to generate new, synthetic data. I would not be surprised if the em dash is the product of synthetically generated data (perhaps forced to be present in this data) used for the training of newer models.
Another factor that I think contributes to it, at least partially, is that other languages use em-dashes. Most people use LLMs in English, but that's not the only language they know, and many other languages have pretty specific rules and uses for em-dashes. For example, I see em-dashes regularly in local European newspapers, and I would expect those to be written by a human for the most part, simply because LLM output is not good enough in smaller languages.
I’m now reading Pride and Prejudice (first edition released in 1813) and indeed there are many em dashes. It also includes language patterns the models didn’t pick up (vocabulary, 'to morrow' instead of 'tomorrow').
Historically I would see far more em-dashes in capital-"L" literature than I would in more casual contexts. LLMs assign more weight to literature than to things like Reddit comments or Daily Mail articles.
Are people surprised that training data biases the model toward a distinct style? I'd think that's kind of expected.
The conclusion is really a guess, unfortunately.
I always figured it was because of training on Wikipedia. I used to hate the style zealots (MOStafarians in humorous wiki-jargon) who obsessively enforced typographic conventions like that. Well I still hate them, but I'm sort of thankful that they inadvertently created an AI-detection marker. I've been expecting the AI slop generators to catch on and revert to hyphens though.
I've been using em-dashes in my own writing for years and it's annoying when I get accused of using AI in my posts. I've since switched to using commas, even though it's not the same.
I wonder what happens to all that 18th-century book-scanning data. I imagine it stays proprietary, and I've heard a lot of the books they scan are destroyed afterwards.
My pet theory is similar to the training-set hypothesis: em-dashes appear often in prestige publications such as The Atlantic, The New Yorker, The Economist, and a few others that are considered good writing. Being magazines, there are a lot of articles over time, reinforcing the style. They're also the sort of thing an RLHF rater will think is good, not because of the em-dash but because the general style is polished.
One thing I wondered is whether high-prestige writing is explicitly encoded into the models, but it doesn't seem far-fetched that there are various linkages inside the data saying "this kind of thing should be weighted highly."