The AIs aren't using em-dashes because they're "massively represented in the training data". I don't understand why people assume everything in a model's output is directly determined by its frequency in the pretraining data.
They're em-dashing because the style guide used in post-training makes them em-dash. Just like the post-training for GPT-3.5 made it speak African English, and the post-training for 4o makes it say stuff like "it's giving wild energy when the vibes are on peak" plus a bunch of random emoji.
> Just like the post-training for GPT-3.5 made it speak African English
This is a misunderstanding. At best, some people thought GPT-3.5's output resembled African English; resemblance isn't evidence that post-training deliberately produced it.