I suspect a lot of the em-dash usage also comes from transcriptions of verbal media. When people speak, they make a lot of the kind of asides that call for an em-dash once written down.
I would bet like a dollar that the supposed em-dash usage (which I'm not convinced is an accurate take in the first place) comes from an enterprising dev somewhere going "Well, we probably don't need multiple tokens for hyphens" and coercing every dash-type thing to a single hyphen-like token.
But I'm also showing off my ignorance of how these machines turn text into tokens in practice.
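
For what it's worth, the kind of coercion I'm imagining would look something like the sketch below. It is purely hypothetical: the names `DASH_LIKE` and `collapse_dashes` are made up for illustration, and nothing here is taken from a real tokenizer or training pipeline.

```python
# Hypothetical pre-tokenization step: collapse every dash-like character
# into a plain ASCII hyphen. An assumption for illustration only, not a
# description of what any actual model's pipeline does.
DASH_LIKE = (
    "\u2010"  # hyphen
    "\u2011"  # non-breaking hyphen
    "\u2012"  # figure dash
    "\u2013"  # en dash
    "\u2014"  # em dash
    "\u2015"  # horizontal bar
    "\u2212"  # minus sign
)

# Translation table mapping each dash-like character to "-".
_TO_HYPHEN = str.maketrans({ch: "-" for ch in DASH_LIKE})


def collapse_dashes(text: str) -> str:
    """Return text with all dash-like characters replaced by '-'."""
    return text.translate(_TO_HYPHEN)


if __name__ == "__main__":
    sample = "an aside \u2014 like this one \u2013 and a range 1\u20135"
    print(collapse_dashes(sample))
```

Whether anything like this actually runs before tokenization is exactly the part I can't vouch for.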