The "book scanning" hypothesis doesn't sound so bad, but couldn't it simply be OCR bias? I imagine it's pretty easy for OCR software to misrecognize hyphens or other kinds of dashes as em-dashes if the only distinction is a subtle difference in the length of the stroke.
You'd think context-less OCR would prefer to read an ambiguous dash as a simple hyphen, since that's the most common dash. It seems unlikely any bias would go the other way, toward em-dashes.