
pbhjpbhj today at 2:11 PM

You almost don't want [super-]word-level ML (i.e. word-pair/phrase/sentence/document/corpus level).

In transcription, you want near certainty, or you want a marker that the word could not be read with certainty. Yes, context lets you guess, but for some OCR you want to know when a reading is a guess based on something other than the letters, in order, forming a word.

For example, in a census document on familysearch.com the transcriber "corrected" a name to Joseph. The literal letters in the handwritten document spell Josepth ... and sure enough that's a local variant spelling (Eire).

In another document the writer used "Joh" as an abbreviation, and a [human, I assume] transcriber put that as John ... which is the most likely reading, but happens to be wrong.

Sometimes you care that it's guessed, sometimes you want just the best guess.
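
To make that concrete, here's a rough sketch of what I mean, not how any real OCR tool works. It assumes a hypothetical reader that gives you, per word, the literal letters, a confidence score, and optionally a context-based guess. `Token`, `render`, the 0.9 threshold and the bracket notation are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    literal: str                 # letters as read from the page, in order
    confidence: float            # hypothetical reader confidence in [0, 1]
    guess: Optional[str] = None  # context-based suggestion, if any


def render(tokens, threshold=0.9, best_guess=False):
    """Join tokens into text, either flagging guesses or silently applying them."""
    out = []
    for t in tokens:
        if best_guess:
            # Caller only wants the most likely reading.
            out.append(t.guess or t.literal)
        elif t.confidence < threshold:
            # Letters themselves are uncertain: mark the word as unread.
            out.append(f"[{t.literal}?]")
        elif t.guess and t.guess != t.literal:
            # Letters are clear but context disagrees: keep the letters, show the guess.
            out.append(f"{t.literal} [sic; {t.guess}?]")
        else:
            out.append(t.literal)
    return " ".join(out)


# Illustrative values only; the confidences are invented.
tokens = [
    Token("Josepth", 0.97, "Joseph"),
    Token("Joh", 0.95, "John"),
    Token("Smith", 0.99),
]
print(render(tokens))                   # literal letters, guesses flagged
print(render(tokens, best_guess=True))  # best guess only
```

The point is just that the literal reading and the guess are stored separately, so the "do I care that it's guessed" decision is made at render time instead of being baked into the transcription.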


Replies

messe today at 3:15 PM

> Eire

A nitpick, because it's often a dogwhistle: almost nobody in Ireland calls it that when speaking English. And it's still incorrect in Irish; the correct spelling is Éire.
