Hacker News

mc32 yesterday at 6:20 PM

True, but as with regular document-scanning software, there can be errors in detection.


Replies

dleeftink yesterday at 6:38 PM

Just as with redacted documents (consistently blocked terms) or bad OCR jobs (wrong or missing characters), even if only a certain percentage comes out unmangled, that is more readable than having no data at all.

A stable base corpus and some dynamic programming will allow you to clean up the remainder[0].

[0]: http://stackoverflow.com/a/11642687/2449774
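
To make the "base corpus plus dynamic programming" idea concrete, here is a minimal Python sketch of one possible cleanup step: re-splitting a run of characters into the cheapest sequence of known words. It is an illustration under assumptions, not necessarily what the linked answer does. The names `build_costs`, `segment`, and `unknown_cost` are made up for this example, and the tiny corpus stands in for a real, stable one.

```python
import math
from collections import Counter

def build_costs(corpus_text):
    """Map each corpus word to a cost; rarer words cost more (hypothetical helper)."""
    counts = Counter(corpus_text.lower().split())
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}

def segment(text, costs, unknown_cost=20.0):
    """Split text into the cheapest sequence of pieces using dynamic programming."""
    max_len = max(map(len, costs))
    # best[i] holds (total cost, start index of last piece) for text[:i]
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            # Unknown single characters get a flat penalty so the search never
            # dead-ends on mangled input; longer unknown pieces are disallowed.
            fallback = unknown_cost if i - j == 1 else math.inf
            candidates.append((best[j][0] + costs.get(piece, fallback), j))
        best.append(min(candidates))
    # Walk the stored start indices backwards to recover the pieces.
    pieces, i = [], len(text)
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]

# Toy stand-in for a stable base corpus (assumption for this sketch).
costs = build_costs("the quick brown fox jumps over the lazy dog the end")
print(segment("thequickbrownfox", costs))
```

Characters that match no corpus word come back as single-character pieces, so mangled spots stay visible instead of being silently dropped. A real cleanup pass would need a much larger corpus and probably an edit-distance step on top of this.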
