ahaspel, yesterday at 7:16 PM (1 reply)

I'm familiar with the Synopticon, which would be fun to structure.

I didn’t do OCR myself, except for the topic index and to fill in a few gaps. I started from existing Wikisource text and then built a pipeline around that: cleaning (headers, hyphenation, etc.), detecting article boundaries, reconstructing sections, and linking things back to the original page images. Most of the effort went into rendering the complex layouts and handling the cross-linking, not the initial ingestion.
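
To give a rough idea of what the cleaning step (headers, hyphenation) can look like, here is a minimal Python sketch. It is not the actual pipeline: the function names, the sample text, and the running-header regex are all hypothetical, and real page text would need more careful rules (for example, words that are legitimately hyphenated at a line end).

```python
import re

# Hypothetical pattern for a running header: all-caps words followed by a page number.
HEADER_RE = re.compile(r"^[A-Z ]+\d+$")


def strip_running_headers(lines, header_re=HEADER_RE):
    """Drop lines that look like running page headers."""
    return [ln for ln in lines if not header_re.match(ln.strip())]


def join_hyphenated(lines):
    """Merge a line ending in '-' with the next line when that line starts lowercase."""
    out = []
    for ln in lines:
        ln = ln.rstrip()
        if out and out[-1].endswith("-") and ln[:1].islower():
            # Remove the trailing hyphen and glue the continuation on.
            out[-1] = out[-1][:-1] + ln
        else:
            out.append(ln)
    return out


def clean_page(lines):
    """Run both cleaning passes: headers first, then hyphenation."""
    return join_hyphenated(strip_running_headers(lines))


page = ["RUNNING HEADER 123", "The encyclo-", "pedia was large."]
cleaned = clean_page(page)
```

Stripping headers before rejoining words matters here: a header sitting between the two halves of a hyphenated word would otherwise block the merge.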

Glad to go into more detail if you’re interested, but that’s the gist of it.


Replies

peterldowns, yesterday at 8:01 PM

Ah, OK, thanks very much!