Hacker News

peterldowns, yesterday at 7:10 PM

I've been meaning to build ~exactly this experience, but for the 1952 Encyclopedia Britannica Great Books of the Western World collection and its experimental index, the Syntopicon [0]. Would love to know more about how you OCR'd or otherwise ingested and parsed the raw material. I have a physical copy of the books, and I found some samizdat raw-image scans and started working on a custom OCR pipeline, but I'm wondering if maybe I could learn from your approach...

[0] https://en.wikipedia.org/wiki/A_Syntopicon


Replies

ahaspel, yesterday at 7:16 PM

I'm familiar with the Syntopicon, which would be fun to structure.

I didn't do OCR myself, except for the topic index and to fill in a few gaps. I started from existing Wikisource text and then built a pipeline around that: cleaning (running headers, hyphenation, etc.), detecting article boundaries, reconstructing sections, and linking things back to the original page images. Most of the effort went into rendering the complex layouts and handling the cross-linking, not the initial ingestion.
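The cleaning step is conceptually simple, for what it's worth. A minimal sketch of that part (function name and heuristics are mine, not the actual code — real page text needs more cases, e.g. legitimate hyphens and headers that vary per page):

```python
import re

def clean_page(lines, running_header=None):
    """Clean one page of extracted text: drop standalone page numbers
    and the running header, then rejoin words hyphenated across
    line breaks."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # Standalone page numbers ("341") carry no content.
        if stripped.isdigit():
            continue
        # Drop the repeated running header, if we know what it is.
        if running_header and stripped.upper() == running_header.upper():
            continue
        kept.append(stripped)
    text = "\n".join(kept)
    # Rejoin "experi-\nment" style end-of-line hyphenation.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text
```

Article-boundary detection and cross-linking are where it gets hairy; those depend heavily on the layout conventions of the particular edition.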

Glad to go into more detail if you’re interested, but that’s the gist of it.

zozbot234, yesterday at 7:21 PM

That collection is not in the public domain, AIUI? You might be able to do it for the Harvard Classics, which has a nice collection-wide index of terms. https://en.wikisource.org/wiki/The_Harvard_Classics has links to the scans.
