Hacker News

direwolf20, today at 11:18 AM

Unfortunately this is the bulk of search engine work. Recursive scraping is easy in comparison, even with CAPTCHA bypassing. You either limit the index to only highly relevant sites (as Marginalia does) or you must work very hard to separate the spam from the ham. And spam in one search may be ham in another.


Replies

saltysalt, today at 11:38 AM

I limit it to highly relevant curated seed sites, and don't allow public submissions. I'd rather have a small high-quality index.

You are absolutely right, it is the hardest part!
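
For illustration, restricting a crawl to curated seed sites could look roughly like the sketch below. This is not saltysalt's actual implementation: the `SEED_HOSTS` list and the function names `in_scope` and `filter_links` are hypothetical, and it assumes that scoping by hostname (including subdomains of a seed) is what "limiting to seed sites" means here.

```python
from urllib.parse import urlsplit

# Hypothetical curated seed hosts; a real index would maintain its own list.
SEED_HOSTS = {"example.org", "docs.example.com"}


def in_scope(url: str, seed_hosts: set[str] = SEED_HOSTS) -> bool:
    """Return True if the URL's host is a seed host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    # Match on a dot boundary so "spam-example.org" does not pass as "example.org".
    return any(host == seed or host.endswith("." + seed) for seed in seed_hosts)


def filter_links(links: list[str]) -> list[str]:
    """Keep only discovered links that stay within the curated seed hosts."""
    return [link for link in links if in_scope(link)]
```

With no public submissions, the only way a host enters the crawl is by being added to the seed list by hand, so every link found while scraping would pass through something like `filter_links` before being queued.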