
chadwebscraper · yesterday at 9:19 PM

I appreciate the response! I also agree - happy to add some clarity to this stuff.

Bot protection - this is handled in a few ways. The basic form bypasses most bot protections, and that's what you can use on the site today. For tougher sites, it actually solves the bot protection challenges (think DataDome, Akamai, Incapsula).

The consistency part is ongoing, but it's possible to check the diffs of the content extractions, notice when something has changed, and "reindex" the site.
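To illustrate the idea, here's a minimal sketch of that kind of diff check (not the actual implementation; the fingerprint store is just an in-memory dict for the example):

```python
import hashlib

def content_fingerprint(extracted_text: str) -> str:
    """Hash the extracted content so changes can be detected cheaply."""
    normalized = " ".join(extracted_text.split())  # collapse whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def needs_reindex(url: str, new_text: str, fingerprints: dict) -> bool:
    """Compare a fresh extraction against the stored fingerprint for that URL."""
    new_fp = content_fingerprint(new_text)
    if fingerprints.get(url) != new_fp:
        fingerprints[url] = new_fp
        return True  # extraction changed since the last crawl -> reindex
    return False
```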

100k URLs is a lot! It could support that, but the initial indexing would be heavy. It's fairly resource-efficient (no browsers). For scale, it's doing about 40k scrapes a day right now.

Appreciate the comments, happy to dive deeper into the implementation, and I agree with everything you've said. Still iterating and trying to improve it.


Replies

codingdave · yesterday at 11:48 PM

Re-indexing seems sub-optimal. I can't think of a use case where people care if the design changes. Even some content changes are not going to be interesting. Someone corrected a typo, updated punctuation, that kind of thing... such things are just noise if you are trying to react to content changes.

Your system needs to know not only what changed, but whether or not it matters. Splitting meaningful content from irrelevant noise is exceedingly important. If you know that, you do not need to re-index because you can diff only the meaningful content.
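Roughly the kind of check I mean, as a sketch (the normalization and the threshold are placeholders, not tuned values; it assumes you already extract the main text for the old and new versions):

```python
import difflib
import re

def normalize(text: str) -> str:
    """Drop punctuation and case so trivial formatting edits look identical."""
    return re.sub(r"[^\w\s]", "", text).lower()

def is_meaningful_change(old_text: str, new_text: str, threshold: float = 0.98) -> bool:
    """Flag a change only if the normalized word sequences differ enough;
    a fixed typo or a punctuation tweak stays above the threshold."""
    ratio = difflib.SequenceMatcher(
        None, normalize(old_text).split(), normalize(new_text).split()
    ).ratio()
    return ratio < threshold  # below the threshold -> treat as a real content change
```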

As for the 100K URLs: each URL has between 200 and 1,000 sub-pages beneath the top-level page. They all need to be periodically scanned for updates, while capturing that distinction of noise vs. meaningful change. I've actually got code that does the needed work - it is scaling it up to that level that I didn't want to take on.

I'm not sure what you mean by no browsers. My existing scraper uses headless browsers, in order to capture JavaScript-driven content and navigate through a SPA without having to re-load at every URL change. If you are not using even a headless browser, how are you getting dynamic content?
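For context, the headless-browser pattern I'm describing is roughly this (a minimal Playwright sketch; the URL and selectors are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/app")       # placeholder SPA entry point
    page.click("a[data-nav='reports']")        # in-app navigation, no full page reload
    page.wait_for_selector(".report-list")     # wait for the JS-rendered content
    content = page.inner_text(".report-list")
    browser.close()
```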
