
asphero today at 1:12 AM

Interesting approach. The scraper-vs-site-owner arms race is real.

On the flip side of this discussion - if you're building a scraper yourself, there are ways to be less annoying:

1. Run locally instead of from cloud servers. Most aggressive blocking targets VPS IPs. A desktop app using the user's home IP looks like normal browsing.

2. Respect rate limits and add delays. Obvious but often ignored.

3. Use RSS feeds when available - many sites leave them open even when blocking scrapers. (Rough sketch of 2 and 3 after this list.)
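
A minimal sketch of points 2 and 3 in Python - the feed URL, delay range, and user-agent string are placeholders you'd tune per site:

    import random
    import time
    import urllib.request

    # Placeholder - check the site's <link rel="alternate"> or /feed for the real one
    FEED_URL = "https://example.com/feed.xml"

    def polite_get(url: str, min_delay: float = 2.0, max_delay: float = 5.0) -> bytes:
        # Randomised pause before every request so traffic doesn't arrive in bursts
        time.sleep(random.uniform(min_delay, max_delay))
        req = urllib.request.Request(
            url,
            headers={"User-Agent": "MyTool/1.0 (+mailto:you@example.com)"},  # identify yourself
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()

    # Prefer the feed endpoint over crawling individual HTML pages
    feed_xml = polite_get(FEED_URL)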

I built a Reddit data tool (search "reddit wappkit" if curious) and the "local IP" approach basically eliminated all blocking issues. Reddit is pretty aggressive against server IPs but doesn't bother home connections.

The porn-link solution is creative though. Fight absurdity with absurdity I guess.


Replies

socialcommenter today at 12:45 PM

Without wanting to upset anyone - what makes you interested in sharing tips for team scraper?

(Overgeneralising a bit) site owners are mostly acting for public benefit, whereas scrapers act for their own benefit / private interests.

I imagine most people would land on team site-owner, if they were asked. I certainly would.

P.S. is the best way to scrape fairly just to respect robots.txt?
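
(If it is, checking it is at least cheap - roughly, in Python, with the bot name and URLs as placeholders:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("MyBot/1.0", "https://example.com/some/page"):
        # Also honour Crawl-delay if the site sets one (may be None)
        delay = rp.crawl_delay("MyBot/1.0")
    else:
        pass  # disallowed - skip this URL

though robots.txt alone is probably necessary but not sufficient.)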

rhdunn today at 8:57 AM

Plus simple caching to not redownload the same file/page multiple times.
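
A hand-rolled version of this needs only the stdlib - the cache directory and URL-hash keying below are arbitrary choices, and a real implementation should also honour ETag/Cache-Control for freshness:

    import hashlib
    import os
    import urllib.request

    CACHE_DIR = "cache"  # placeholder directory

    def fetch_cached(url: str) -> bytes:
        # Key cache files by a hash of the URL
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest())
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()  # cache hit: no network request at all
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = resp.read()
        with open(path, "wb") as f:
            f.write(data)
        return data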

It should also be easy to detect a Forgejo, Gitea, or similar hosting site, locate the git URL, and clone the repo.
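
Roughly like this - /api/v1/version is the standard Gitea/Forgejo API path, but anonymous access to it is an assumption, as is the usual owner/repo clone-URL layout:

    import json
    import subprocess
    import urllib.request

    def looks_like_gitea_or_forgejo(base_url: str) -> bool:
        # Gitea and Forgejo both serve /api/v1/version; anonymous access assumed
        try:
            with urllib.request.urlopen(f"{base_url}/api/v1/version", timeout=10) as resp:
                return "version" in json.load(resp)
        except Exception:
            return False

    def clone_instead_of_crawling(base_url: str, owner: str, repo: str) -> None:
        # One shallow clone replaces thousands of per-page fetches
        subprocess.run(
            ["git", "clone", "--depth=1", f"{base_url}/{owner}/{repo}.git"],
            check=True,
        )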