Kinda off-topic, but has anyone figured out how archive.today manages to bypass paywalls so reliably...

xurukefi • today at 9:15 PM • 5 replies

Kinda off-topic, but has anyone figured out how archive.today manages to bypass paywalls so reliably? I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous. I figured that they have found an (automated) way to imitate Googlebot really well.

Replies

jsheard • today at 9:53 PM

> I figured that they have found an (automated) way to imitate Googlebot really well.

If a site (or the WAF in front of it) knows what it's doing then you'll never be able to pass as Googlebot, period, because the canonical verification method is a DNS lookup dance (reverse lookup on the IP, then forward lookup on the resulting hostname) which can only succeed if the request came from one of Googlebot's dedicated IP addresses. Bingbot is the same.
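For anyone who hasn't seen the DNS dance, here is a minimal sketch of it in Python. The function name, the injectable lookup parameters, and the exact hostname suffixes are my own assumptions for illustration; check the suffix list against Google's current crawler documentation before relying on it.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify against Google's docs.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip):
    # PTR lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    # Forward lookup: hostname -> set of IP address strings.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if ip passes the reverse-then-forward DNS check.

    The lookup functions are injectable (a hypothetical convenience) so the
    logic can be exercised without touching the network.
    """
    reverse_lookup = reverse_lookup or _reverse
    forward_lookup = forward_lookup or _forward

    # Step 1: the reverse lookup must yield a hostname under a Google domain.
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False
    if not hostname.endswith(GOOGLE_CRAWLER_SUFFIXES):
        return False

    # Step 2: the forward lookup of that hostname must include the original
    # IP. Anyone can set a PTR record claiming to be googlebot.com, but only
    # Google controls what googlebot.com hostnames resolve to.
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

Spoofing the User-Agent header does nothing against this, since the check never looks at headers, only at the connecting IP.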


Aurornis • today at 9:32 PM

> I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous.

The curious part is that they allow web scraping arbitrary pages on demand. So a publisher could put in a lot of arbitrary requests to archive their own pages and see whether they all come from a single account or a small subset of accounts.
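A rough sketch of what that check might look like on the publisher's side, assuming an access log where each entry records the logged-in account and the requested URL. The field names `account` and `url`, the function name, and the sample data are all hypothetical.

```python
from collections import Counter


def accounts_fetching(access_log, probe_urls):
    """Count how often each account fetched one of the probe URLs.

    access_log: iterable of dicts with hypothetical keys "account" and "url".
    probe_urls: the pages the publisher deliberately submitted for archiving.
    """
    probes = set(probe_urls)
    return Counter(
        entry["account"] for entry in access_log if entry["url"] in probes
    )


# Made-up sample data, purely for illustration.
log = [
    {"account": "sub-1001", "url": "/story/a"},
    {"account": "sub-2002", "url": "/story/b"},
    {"account": "sub-1001", "url": "/story/b"},
    {"account": "sub-3003", "url": "/unrelated"},
]
counts = accounts_fetching(log, ["/story/a", "/story/b"])
```

If the same handful of accounts dominates the counts across many obscure probe pages, that would point toward paid accounts; a wide spread of accounts would not.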

I hope they haven't been stealing cookies from actual users through a botnet or something.


elzbardico • today at 9:22 PM

> which is, of course, ridiculous.

Why? In the world of web scraping this is pretty common.


tonymet • today at 9:21 PM

I’m an outsider with experience building crawlers. You can get pretty far with residential proxies and browser fingerprint optimization. Most of the b-tier publishers use RBC and heuristics that can be “worked around” with moderate effort.


layer8 • today at 9:46 PM

It’s not reliable, in the sense that there are many paywalled sites that it’s unable to archive.

