Scraping is hard. Very good scraping is even harder. And today, being a scraping business is veeery ...

KellyCriterion • yesterday at 7:47 PM • 2 replies • view on HN

Scraping is hard. Very good scraping is even harder. And today, being a scraping business is veeery difficult; there are some "open"/public indices, but none of these other indices ever took off

Replies

renegat0x0 • yesterday at 9:20 PM

Scraping is hard, and is not hard that much at the same time. There are many projects about scraping, so with a few lines you can do implement scraper using curl cffi, or playwright.

People complain that user-agent need to be filled. Boo-hoo, are we on hacker news, or what? Can't we just provide cookies, and user-agent? Not a big deal, right?

I myself have implemented a simple solution that is able to go through many hoops, and provide JSON response. Simple and easy [0].

On the other hand it was always an arms race. It will be. Eventually every content will be protected via walled gardens, there is no going around it.

Search engines affect me less, and less every day. I have my own small "index" / "bookmarks" with many domains, github projects, youtube channels [1].

Since the database is so big, the most used by me places is extracted into simple and fast web page using SQLite table [2]. Scraping done right is not a problem.

[0] https://github.com/rumca-js/crawler-buddy

[1] https://github.com/rumca-js/Internet-Places-Database

[2] https://rumca-js.github.io/search

➕ show 2 replies

ghm2199 • yesterday at 8:00 PM

Well sure yes, I don't contend with the fact that its hard, but if the top tech companies joined their heads I am sure if for example, Meta, Apple, MS have enough talent between to make an open source index if only to reap gains from the de-monopolization of it all.

➕ show 2 replies

alt Hacker News

Replies