I do not understand why the scrapers do not do it in a smarter way: clone the repositories and fetch from them on a daily or so basis. I have witnessed one going through every single blame and log link across all branches and redoing it every few hours! It sounds like they did not even try to optimize their scrapers.
Because that kind of optimization takes effort. And a lot of it.
Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process the cloned data, mark it for re-indexing, and then keep re-indexing the site itself, but only for the things that aren't included in the repo itself, like issues and pull request messages.
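Concretely, that special-casing could look something like the sketch below. This is illustrative only, not anyone's real crawler: the repo-URL pattern assumes a GitHub-style https://host/user/repo layout where the clone URL is the page URL plus ".git", and the index_* helpers are hypothetical placeholders for whatever the indexer actually does.

    # Sketch of "clone instead of crawl" for Git web interfaces.
    # Assumes `git` is installed; URL pattern and helpers are hypothetical.
    import re
    import subprocess
    from pathlib import Path

    REPO_PAGE_RE = re.compile(r"^https://[^/]+/[^/]+/[^/]+/?$")  # e.g. https://host/user/repo

    def mirror_repo(repo_url: str, cache_dir: Path) -> Path:
        """Clone once; on later runs just fetch, which is far cheaper than re-crawling."""
        dest = cache_dir / repo_url.rstrip("/").split("/")[-1]
        if dest.exists():
            subprocess.run(["git", "-C", str(dest), "fetch", "--all", "--prune"], check=True)
        else:
            cache_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(
                ["git", "clone", "--mirror", repo_url.rstrip("/") + ".git", str(dest)],
                check=True,
            )
        return dest

    def index_repo_contents(mirror: Path) -> None:
        # Placeholder: read log/blame/tree data straight from the local clone
        # (e.g. `git -C <mirror> log --all`) instead of hammering the web UI.
        print(f"indexing repo data from {mirror}")

    def index_web_only_pages(repo_url: str) -> None:
        # Placeholder: crawl only what the repo does not contain,
        # e.g. issues and pull request discussions.
        print(f"crawling issues/PRs under {repo_url}")

    def index_site(url: str, cache_dir: Path = Path("./mirrors")) -> None:
        if REPO_PAGE_RE.match(url):
            mirror = mirror_repo(url, cache_dir)
            index_repo_contents(mirror)
            index_web_only_pages(url)
        else:
            print(f"not a repo page, fall back to ordinary crawling: {url}")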
The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.
The way most scrapers work (I've written plenty of them) is that you basically get the page, grab all the links, and drill down.
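In spirit, something like this (an illustrative sketch only; requests and BeautifulSoup stand in for whatever HTTP and HTML libraries a given scraper actually uses):

    # Naive "get the page, grab the links, drill down" crawler.
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def naive_crawl(start_url: str, max_pages: int = 1000) -> set[str]:
        seen: set[str] = set()
        queue = deque([start_url])
        host = urlparse(start_url).netloc
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            # No caching, no understanding of what the site is: every blame,
            # log, and branch link on a Git web UI is just another URL to fetch.
            for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if urlparse(link).netloc == host and link not in seen:
                    queue.append(link)
        return seen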
Because they don't have any reason to give any shits. 90% of their collected data is probably completely useless, but they don't have any incentive to stop collecting useless data, since their compute and bandwidth are completely free (someone else pays for it).
They don't even use the Wikipedia dumps. They're extremely stupid.
Actually there's not even any evidence they have anything to do with AI. They could be one of the many organisations trying to shut down the free exchange of knowledge, without collecting anything.
> I do not understand why the scrapers do not do it in a smarter way
If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S), without any specific optimisations using other protocols. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, the web scraping generally proves to be sufficient, so any push to write git-specific optimisations for the bots would come from academic interest rather than actual need.
If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.
If you mean scrapers in terms of the people writing them, then the fact that plain web scraping is sufficient, as mentioned above, is likely the significant factor.
> why the scrapers do not do it in a smarter way
A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick lazy gain for themselves and don't care (or don't even have the foresight to realise) how much it might inconvenience¹ anyone else.
----
[0] The fact that this load might be inconvenient to you is immaterial to the scraper.
[1] The ones that do realise that they might cause an inconvenience usually take the view that it is only a small one, and how can the inconvenience that little old them is imposing really be that significant? They don't take the extra step of considering how many other people like them are out there thinking the same thing. Or they think that if other people are already doing it, what is the harm in just one more? Or they just take the view “why should I care if getting what I want inconveniences anyone else?”.