> At some point haven't you scraped the whole thing?
Git forges will expose a version of every file at every commit in the project's history. If you have a medium-sized project consisting of, say, 1,000 files and 10,000 commits, that multiplies out to ten million potential URLs, so the crawler will identify a number of URLs on the same order of magnitude as all of English Wikipedia, just for that one project. This is also very expensive for the git forge, as it needs to reconstruct each historical file from a bunch of commits.
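To picture where those URLs come from, here's a hypothetical sketch of that per-commit address space; the GitHub-style `/blob/<commit>/<path>` layout and the example values are assumptions, and real forges differ in the details:

```python
from itertools import product

# Hypothetical example values; a real forge enumerates these from the repo.
commits = ["3f5c2a9d41e8...", "b07c6d12aa90..."]  # one entry per commit
files = ["README.md", "src/main.c"]               # one entry per file

# Every (commit, file) pair gets its own URL, so the crawlable
# surface grows multiplicatively with project size and history.
for commit, path in product(commits, files):
    print(f"https://forge.example/project/blob/{commit}/{path}")
```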
Git forges interact spectacularly poorly with naively implemented web crawlers, unless the crawler includes logic to avoid exhaustively crawling them. You honestly get a pretty long way by just excluding URLs with long base64-like path segments (commit hashes and similar machine-generated identifiers), which isn't hard to implement, but it's also not obvious.
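As a rough illustration of that exclusion heuristic, here's a minimal sketch in Python; the regexes, length thresholds, and function names are all assumptions and would need tuning against real forge URLs to keep false positives down:

```python
import re
from urllib.parse import urlsplit

# Full git commit hashes are 40 hex characters (SHA-1); requiring at
# least 12 keeps ordinary English words from matching by accident.
HEX_SEGMENT = re.compile(r"[0-9a-f]{12,64}", re.IGNORECASE)

# Base64-ish: long runs drawn from the base64/base64url alphabet.
B64_SEGMENT = re.compile(r"[A-Za-z0-9+/_=-]{20,}")

def looks_machine_generated(segment: str) -> bool:
    """Heuristic: does a path segment look like a hash or token?"""
    if HEX_SEGMENT.fullmatch(segment):
        return True
    if B64_SEGMENT.fullmatch(segment):
        # Demand a mix of letters and digits so long ordinary words
        # and hyphenated slugs don't trip the filter.
        return any(c.isdigit() for c in segment) and any(c.isalpha() for c in segment)
    return False

def should_crawl(url: str) -> bool:
    """Reject URLs whose path contains a hash-like segment, which on a
    git forge usually addresses a per-commit view of a file."""
    segments = urlsplit(url).path.split("/")
    return not any(looks_machine_generated(s) for s in segments if s)
```

With that in place, `should_crawl("https://forge.example/project/blob/3f5c2a9d41e8b07c6d12aa90fe34bc5d7e81f203/README.md")` comes back `False`, while the plain `/blob/main/README.md` view of the same file still gets crawled.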