Several people in the comments seem to be blaming GitHub for taking this step for no apparent reason.
Those of us who self-host git repos know that this is not true. Over at ardour.org, we've passed 1M unique IPs banned due to AI trawlers sucking our repository one commit at a time. It was killing our server before we put fail2ban to work.
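For anyone curious what that looks like, here's a minimal sketch of such a jail (the jail name, filter regex, and log path below are illustrative, not our exact production setup):

```ini
# /etc/fail2ban/filter.d/git-trawler.conf (hypothetical filter)
[Definition]
# Ban hosts that hammer per-commit URLs in the web server's access log
failregex = ^<HOST> .*"GET .*/commit/.* HTTP

# /etc/fail2ban/jail.d/git-trawler.conf (hypothetical jail)
[git-trawler]
enabled  = true
port     = http,https
filter   = git-trawler
logpath  = /var/log/nginx/access.log
maxretry = 20
findtime = 60
bantime  = 86400
```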
I'm not arguing that the specific steps GitHub has taken are the right ones. They might be, they might not, but they do help to address the problem. Our choice for now has been based on noticing that the trawlers are always fetching commits, so we tweaked things so that the overall HTTP-facing git repo works, but you cannot access commit-based URLs. If you want that, you need to use our GitHub mirror :)
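The URL-level block itself is just a few lines of web server config. A rough nginx-style sketch, assuming a cgit-like frontend where commit pages live under /commit/ (the real paths may differ):

```nginx
# Deny the per-commit pages that the trawlers walk one by one.
# Clones and fetches over smart HTTP go through /info/refs and
# /git-upload-pack, so they keep working.
location ~ /commit/ {
    return 403;
}
```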
> Several people in the comments seem to be blaming GitHub for taking this step for no apparent reason.
I mean...
* GitHub is owned by Microsoft.
* The reason for this step is AI crawlers.
* The reason AI crawlers exist in such masses is the absurd hype around LLM+AI technology.
* The reason for that is... ChatGPT?
* The main investor in OpenAI, the maker of ChatGPT, happens to be...?
Have you noticed significant slowdown and CPU usage from fail2ban with that many banned IPs? I saw it becoming a huge resource hog with far fewer IPs than that.
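For reference, the usual cause is the default ban action: one iptables rule per banned address, so every packet walks a linear chain. fail2ban ships set-based ban actions that replace the chain with a hash lookup. A sketch, reusing the hypothetical jail name from above (for counts near 1M you would also need to raise the set's maxelem):

```ini
# /etc/fail2ban/jail.local -- use one kernel ipset instead of one
# iptables rule per address; membership tests are hash-based
[git-trawler]
banaction = iptables-ipset-proto6-allports
# on nftables-based systems, a set-backed action does the same job:
# banaction = nftables-allports
```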
You mean AI crawlers from Microsoft, the owner of GitHub?
Surely most AI trawlers have special support for git and just clone the repo once?
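If they did, one mirror clone would move the entire history as a single negotiated packfile instead of millions of per-commit page hits. Something like this (the repository URL is a made-up example):

```sh
# One request/response cycle fetches every commit as a packfile,
# rather than one HTTP hit per commit page.
git clone --mirror https://git.example.org/ardour/ardour.git
```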
Except they didn't just start doing this now. For many years, GitHub has been crippling unauthenticated browsing, doing it gradually to gauge the response. When unauthenticated, code search doesn't work at all, and issue search stops working after about five clicks at best.
This is egregious behavior because Microsoft hasn't been upfront about it while doing it. Many open source projects are probably unaware that their issue tracker has been walled off, creating headaches unbeknownst to them.