A while ago I worked at an influencer marketing company, back when Instagram still allowed pretty much complete access through its API. Since we were indexing the entire Instagram universe for our internal tooling, we had a graph traversal setup that crawled Instagram profiles, then each of their followers, and so on. We had to keep track of visited profiles to avoid loops, and ran the entire scraping pipeline on an Apache Storm cluster. It worked, but it was cumbersome to operate and monitor, and it couldn’t reach our desired throughput.
Given there were only about a billion IG profiles in total at the time, I replaced the entire setup with a single Go script that iterated from 1 to a billion and tried to scrape every ID along the way. No graph, no visited set, no cluster. That gave us 10k requests per second on a single machine, which was more than enough.
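For flavor, here’s roughly what that sweep could look like: a worker pool draining a channel of IDs. This is a sketch, not the original script; the endpoint URL, worker count, and error handling are all placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

const (
	maxID   = 1_000_000_000 // roughly the total number of IG profiles at the time
	workers = 512           // tune for target throughput; placeholder value
)

// scrape fetches one profile by numeric ID. The URL is a hypothetical
// placeholder, not the actual Instagram API endpoint.
func scrape(id int64) {
	resp, err := http.Get(fmt.Sprintf("https://example.com/profiles/%d", id))
	if err != nil {
		return // transient error: skip, or retry in a real run
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return // nonexistent IDs just come back as errors; ignore them
	}
	// ... parse and index the profile here ...
}

func main() {
	ids := make(chan int64)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ids {
				scrape(id)
			}
		}()
	}

	// Feed every candidate ID; no visited set needed since we never revisit.
	for id := int64(1); id <= maxID; id++ {
		ids <- id
	}
	close(ids)
	wg.Wait()
}
```

The whole trick is that a dense, enumerable ID space turns a graph traversal problem into a dumb linear scan, and a linear scan needs no state beyond a counter.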
People forget that a billion rows isn’t big data anymore.