I train coding models with RLVR because that's what works. There's ~0.000x good signal in mailing lists that isn't in old mailing lists. (And since I can't reply to the other person: I mean old as in established; it's in no way a dig at lwn.)
You seem to be missing my point. There is zero incentive for AI training companies to behave like this. All that data is already in the common crawls that every lab uses. This is likely from other sources. Yet they always blame big bad AI...
Old scrapes can't have data about new things though; you have to continuously re-scan to avoid being stuck with ancient info.
some scrapers might skip already-scraped sources, but it's easy to imagine that some/many just wouldn't bother (you don't know if a page has been updated until you've checked, after all). And to some extent you do have to re-scrape, if only to find links to the new stuff.
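A rough sketch of that "you have to check to know" point: a polite crawler can use HTTP conditional requests (If-None-Match with a stored ETag) so an unchanged page costs almost nothing, but it's still a request the server has to handle. The `cache` dict and `fetch_page` callable here are hypothetical stand-ins, not any real scraper's API.

```python
# Hedged sketch: revisit logic for a crawler that remembers ETags.
# cache maps url -> {"etag": ..., "body": ...}; fetch_page is a
# hypothetical transport function returning (status, etag, body).

def revisit(url, cache, fetch_page):
    """Re-check a known URL, skipping the body download when unchanged."""
    headers = {}
    if url in cache and cache[url].get("etag"):
        # Ask the server to send the body only if it changed.
        headers["If-None-Match"] = cache[url]["etag"]
    status, new_etag, body = fetch_page(url, headers)
    if status == 304:  # Not Modified: reuse the cached copy
        return cache[url]["body"], False
    cache[url] = {"etag": new_etag, "body": body}
    return body, True
```

Even with this, discovering *new* pages still means re-fetching known index pages to follow fresh links, which is the re-scrape the comments above are describing.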