I’m going to presume good faith rather than trolling. Some questions for you:
1. Coding assistants have emerged as one of the primary commercial opportunities for AI models. As GP pointed out, LWN is the primary discussion forum for kernel development. If you were gathering training data for a model, coding assistance is one of your goals, and you know of one of the primary sources of open source development expertise, would you:
(a) ignore it because it’s in a quaint old format, or
(b) slurp up as much as you can?
2. If you’d previously slurped it up, and are now collating data for a new training run, and you know it’s an active mailing list that will have new content since you last crawled it, would you:
(a) carefully and respectfully leave it be, because you still get benefit from the previous content even though there’s now more and it’s up to date, or
(b) hoover up every last drop because anything you can do to get an edge over your competitors means you get your brief moment of glory in the benchmarks when you release?
I train coding models with RLVR because that's what works. There's ~0.000x good signal in new mailing list traffic that isn't already in the old mailing lists. (And, since I can't reply to the other person: I mean old as in established; it is in no way a dig at LWN.)
You seem to be missing my point. There is zero incentive for AI training companies to behave like this. All that data is already in the Common Crawl dumps that every lab uses. This traffic is likely from other sources. Yet they always blame big bad AI...