Anna’s Archive has files specifically for training LLMs. But I’d guess the big players secured their share beforehand by scraping those sites. I have zero proof; it’s just a guess.