logoalt Hacker News

kankerlijeryesterday at 1:52 AM0 repliesview on HN

Well, perhaps this is naive of me from the perspective of not fully understanding the training process. However, at some point, with all available training data having been exhausted, gains with synthetic data exhausted, and a large pool of publicly available AI generated code, at what point is it 'smart' to scrape codebases from what you identify as high quality code based, clean it up to remove identifiers, and use that for training a smaller model?