Hacker News

matt-p · yesterday at 1:12 AM

Even if they were doing this (which I highly doubt), so much would be lost to distillation that I'm not convinced much would actually make it through, apart from perhaps internal codenames or the like, which would be obvious.
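
For context, here's a minimal sketch of the standard soft-label distillation loss (after Hinton et al.), assuming a PyTorch setup; it's meant to illustrate the point above, since the student only ever sees the teacher's output distribution, not the teacher's training text, so specific strings are unlikely to survive intact.

    # Hedged sketch: soft-label knowledge distillation loss.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions, then match them with KL divergence.
        # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
        t = temperature
        soft_teacher = F.softmax(teacher_logits / t, dim=-1)
        log_soft_student = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)

    # Toy usage: a batch of 4 examples over a 10-way output space.
    student = torch.randn(4, 10, requires_grad=True)
    teacher = torch.randn(4, 10)
    loss = distillation_loss(student, teacher)
    loss.backward()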


Replies

kankerlijer · yesterday at 1:52 AM

Well, perhaps this is naive of me, since I don't fully understand the training process. But at some point, with all available training data exhausted, gains from synthetic data exhausted, and a large pool of publicly available AI-generated code out there, when does it become 'smart' to scrape codebases you identify as high quality, clean them up to remove identifiers, and use that for training a smaller model?
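
To make the "clean it up to remove identifiers" step concrete, here's a rough sketch of what scrubbing could look like for Python source. This is an illustrative assumption about how such a step might work (scope handling is naive, and the names are made up), not anyone's actual pipeline.

    # Hedged sketch: replace identifiers in scraped Python code with generic
    # names before it goes into a training corpus. Requires Python 3.9+ for
    # ast.unparse. Real pipelines would need scope-aware renaming.
    import ast
    import builtins

    _BUILTINS = set(dir(builtins))

    class IdentifierScrubber(ast.NodeTransformer):
        def __init__(self):
            self.renames = {}

        def _alias(self, original):
            # Assign a stable generic name per original identifier.
            if original not in self.renames:
                self.renames[original] = f"sym_{len(self.renames)}"
            return self.renames[original]

        def visit_FunctionDef(self, node):
            node.name = self._alias(node.name)
            self.generic_visit(node)
            return node

        def visit_ClassDef(self, node):
            node.name = self._alias(node.name)
            self.generic_visit(node)
            return node

        def visit_arg(self, node):
            node.arg = self._alias(node.arg)
            return node

        def visit_Name(self, node):
            if node.id not in _BUILTINS:
                node.id = self._alias(node.id)
            return node

    source = """
    def billing_rate(customer_tier):
        base = 42
        return base * customer_tier
    """

    tree = IdentifierScrubber().visit(ast.parse(source.replace("\n    ", "\n")))
    print(ast.unparse(tree))  # identifiers replaced with sym_0, sym_1, ...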