logoalt Hacker News

kamranjontoday at 1:02 PM0 repliesview on HN

I thought this part of the write-up was interesting:

"This is, I think, in contradiction with the idea that LLMs are memorizing the whole training set and uncompress what they have seen. LLMs can memorize certain over-represented documents and code, but while they can extract such verbatim parts of the code if prompted to do so, they don’t have a copy of everything they saw during the training set, nor they spontaneously emit copies of already seen code, in their normal operation."

Can't things basically get baked into the weights when trained on enough iterations, and isn't this the basis for a lot of plagiarism issues we saw with regards to code and literature? It seems like this is maybe downplaying the unattributed use of open source code when training these models.