Hacker News

jacquesm · yesterday at 2:16 PM

There have already been multiple documented cases of LLMs spitting out fairly large chunks of their training corpus. There have been experiments to get models to replicate the entirety of 'Moby Dick', with some success for one model but less for others, most likely due to output filtering meant to prevent the generation of such texts. But that doesn't mean the texts aren't in there in some form. And how could they not be? An LLM is just a lossy compression mechanism; the degree of loss is not really all that relevant to the discussion.


Replies

ndriscoll · yesterday at 2:23 PM

Are you referring to this?

https://osyuksel.github.io/blog/reconstructing-moby-dick-llm...

I see a test where one model managed to reproduce a paragraph at roughly 85% similarity, given 3 input paragraphs as a prompt, less than 50% of the time.

So it can't even produce 1 paragraph given 3 as input, and it doesn't even get close half the time.

"Contains Moby Dick" would be something like you give it the first paragraph and it produces the rest of the book. What we have here instead is a statistical model that when given passages can do an okay job at predicting a sentence or two, but otherwise quickly diverges.
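To make that bar concrete: "reproduce" in tests like the one linked above typically means a similarity score against the reference text, not an exact match. A minimal sketch of such a check, using Python's `difflib` as a stand-in for whatever metric the linked post actually used (the threshold and texts here are illustrative, not from the post):

```python
import difflib

def reproduction_score(model_output: str, reference: str) -> float:
    """Similarity ratio between a model's continuation and the true text.

    difflib's ratio (2*matches / total length) is a crude stand-in for
    whatever matching metric the experiment used; 1.0 is verbatim.
    """
    return difflib.SequenceMatcher(None, model_output, reference).ratio()

# Hypothetical continuations: one verbatim, one that starts on-script
# but "quickly diverges" and so falls well below an ~0.85 threshold.
reference = "Call me Ishmael. Some years ago, never mind how long precisely."
verbatim = reference
diverged = "Call me Ishmael. It was a dark and stormy night on the Pequod."

assert reproduction_score(verbatim, reference) == 1.0
assert reproduction_score(diverged, reference) < 0.85
```

Under a scoring scheme like this, a model only counts as "containing" the passage if its continuation stays near-verbatim for the whole length of the reference, which is exactly what the divergence after a sentence or two rules out.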
