Hacker News

jacquesm · yesterday at 2:16 PM

There have already been multiple documented cases of LLMs spitting out fairly large chunks of their training corpus. There have been experiments to get models to replicate the entirety of 'Moby Dick', with some success for one model but less for others, most likely due to output filtering meant to prevent the generation of such texts. But that doesn't mean the texts aren't in there in some form. And how could they not be? An LLM is just a lossy compression mechanism; the degree of loss is not really all that relevant to the discussion.


Replies

ndriscoll · yesterday at 2:23 PM

Are you referring to this?

https://osyuksel.github.io/blog/reconstructing-moby-dick-llm...

I see a test where one model managed to reproduce a paragraph at roughly 85% similarity, given 3 input paragraphs as a prompt, less than 50% of the time.

So it can't even produce 1 paragraph given 3 as input, and it doesn't even get close half the time.

"Contains Moby Dick" would be something like you give it the first paragraph and it produces the rest of the book. What we have here instead is a statistical model that when given passages can do an okay job at predicting a sentence or two, but otherwise quickly diverges.
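To make that bar concrete: "reproduce" in tests like the one linked above typically means a similarity score against the reference text, not an exact match. A minimal sketch of such a check, using Python's `difflib` as a stand-in for whatever metric the linked post actually used (the threshold and texts here are illustrative, not from the post):

```python
import difflib

def reproduction_score(model_output: str, reference: str) -> float:
    """Similarity ratio between a model's continuation and the true text.

    difflib's ratio (2*matches / total length) is a crude stand-in for
    whatever matching metric the experiment used; 1.0 is verbatim.
    """
    return difflib.SequenceMatcher(None, model_output, reference).ratio()

# Hypothetical continuations: one verbatim, one that starts on-script
# but "quickly diverges" and so falls well below an ~0.85 threshold.
reference = "Call me Ishmael. Some years ago, never mind how long precisely."
verbatim = reference
diverged = "Call me Ishmael. It was a dark and stormy night on the Pequod."

assert reproduction_score(verbatim, reference) == 1.0
assert reproduction_score(diverged, reference) < 0.85
```

Under a scoring scheme like this, a model only counts as "containing" the passage if its continuation stays near-verbatim for the whole length of the reference, which is exactly what the divergence after a sentence or two rules out.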
