Hacker News

TZubiri today at 7:50 AM · 4 replies

Forgive the skepticism, but this translates directly to "we asked the model pretty please not to do it in the system prompt".


Replies

ComplexSystems today at 4:37 PM

The model doesn't know what its training data is, nor does it know what sequences of tokens appeared verbatim in there, so this kind of thing doesn't work.

ffsm8 today at 8:18 AM

It's mind-boggling if you think about the fact that they're essentially "just" statistical models.

It really contextualizes the old Pythagorean wisdom that everything can be represented as numbers, that math is the ultimate truth.

mikaraento today at 8:30 AM

That might be somewhat ungenerous unless you have more detail to provide.

I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction.
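
For concreteness, a minimal sketch of what such an output filter might look like: index every n-gram that occurs in the training corpus, then flag generations whose n-grams overlap the index too heavily. All names and thresholds here are illustrative assumptions, not any vendor's actual implementation.

```python
# Hypothetical output-similarity filter: flag a response that reproduces
# long verbatim spans from an indexed training corpus. A plain Python set
# stands in for the index; a real system would need something far more
# compact (see the Bloom filter idea in the reply below).

def ngrams(tokens, n=8):
    """Yield every consecutive n-token window of a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def build_index(training_texts, n=8):
    """Collect every n-gram that appears verbatim in the training data."""
    index = set()
    for text in training_texts:
        index.update(ngrams(text.split(), n))
    return index

def looks_verbatim(output, index, n=8, threshold=0.5):
    """Flag output when too many of its n-grams hit the training index."""
    grams = list(ngrams(output.split(), n))
    if not grams:
        return False
    hits = sum(1 for gram in grams if gram in index)
    return hits / len(grams) >= threshold
```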

efskap today at 8:34 AM

Would it really be infeasible to take a sample and do a search over an indexed training set? Maybe a Bloom filter could be adapted.
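
A Bloom filter does adapt naturally here, since it answers set-membership queries with zero false negatives and a tunable false-positive rate, at a tiny fraction of the memory an exact index would need. A minimal sketch, assuming hashed training n-grams as the items (the size and hash count below are arbitrary choices for illustration):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: 'no' answers are always correct, 'yes' answers
    are wrong with a small, tunable probability."""

    def __init__(self, size_bits=1 << 24, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Usage: insert training n-grams once, then probe each n-gram of a model's
# output. A miss definitively rules out verbatim reproduction of that span;
# a hit only warrants a slower exact check against the real corpus.
bloom = BloomFilter()
bloom.add("the quick brown fox jumps over the lazy")
assert "the quick brown fox jumps over the lazy" in bloom
```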
