forgive the skepticism, but this translates directly to "we asked the model pretty please not to do it in the system prompt"
It's mind-boggling if you think about the fact that they're essentially "just" statistical models
It really contextualizes the old wisdom of Pythagoras that everything can be represented as numbers / math is the ultimate truth
That might be somewhat ungenerous unless you have more detail to provide.
I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction.
Would it really be infeasible to take a sample and do a search over an indexed training set? Maybe a Bloom filter could be adapted, roughly along the lines of the sketch below.
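For what it's worth, here's a toy sketch of that idea in Python: hash training-set n-grams into a Bloom filter once, then score model output by how many of its n-grams hit the filter. Everything here is made up for illustration (the filter sizing, the 8-token n-gram length, the one-line "corpus"); a real training set would need sharding and a far bigger index, and a Bloom filter only gives you probable hits, so flagged output would still need an exact check against the indexed data.

    # Toy sketch: flag model output whose n-grams appear in an indexed training set.
    import hashlib
    from math import ceil, log

    class BloomFilter:
        def __init__(self, capacity: int, error_rate: float = 0.01):
            # Standard Bloom filter sizing formulas.
            self.m = ceil(-capacity * log(error_rate) / (log(2) ** 2))  # bits
            self.k = max(1, round((self.m / capacity) * log(2)))        # hash functions
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item: str):
            # Derive k bit positions from salted SHA-256 digests.
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def ngrams(text: str, n: int = 8):
        toks = text.split()
        return (" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))

    # Index the training corpus once (here: a single toy document).
    bf = BloomFilter(capacity=1_000_000)
    training_docs = ["the quick brown fox jumps over the lazy dog again and again"]
    for doc in training_docs:
        for g in ngrams(doc):
            bf.add(g)

    def fraction_seen(output: str, n: int = 8) -> float:
        """Rough overlap score: share of output n-grams that hit the filter."""
        grams = list(ngrams(output, n))
        if not grams:
            return 0.0
        return sum(g in bf for g in grams) / len(grams)

    # Verbatim reproduction of the indexed document scores 1.0.
    print(fraction_seen("the quick brown fox jumps over the lazy dog again and again"))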
The model doesn't know what its training data is, nor does it know what sequences of tokens appeared verbatim in there, so this kind of thing doesn't work.