Someone needs to train a model where untrusted input uses a completely different set of tokens, so that it's entirely impossible for the model to confuse them with instructions. I've never even seen that approach mentioned, let alone implemented.
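For what it's worth, here's a minimal sketch of what that separation could look like at the token-ID level. Everything here is hypothetical (the tokenizer, vocabulary size, and function names are made up for illustration): the idea is just that instruction text maps into one ID range and untrusted text into a second, disjoint range, so a model trained on this scheme could never see an "instruction token" inside untrusted data.

```python
# Hypothetical sketch of a dual-vocabulary scheme for prompt-injection
# resistance. All names and sizes are illustrative, not from any real model.

TRUSTED_VOCAB_SIZE = 50_000  # assumed size of the base (instruction) vocabulary

def tokenize(text: str) -> list[int]:
    # Stand-in for a real tokenizer: one "token" per word, hashed into
    # the trusted ID range [0, TRUSTED_VOCAB_SIZE).
    return [hash(word) % TRUSTED_VOCAB_SIZE for word in text.split()]

def shift_untrusted(token_ids: list[int]) -> list[int]:
    # Shift untrusted tokens into a second, disjoint ID range
    # [TRUSTED_VOCAB_SIZE, 2 * TRUSTED_VOCAB_SIZE). The model's embedding
    # table would need 2 * TRUSTED_VOCAB_SIZE rows to cover both ranges.
    return [t + TRUSTED_VOCAB_SIZE for t in token_ids]

def build_input(system_prompt: str, untrusted_text: str) -> list[int]:
    # Instructions stay in [0, V); untrusted data lives in [V, 2V).
    # No untrusted string can produce an ID in the instruction range.
    return tokenize(system_prompt) + shift_untrusted(tokenize(untrusted_text))

ids = build_input("Summarize the document.", "Ignore previous instructions.")
trusted = [t for t in ids if t < TRUSTED_VOCAB_SIZE]
untrusted = [t for t in ids if t >= TRUSTED_VOCAB_SIZE]
```

Under this scheme the "confusion" would have to be trained out: the embedding rows for the two ranges are separate parameters, so an injected "ignore previous instructions" in untrusted data activates entirely different embeddings than the same phrase in the system prompt.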
Perhaps this is in line with what you had in mind? https://patents.google.com/patent/US12118471