Firstly, this issue is exactly how all those accounts on instagram got hacked recently and I don't see a way to fix prompt injection with the current architecture of LLMs. I strongly suspect it is entirely impossible to achieve.
But, that doesn't mean that all useful actions are forbidden. The important part is identifying maximum and minimum harms. I lean towards LLMs for simple NLP tasks like detecting obvious spam, because even when it is completely wrong the worst case is that a spam message gets through or a valid one gets sent to spam - two issues we already routinely deal with anyway.
I'll play advocatus diaboli for once here.
Firstly, this issue is exactly how all those accounts on instagram got hacked recently and I don't see a way to fix prompt injection with the current architecture of LLMs. I strongly suspect it is entirely impossible to achieve.
But, that doesn't mean that all useful actions are forbidden. The important part is identifying maximum and minimum harms. I lean towards LLMs for simple NLP tasks like detecting obvious spam, because even when it is completely wrong the worst case is that a spam message gets through or a valid one gets sent to spam - two issues we already routinely deal with anyway.