> It seems to me like it's a fundamentally unsolvable architectural issue with LLMs.
Seems solved already? Exactly what the system/user division is about, and if that's not enough for you, use a model that has a developer/system/user divide.
Today's SOTA LLMs have pretty excellent following of these divisions, and the user "instructions", regardless if they're smuggled in, won't override the system ones.
The difficulty comes when you accept completely unreviewed/unchanged user-input as user messages, as your system/developer prompts needs to take this into account. You're better off to kind of whitelist what's possible rather than trying to prevent specific things, but seems that hasn't fully caught on yet.
It feels like people and organizations are still trying to discover what works or not, and there are huge gaps being being left open because there simply isn't enough understanding of the limitations and impact of what they make available to users. We're already seeing it in lots of places, feels like it won't get better before it gets worse.
> whitelist what's possible
Why do you need LLMs in the first place if you are whitelisting possible inputs?
You can use a much simpler and less costly system.
If it was solved, the bug like this would not happen.
It is also not always clear who is the user and how much they should be obeyed
> Today's SOTA LLMs have pretty excellent following of these divisions
Unfortunately "pretty excellent" is different from "perfect." I haven't kept track, but are you certain that given all possible inputs, the user prompt will never override the system prompt?
Those are strong claims, and unless there's been an advancement in the tech, it doesn't seem possible. Reinforcement learning might make it much less likely, but that's different from impossible.