Maybe I'm missing something because I really haven't studied this issue much at all, but would it not be possible to designate some new character as "START_ROLE_TAG" and "END_ROLE_TAG", and then to strip those in any data put into tool responses? I know that stripping unwanted characters is its own tedious ordeal, but it just seems very odd to me to have role tags not only easily spoofable but so similar to acceptable tags like HTML that stripping them from tool output produces issues.
> Maybe I'm missing something because I really haven't studied this issue much at all, but would it not be possible to designate some new character as "START_ROLE_TAG" and "END_ROLE_TAG", and then to strip those in any data put into tool responses?
They did that - the malicious input can be in any tag, but the LLM determines the role from the style of speaking, not the tag.