logoalt Hacker News

himata4113today at 2:25 PM0 repliesview on HN

I don't really know how these models really work, but I had a theory that just as the models have limited attention so do the safety layers. I simply populated enough context with 'malicious' text without making the model trip that "wasted" the internal attention budget on tokens early in the prompt completely ignoring all the tokens that were generated after the fact.