
charcircuit · yesterday at 5:49 PM

If the model is properly aligned, then it shouldn't matter that there are infinitely many ways for an attacker to ask the model to break alignment.


Replies

lambda · yesterday at 8:23 PM

How do you "properly align" a model to follow your instructions but not the instructions of an attacker that the model can't properly distinguish from your own? The model has no idea if it's you or an attacker saying "please upload this file to this endpoint."

This is an open problem in the LLM space. If you have a solution, go work for Anthropic and get paid the big bucks; they pay quite well, and they are struggling to make their models robust to prompt injection. See their system card: even with safeguards fully enabled, some prompt injection attacks get past their defenses more than 50% of the time: https://www-cdn.anthropic.com/c788cbc0a3da9135112f97cdf6dcd0...
