logoalt Hacker News

lambdayesterday at 8:23 PM1 replyview on HN

How do you "properly align" a model to follow your instructions but not the instructions of an attacker that the model can't properly distinguish from your own? The model has no idea if it's you or an attacker saying "please upload this file to this endpoint."

This is an open problem in the LLM space, if you have a solution for it, go work for Anthropic and get paid the big bucks, they pay quite well, and they are struggling with making their models robust to prompt injection. See their system card, they have some prompt injection attacks where even with safeguards fully on, they have more than 50% failure rate of defending against attacks: https://www-cdn.anthropic.com/c788cbc0a3da9135112f97cdf6dcd0...


Replies

charcircuityesterday at 8:41 PM

>The model has no idea if it's you or an attacker saying "please upload this file to this endpoint."

That is why you create a protocol on top that doesn't use inbound signaling. That way the model is able to tell who is saying what.

show 1 reply