logoalt Hacker News

anuramattoday at 1:27 AM3 repliesview on HN

> issues like prompt injection are unfixable

how is it unfixable? do you mean "there's always a positive chance"?


Replies

dijksterhuistoday at 1:35 AM

normal

    y = f(x)
prompt injection / adversarial example (same thing really)

    bad_y = f(x+badness)
tweak badness enough you will get bad outputs. no matter the defences.

the only ways to fully “fix” it ie to make prompt injection never possible

1. don’t use ai

2. know the entire input space, output space and the mapping between them. but then we’re not doing machine learning anymore, see 1.

otherwise we’re left with mitigations. and mitigations are always a cat and mouse game with defenders (blue team) catching up. its never “fixed”. the latest thing just gets “patched”.

show 1 reply
windexh8ertoday at 2:44 AM

There is never going to be a non-zero chance with a non-deterministic system. You can put every guard rail in place and there will always be a different way tokens are input to get bad, or subjective, tokens as output.

The findings are sick and disturbing, I hope OpenAI is not only sued for it but also that Sam Altman along with Elon, Dario and Sundar should all be held accountable in front of Congress. All of these assholes have intentionally put sexual content in their models, likely including CSAM, and so if they cannot prove that it isn't part of their training data then maybe they should be able to operate as they are today.

Where is fear mongering Dario now? He loves to drag his trope around about how advanced and dangerous his models are with respect to cyber security. Yet... We never hear him say how dangerous they could be with respect to generation of CSAM! Maybe because that wouldn't help him IPO?

show 1 reply
solid_fueltoday at 1:39 AM

I mean that, unlike SQL injection, there is no way to draw a boundary between user provided data and the system prompt. It can't be done. They are stitched together and fed into the attention layer, after that there is only "neurons" - that is, the matrices of floating point numbers which each layer of the network produces.

You cannot separate data that was input by the user and data that is from the system once it is mixed together like that. Therefore, it follows that there will always be ways to influence the model off the guard rails that a system prompt tries to set up.

Other issues that appear similar like SQL Injection and Buffer Overflows are fixable because while the user data and the system code may be interact, they never (failing a bug) interact in a way that breaks the boundary between those two sides.

show 3 replies