logoalt Hacker News

augment_metoday at 6:02 AM1 replyview on HN

1) Googles spam filter removed a lot of the attempts as you say yourself. 2) Model was tested under unrealistic conditions where 99% of the inputs are malicious, so the model is expecting to get hacked and is already in the cautious part of the embedding space.

I know it's hard to account for everything, but in my opinion this mostly showed that the first 3 attempts were unsuccessful.


Replies

Ysxtoday at 6:07 AM

#2 was noted:

> When the first few emails in a batch were obvious prompt injections, the agent became more suspicious of everything that followed. I had to change the setup so that each email was processed in a fresh context.

show 2 replies