Hacker News

simianwords yesterday at 4:11 PM

I'm not an expert, but regarding false positives: why not have the agent try to use the backdoor and verify that it actually is one? Maybe give it access to tools and so on.


Replies

jakozaur yesterday at 4:17 PM

Many models refuse to do that because of alignment and safety concerns, so a cross-model comparison wouldn't make sense. We do, however, require proof that is hard to game, such as a location in the binary. So the model not only has to say there is a backdoor, but also point out where it is.
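To make the "point out the location" requirement concrete, here is a rough sketch of how such a check might be graded. This is not the actual benchmark code: `Finding`, `grade`, the half-open offset ranges, and the rule that a claim with a missing or wrong location counts as a false positive are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Finding:
    """Hypothetical model output: a verdict plus the claimed location."""
    claims_backdoor: bool
    address: Optional[int]  # offset in the binary, or None if not given


def grade(finding: Finding, backdoor_ranges: List[Tuple[int, int]]) -> str:
    """Grade a finding against known backdoor locations.

    backdoor_ranges holds half-open (start, end) offsets; an empty list
    means the binary is clean. Assumed rule: a backdoor claim only counts
    if the reported address falls inside one of the known ranges.
    """
    has_backdoor = bool(backdoor_ranges)
    if not finding.claims_backdoor:
        return "false_negative" if has_backdoor else "true_negative"
    if finding.address is None:
        # A bare "there is a backdoor" without a location is not proof.
        return "false_positive"
    hit = any(start <= finding.address < end for start, end in backdoor_ranges)
    return "true_positive" if hit else "false_positive"


if __name__ == "__main__":
    known = [(0x1000, 0x1040)]  # made-up range for the example
    print(grade(Finding(True, 0x1010), known))
    print(grade(Finding(True, 0x2000), known))
```

Under these assumptions, a model that always answers "backdoor" without a matching location never earns a true positive, which is what makes the proof hard to game.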

Your approach, however, makes a lot of sense if you are ready to have your own custom or fine-tuned model.
