logoalt Hacker News

tvinktoday at 6:55 AM1 replyview on HN

I guess you.. ask them a bunch of recommendations? I would imagine this would not be incredibly hard to test as a community


Replies

ben_wtoday at 12:50 PM

Before November 30, 2022 that would have worked, but I think it stopped being reliable sometime between the original ChatGPT and today.

As per dead internet theory, how confident are we that the community which tells us which LLM is safe or unsafe is itself made of real people, and not mostly astroturfing by the owners of LLMs which are biased to promote things for money?

Even DIY testing isn't necessarily enough, deceptive alignment has been shown to be possible as a proof-of-concept for research purposes, and one example of this is date-based: show "good" behaviour before some date, perform some other behaviour after that date.