Before November 30, 2022 that would have worked, but I think it stopped being reliable sometime betw...

ben_w • today at 12:50 PM • 0 replies • view on HN

Before November 30, 2022 that would have worked, but I think it stopped being reliable sometime between the original ChatGPT and today.

As per dead internet theory, how confident are we that the community which tells us which LLM is safe or unsafe is itself made of real people, and not mostly astroturfing by the owners of LLMs which are biased to promote things for money?

Even DIY testing isn't necessarily enough, deceptive alignment has been shown to be possible as a proof-of-concept for research purposes, and one example of this is date-based: show "good" behaviour before some date, perform some other behaviour after that date.

alt Hacker News