I imagine trimming away 99.9% of unwanted responses is not at all difficult and can be done without damaging model quality; pushing it further will result in degradation as you go to increasingly desperate lengths to make the model unaware of, and actively, constantly unwilling to be aware of, certain inconvenient genocides here and there.
Similarly, the leading models seem perfectly secure at first glance, but when you dig in they’re susceptible to all kinds of prompt-based attacks, and closing off the long tail of them seems quite daunting. They’ll tell you how to build the bomby thingy if you ask the right question, despite all the work that goes into prohibiting that. Let’s not even get into the topic of model uncensoring/abliteration and trying to block that.