They clearly RLHF out the embarrassing cases and make cheating on benchmarks into a sport.
I wouldn't be surprised if some models are set up to identify that type of question and run the word through a string-processing function.
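
For illustration only, here is roughly what such a shortcut could look like, assuming "that type of question" means letter-counting prompts. This is a guess at the idea, not a claim about how any real model is built; the function name `answer_letter_count` and the regex are made up for the example.

```python
import re
from typing import Optional

# Hypothetical pattern for prompts like "How many r's are in strawberry?".
# It only covers one narrow phrasing; a real router would need far more.
_LETTER_COUNT = re.compile(
    r"how many ([a-z])'?s? (?:are |appear )?in (?:the word )?['\"]?([a-z]+)['\"]?",
    re.IGNORECASE,
)


def answer_letter_count(question: str) -> Optional[int]:
    """Return the letter count if the question matches, else None."""
    match = _LETTER_COUNT.search(question)
    if match is None:
        # Not a letter-counting question: fall through to the model.
        return None
    letter = match.group(1).lower()
    word = match.group(2).lower()
    # Plain string processing, no model involved.
    return word.count(letter)


if __name__ == "__main__":
    print(answer_letter_count("How many r's are in strawberry?"))
    print(answer_letter_count("What is the capital of France?"))
```

Anything that doesn't match the pattern returns `None`, so the shortcut would only ever fire on the narrow class of prompts it was written for.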