Just because it performs rather poorly by comparison doesn’t mean it isn’t benchmaxxed. It can still be worse than it appears.
It isn't benchmaxxed because they are using human preference as an evaluation.