And therefore it scores worse on benchmarks?
On some it does, yes, and in real usage too.
It avoided answering 2 of the 21 tests in this specific benchmark, so its maximum possible score is already capped at about 90%.
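To spell out where the roughly 90% ceiling comes from, here is a quick sketch of the arithmetic. It assumes an unanswered test counts as zero and that all 21 tests are weighted equally, which the benchmark may or may not do.

```python
# Figures from the comment above: 21 tests, 2 left unanswered.
total_tests = 21
unanswered = 2

# Assumption: an unanswered test scores zero and every test has equal weight.
max_score = (total_tests - unanswered) / total_tests

print(f"Best possible score: {max_score:.1%}")  # prints "Best possible score: 90.5%"
```

So even with every attempted test answered perfectly, the score could not exceed 19/21 under those assumptions.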
Also, Claude/Fable models are quite bad at instruction following: https://artificialanalysis.ai/evaluations/ifbench