Disclaimer: I did not test this yet.
I don't want to make big generalizations. But one thing I've noticed with Chinese models, especially Kimi, is that they do very well on benchmarks but fail on vibe testing. They feel a little overfitted to the benchmarks and less tuned to actual use cases.
I hope it's not the same here.
This used to happen with benchmarks on phones: manufacturers would tweak Android so benchmarks ran faster.
I guess that’s kinda how it is for any system that’s trained to do well on benchmarks: it does well there but is rubbish at everything else.
I would assume that a huge amount of effort on frontier models goes into just making them nicer to interact with, as that is likely one of the main things that drives user engagement.
This is why I stopped bothering to check out these models and, funnily enough, Grok.
K2 Thinking has immaculate vibes. Minimal sycophancy and a pleasant writing style while being occasionally funny.
If it had vision and was better at long context, I'd use it so much more.