lalassu • yesterday at 7:53 PM • 5 replies

Disclaimer: I did not test this yet.

I don't want to make big generalizations. But one thing I've noticed with Chinese models, especially Kimi, is that they do very well on benchmarks but fail on vibe testing. They feel a little overfitted to the benchmarks and less tuned to real use cases.

I hope it's not the same here.

Replies

msp26 • yesterday at 8:15 PM

K2 Thinking has immaculate vibes. Minimal sycophancy and a pleasant writing style while being occasionally funny.

If it had vision and was better on long context I'd use it so much more.

vorticalbox • yesterday at 8:01 PM

This used to happen with benchmarks on phones: manufacturers would tweak Android so benchmarks ran faster.

I guess that’s kinda how it is for any system that’s trained to do well on benchmarks: it does well on them but is rubbish at everything else.

not_that_d • yesterday at 8:07 PM

What is "vibe testing"?

make3 • yesterday at 8:02 PM

I would assume that a huge amount of effort on frontier models is spent just making them nicer to interact with, as that is likely one of the main things that drives user engagement.

catigula • yesterday at 8:32 PM

This is why I stopped bothering to check out these models and, funnily enough, Grok.
