Hacker News

lalassu yesterday at 7:53 PM

Disclaimer: I did not test this yet.

I don't want to make big generalizations. But one thing I've noticed with Chinese models, especially Kimi, is that they do very well on benchmarks but fail on vibe testing. They feel a little overfitted to the benchmarks and less tuned to actual use cases.

I hope it's not the same here.


Replies

msp26 yesterday at 8:15 PM

K2 Thinking has immaculate vibes. Minimal sycophancy and a pleasant writing style while being occasionally funny.

If it had vision and was better on long context I'd use it so much more.

vorticalbox yesterday at 8:01 PM

This used to happen with benchmarks on phones: manufacturers would tweak Android so benchmarks ran faster.

I guess that’s kinda how it is for any system that’s trained to do well on benchmarks: it does well there but is rubbish at everything else.

not_that_d yesterday at 8:07 PM

What is "Vibe testing"?

make3 yesterday at 8:02 PM

I would assume that a huge amount is spent on frontier models just making them nicer to interact with, as that is likely one of the main things that drives user engagement.

catigula yesterday at 8:32 PM

This is why I stopped bothering to check out these models and, funnily enough, Grok.