Hacker News

regularfry today at 2:41 PM

The difference in outcome isn't that big, but yes, you need to be more rigorous. For instance, I've found that the Kimi K2.5 and K2.6 models will comment out failing tests rather than fix a problem they just caused (mistaking them for "pre-existing failures"), so you need to specifically make commented-out tests break the build. I've not personally had that problem with any of the Anthropic or OpenAI models.


Replies

torginus today at 5:10 PM

I wonder why models naturally tend to BS or do stuff like this when they don't have the correct answer. It's clear that refusal can be programmed into them, but for some reason it has to be injected after the fact, and models can't really arrive on their own at the conclusion that they can't answer properly.
