ctoth • today at 6:20 PM

Yeah LOL tell me I'm holding it wrong again. Actually Boris, I am tracking what is happening here. I see it, and I'm keeping receipts[0]. This started with the 4.6 rollout, specifically with the unearned confidence and not reading as much between writes. The flail quotient has gone right the hell up. If your evals aren't showing that then bully for your evals I reckon.

[0]: https://github.com/ctoth/claude-failures

Replies

lambda • today at 6:36 PM

I guess one of the things I don't understand: how do you expect a stochastic model, sold as a proprietary SaaS, with a proprietary (though briefly leaked) client, to be predictable in its behavior?

It seems like people are expecting LLM-based coding to work in a predictable and controllable way. And, well, no, that's not how it works, especially when you're using a proprietary SaaS model where you can't control the exact model used, the inference setup it's running on, the harness, the system prompts, etc. It's all just vibes; you're vibe coding and expecting consistency.

Now, if you were running a local weights model on your own inference setup, with an open source harness, you'd at least have some more control of the setup. Of course, it's still a stochastic model, trained on who knows what data scraped from the internet and generated from previous versions of the model; there will always be some non-determinism. But if you're running it yourself, you at least have some control and can potentially bisect configuration changes to find what caused particular behavior regressions.

malfist • today at 6:30 PM

It also completely ignores the increase in behavioral tracking metrics. A 68% increase in swearing at the LLM for doing something wrong needs to be addressed, and it isn't just "you're holding it wrong."

bcherny • today at 8:09 PM

Christopher, would you be able to share the transcripts for that repo by running /bug? That would make the reports actionable for me to dig in and debug.

quietsegfault • today at 6:26 PM

I’m not sure being confrontational like this really helps your case. There are real people responding, and even if you’re frustrated it doesn’t pay off to take that frustration out on the people willing to help.
