
turnsout · yesterday at 2:57 PM

This is probably entirely down to subtle changes to CC prompts/tools.

I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.

Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?


Replies

FfejL · yesterday at 3:02 PM

Honest, good-faith question.

Is CC getting better, or are you getting better at using it? And how do you know the difference?

I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.

billylo · yesterday at 3:02 PM

That's why benchmarks are useful. We all suffer from the shortcomings of human perception.
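For concreteness, here's a minimal sketch of such a benchmark: a fixed task set with deterministic pass/fail checks, re-run periodically and logged, so comparisons don't rest on memory. The query_model() wrapper is hypothetical; in practice you'd swap in whatever API or CLI you're actually measuring.

    # Minimal longitudinal eval sketch. query_model() is a hypothetical
    # stand-in; the canned answers just make the sketch runnable.
    import datetime
    import json

    def query_model(prompt: str) -> str:
        # Replace with a real API call (e.g. via the Anthropic SDK).
        canned = {"even": "def is_even(n):\n    return n % 2 == 0",
                  "17 * 23": "391"}
        return next((v for k, v in canned.items() if k in prompt), "")

    # Fixed tasks with deterministic checks, so runs on different days
    # are directly comparable.
    TASKS = [
        {"prompt": "Write a Python function is_even(n) for even n.",
         "check": lambda out: "n % 2 == 0" in out},
        {"prompt": "What is 17 * 23? Answer with the number only.",
         "check": lambda out: "391" in out},
    ]

    def run_suite() -> dict:
        passed = sum(1 for t in TASKS if t["check"](query_model(t["prompt"])))
        return {"date": datetime.date.today().isoformat(),
                "pass_rate": passed / len(TASKS)}

    # Append each run to a log; a falling pass_rate over time is evidence
    # that doesn't depend on anyone's memory of how the model "felt".
    with open("eval_log.jsonl", "a") as f:
        f.write(json.dumps(run_suite()) + "\n")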

arcanemachinery · yesterday at 4:19 PM

The easiest way would be to quantize the model and serve different quants based on current demand. Higher volume == worse quant == more customers served per GPU.
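A purely speculative sketch of what that routing could look like; nothing public confirms any provider does this, and the pool sizes and quant levels below are invented:

    # Hypothetical demand-based quant routing. All numbers are made up.
    from dataclasses import dataclass

    @dataclass
    class Pool:
        quant: str     # weight precision the replicas in this pool run at
        capacity: int  # concurrent requests the pool can absorb

    POOLS = [
        Pool("bf16", 1000),  # full quality, fewest users per GPU
        Pool("int8", 3000),  # cheaper, some quality loss
        Pool("int4", 8000),  # cheapest, most quality loss
    ]

    def route(current_load: int) -> Pool:
        # Spill traffic to lower-precision pools as load grows.
        remaining = current_load
        for pool in POOLS:
            if remaining <= pool.capacity:
                return pool
            remaining -= pool.capacity
        return POOLS[-1]  # saturated: everyone gets the cheapest quant

    print(route(500).quant)   # bf16 -- quiet hours
    print(route(5000).quant)  # int4 -- peak hours, same request, "dumber" model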

fragebogen · yesterday at 3:00 PM

I was going to ask: are all other variables accounted for? Are we really comparing apples to apples here? Still worth doing, obviously, as it serves as a good e2e evaluation, if only for curiosity's sake.

gpm · yesterday at 6:07 PM

I upvoted, but

> Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts?

The article actually links to this fine postmortem by Anthropic that demonstrates one way this is possible (software bugs affecting inference): https://www.anthropic.com/engineering/a-postmortem-of-three-...

Another way this is possible is the model reacting to "stimuli", e.g. the hypothesis at the end of 2023 that the (then current) ChatGPT was getting lazy because it was seeing that the date was in December and it associated winter with shorter, lazier responses.
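That mechanism is mundane rather than mysterious: serving stacks commonly template the current date into the system prompt, so the model's input drifts with the calendar even while the weights stay frozen. A toy illustration (the template is invented, not Anthropic's actual prompt):

    # Frozen weights, drifting input: the date is templated into the
    # system prompt, so behavior can vary by season with no model update.
    import datetime

    SYSTEM_TEMPLATE = "You are a helpful assistant. The current date is {date}."

    def build_system_prompt(today: datetime.date) -> str:
        return SYSTEM_TEMPLATE.format(date=today.strftime("%B %d, %Y"))

    # Same weights, different input -> potentially different output distribution.
    print(build_system_prompt(datetime.date(2023, 6, 15)))   # June prompt
    print(build_system_prompt(datetime.date(2023, 12, 15)))  # December prompt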

A third way this is possible is the actual conspiracy version: Anthropic might make changes to make inference cheaper at the expense of response quality, e.g. quantizing weights further or making certain changes to the sampling procedure.
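A toy illustration of the sampling-side variant: the model's (frozen) next-token preferences are identical, but looser sampling settings visibly change the output distribution. All values are invented:

    # Same logits, different sampling knobs, different output quality.
    import math
    import random

    def sample(logits: dict[str, float], temperature: float, top_k: int) -> str:
        # Keep the top_k most likely tokens, then sample at `temperature`.
        top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        weights = [math.exp(v / temperature) for _, v in top]
        return random.choices([tok for tok, _ in top], weights=weights)[0]

    # Frozen next-token preferences for some fixed context:
    logits = {"correct_fix": 2.0, "plausible_bug": 1.2, "nonsense": 0.1}

    random.seed(0)
    conservative = [sample(logits, 0.2, top_k=1) for _ in range(5)]
    loose        = [sample(logits, 1.5, top_k=3) for _ in range(5)]
    print(conservative)  # all "correct_fix"
    print(loose)         # likely a mix, including the occasional bad token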