My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output.
A couple of weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and told Claude, pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for one part that went in a different direction than I intended because my description of how to go about it was too ambiguous.
I corrected that ambiguity in my essay, but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see whether it would write more or less the same thing as before. It did not--in fact, the output was FAR worse, even though I hadn't changed any model settings. The next two burned down, fell over, and then sank into the swamp, but the fourth was (finally) very much on par with the first.
My takeaway is that it's often okay (and probably good) to simply have Claude redo a task to get higher-quality output. Of course, if you're paying for your own tokens, that can get expensive in a hurry...
So will we have to do what image-generation people have been doing for ages: generate 50 versions of the output for a prompt, then pick the best one manually? Anthropic must be licking its figurative chops hearing this.
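The pick-the-best loop described here is essentially best-of-n sampling, and it's easy to automate the mechanical part. A minimal sketch, where `toy_generate` and the length-based score are purely hypothetical stand-ins for a real model call and a real quality judgment (neither is any vendor's actual API):

```python
import random

def best_of_n(prompt, n, generate, score):
    """Generate n candidate outputs for one prompt and keep the best-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a "generator" whose output varies from call to call,
# and a crude heuristic score (here, just output length).
rng = random.Random()

def toy_generate(prompt):
    filler = " ".join(rng.choice(["ok", "good", "great"]) for _ in range(rng.randint(1, 5)))
    return prompt + " " + filler

def toy_score(text):
    return len(text)

print(best_of_n("plan:", 5, toy_generate, toy_score))
```

The expensive part in practice isn't the loop, it's `score`: with images a human can eyeball 50 thumbnails in seconds, but judging 50 implementation plans still means reading them.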
This is my theory too. There’s a predictable cycle where the models “get worse.” They probably don’t. A lot of people just take a while to really hit hard against the limitations.
And once you get unlucky you can’t unsee it.
> My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of [LLM] output.
I think you've learned that they're more nondeterministic than you had thought, but then wrongly connected that new understanding to the recent model degradation. Note: they've been nondeterministic the whole time, while the widely reported degradation is recent.
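For anyone who hasn't looked under the hood, this nondeterminism is by design: at nonzero temperature the model samples each token from a probability distribution rather than always taking the most likely one, so identical prompts and settings can still diverge. A minimal sketch of plain temperature-scaled softmax sampling (not any particular vendor's implementation):

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample one token index from logits via temperature-scaled softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling: walk the cumulative distribution until we pass r.
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Same logits, same temperature: two independently seeded runs can diverge,
# while a fixed seed makes the draws reproducible.
logits = [2.0, 1.5, 0.3]
seeded = random.Random(42)
print([sample_token(logits, 0.8, seeded) for _ in range(5)])
```

Lower the temperature and the distribution sharpens toward the top token; at temperature approaching zero it becomes effectively greedy, which is why "deterministic-feeling" output is a setting, not a property of the model.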
You probably could have written the low-stakes productivity app yourself in a fraction of the time you spent on all this.