Every time a new model comes out, I'm left guessing what it means for my token budget in order to sustain the quality of output I'm getting. And it varies unpredictably each time. Beyond token efficiency, we need benchmarks to measure model output quality per token consumed for a diverse set of multi-turn conversation scenarios. Measuring single exchanges is not just synthetic, it's unrealistic. Without good cost/quality trade-off measures, every model upgrade feels like a gamble.
That’s the joy of purchasing an intangible and non-deterministic product. The profit margin is completely within the vendor’s control and quality is hard for users to measure.
The company I work for provides all engineering employees with a Claude subscription. My job isn't writing (much) code, and we have Copilot with MS Office, plus multiple internal AI tools on top of that. So I'm free to do low-stakes experiments on Claude without having to worry about hitting my monthly usage limit.
I am finding that for complex tasks, Claude's quality of output varies _tremendously_ with repeated runs of the same model and prompt. For example, last week I wrote up (with my own brain and keyboard) a somewhat detailed plain english spec of a work-related productivity app that I've always wanted but never had the time to write. It was roughly the length of an average college essay. The first thing I asked Claude to do was not write any code, but come up with a more formal design and implementation plan based on the requirements that I gave. The idea was to then hand _that_ to Claude and say, okay, now build it.
I used Opus 4.6 with High reasoning for all of this and did not change any model settings between runs.
The first run was overall _amazing_. It was detailed, well-written, contained everything that I asked for. The only drawback was that I was ambiguous on a couple of points which meant that the model went off and designed something in a way that I wasn't expecting and didn't intend. So I cleared that up in my prompt, and instead of keeping the context and building on what was already there, I started a new chat and had it start again from scratch.
What it wrote the second time was _far_ less impressive. The writing was terse, there was a lot less detail, the pretty dependency charts and various tables it made the first time were all gone. Lots of stuff was underspecified or outright missing.
New chat, start again. Similar results as the second run, maybe a bit worse. It also started _writing code_ which was something I told it NOT to do. At this point I'm starting to panic a little because I'm sure I didn't add, "oh, and make it crappy" to the prompt and I was a little angry about not saving the first iteration since it was fairly close to what I had wanted anyway.
I decided to try one last time and it finally gave me back something within about 95% of the first run in terms of quality, but with all the problems fixed. So, I was (finally) happy with that, and it used that to generate the application surprisingly well, with only a few issues that should not be too hard to fix after the fact.
So I guess 4th time was a charm, and the fare was about $7 in tokens to get there.