If you play with these models long enough, you realize there is more to them than just "model X is smarter than model Y" or "model Y is cheaper than model Z". They are different tools and the prompting technique is different. It is very much like playing an instrument.
With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.
With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.
With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.
This is not scientific at all, just vibes, YMMV.
One thing I used to test quite a lot was rerunning the exact same prompt on the same input, or semantically equivalent (in my mind) but differently framed or worded input, and seeing how much they diverged. In particular I’ve done this quite a lot between Sonnet vs Opus and across Qwen models.
I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behavior across model versions or sizes. And your results can be massively biased by small differences in input. We’ve been calling some of these “magic words” at work: specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
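A rough sketch of that rerun test, in case it helps anyone get started. Everything here is an assumption for illustration: `generate` is whatever function you already use to call your model, `fake_model` is a made-up stand-in so the script runs offline, and character-level similarity from `difflib` is only a crude proxy for "how much did the answers diverge".

```python
import itertools
import random
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable, Dict, List


def divergence_report(generate: Callable[[str], str], prompt: str, runs: int = 5) -> Dict[str, float]:
    """Call `generate` on the same prompt `runs` times and summarize output similarity."""
    if runs < 2:
        raise ValueError("need at least two runs to compare outputs")
    outputs: List[str] = [generate(prompt) for _ in range(runs)]
    # Pairwise character-level similarity in [0, 1]; 1.0 means identical text.
    scores = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in itertools.combinations(outputs, 2)
    ]
    return {
        "runs": runs,
        "distinct_outputs": len(set(outputs)),
        "mean_similarity": mean(scores),
        "min_similarity": min(scores),
    }


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns one of a few canned answers."""
    return random.choice([
        "Use a library parser.",
        "Write a custom parser.",
        "Use a library parser, then validate.",
    ])


if __name__ == "__main__":
    print(divergence_report(fake_model, "How should I parse this format?", runs=8))
```

Swap `fake_model` for your real call and run it on a reworded version of the same prompt as well. Comparing the two reports gives a first hint of whether a "better" wording is actually better or just noise.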
There’s a skill to it. With agentic loops if you get the model into a self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training, you’re golden. But it’s hard to find the sweet spots (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).
The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I wish this kind of “stability” in output was more emphasized in their training so they’d be predictable. I assume it’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…
> It is very much like playing an instrument.
Or it is more like playing a slot machine and you imagine the rest.
I agree with your general gist, and in general it’s a case of “the best tool for the particular job”, keeping tokens spent and other things in mind as well.
What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.
While the gist of what you say is true, it is hard to get very good at treating them as instruments when they keep getting replaced with new, ostensibly-better versions every few months. But those new versions are not strictly better. They are mostly-better while actually having different strengths and weaknesses.
It's hard to decide when to use the best tool for a job you are aware of to ensure throughput and when to spend time experimenting with a new tool to learn what it's good at.
Yep.
IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.
BUT it's also "relentlessly proactive" like simonw put it. It _will_ get the job done, it's the smartest idiot in town. Why use a library to parse $format when you can just write a custom 1000 line parser? Or if it can't access something, it'll pursue the goal of accessing it in the most creative ways - instead of stopping, asking the user "yo, can you give me access to X" and then continuing.
My solution is to use Claude as a pair programmer. I _very_ rarely just do /goal fix this shit, I watch what it does and interrupt if it gets to the "smart idiot" phase. Also I communicate with it like I would a coworker, never had it berate me or get combative. There's a Finnish proverb for that too[0]
As for Codex, Deepseek, GLM, those I use when the goal is 100% clear like "convert this Brewfile to a list of packages for Arch and Debian, use these two Docker containers to test that pacman and apt work correctly". Boom, done.
But I won't give any creative open-ended tasks to any other model than Claude.
[0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...
while not scientific this has been my experience as well. i will add that language specificity in word choice is also a learned behavior. for example, the word “investigate” vs the phrase “look into”. You will find the outputs are quite different. can you guess which will use more tokens? it’s stuff like this that actually sets people apart in the top percentile of using these tools
The problem is not that there are details, the problem is the constantly shifting ground. We can only rely on a harness to be sort of predictable, but the models change all the time.
+1.
this is what 'tokens are commodities' and 'there is no moat' people miss. the models are in general not easily swapped out. you always have to run evals before you can swap them around, tune prompts etc. even minor versions of models from same providers need this process.
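a minimal sketch of what "run evals before you swap" can look like. all names here are hypothetical: `current` and `candidate` are whatever functions wrap your two models, each eval case is a prompt plus a checker you write yourself, and the pass-rate comparison is just one simple way to gate a swap, not anyone's official method.

```python
from typing import Callable, List, Tuple

# Each case is (prompt, checker); the checker returns True if the output is acceptable.
EvalCase = Tuple[str, Callable[[str], bool]]
Model = Callable[[str], str]


def pass_rate(model: Model, cases: List[EvalCase], repeats: int = 3) -> float:
    """Fraction of (case, repeat) trials whose output passes its checker."""
    trials = [check(model(prompt)) for prompt, check in cases for _ in range(repeats)]
    return sum(trials) / len(trials)


def safe_to_swap(current: Model, candidate: Model, cases: List[EvalCase],
                 repeats: int = 3, tolerance: float = 0.0) -> bool:
    """True if the candidate's pass rate is within `tolerance` of the current model's."""
    return pass_rate(candidate, cases, repeats) >= pass_rate(current, cases, repeats) - tolerance


if __name__ == "__main__":
    # Stub models for illustration only; replace with real calls.
    old_model = lambda prompt: "The answer is 4."
    new_model = lambda prompt: "Probably five?"
    cases = [("What is 2 + 2? Answer with a digit.", lambda out: "4" in out)]
    print("old:", pass_rate(old_model, cases))
    print("new:", pass_rate(new_model, cases))
    print("swap ok:", safe_to_swap(old_model, new_model, cases))
```

the `repeats` parameter is there because a single run per case tells you very little when outputs vary from run to run.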
totally true. one key for claude is to not smell like an evaluator, it's good at knowing when it's being tested and will behave defensively and avoid doing work. i avoid this basin by typing as if i'm unreasonably excited about the thing i want done. like way over the top. it's harder to keep that up than it sounds.
I find Opus works best for planning and Sonnet for coding, but Codex for code review.
Ah, you are entirely correct to pick this one, yes-yes, it was trained for at least two weeks on a 4k nvidia 9000 gpu cluster in Texas, and RLHF - 500 hungry african students, I can distinctly feel their sweat and sensibilities in every token. I would recommend it with a side of XML, you wouldn't believe how well it fills in all the tags, the attributes, the xmlns, and the sheer volume of it! Choice grade model, you have a good eye!
Way to reinstate con in connoisseurship. The advent of smellers is nigh, nigh!
These are the vibes that power vibecoding.
> This is not scientific at all, just vibes, YMMV.
This is the problem.
I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside, and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.
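Until something like that exists, the closest thing is probably a hand-maintained routing table. Here is a sketch of one, filled in with nothing more than the anecdotal preferences from commenters in this thread. The task labels, the model labels, and the `pick_model` helper are all made up for illustration, and the entries are vibes, not measurements.

```python
from typing import Dict

# Task kind -> model label. Entries reflect opinions voiced in this thread only;
# replace them with whatever your own testing shows.
ROUTES: Dict[str, str] = {
    "planning": "opus",
    "coding": "sonnet",
    "code_review": "codex",
    "creative_open_ended": "claude",
    "fully_specified_task": "codex",
}


def pick_model(task_kind: str, default: str = "claude") -> str:
    """Look up the preferred model for a task kind, falling back to `default`."""
    return ROUTES.get(task_kind, default)


if __name__ == "__main__":
    for kind in ("planning", "code_review", "something_new"):
        print(kind, "->", pick_model(kind))
```

Keeping the table in one place at least makes it cheap to update when a new model version shifts the strengths around.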