The key takeaway seems to be that the better Claude gets at producing polished output, the less users bother questioning it. They found that artifact conversations have lower rates of fact-checking and reasoning challenges across the board. That's an uncomfortable loop for a company selling increasingly capable models.
This is a highly circular method of evaluation. It correlates "fluency behaviors" with longer conversations and more back and forth.
What it notably does not do is correlate any of these behaviors with external value or utility.
It is entirely possible that the people getting the most value out of LLMs are the ones with shorter interactions, and that those who engage in lengthier interactions are distracting themselves, wasting time, or chasing rabbit trails (at the most charitable, the equivalent of falling into a wiki-hole).
I can't prove that either -- but this data doesn't weigh in one way or the other. It only confirms that people who are chatty with their LLMs are chatty with their LLMs.
In my own case, I find the longer I "chat" with the LLM the more likely I am to end up with a false belief, a bad strategy, or some other rabbit hole. 90% of the value (in my personal experience) is in the initial prompt, perhaps with 1-2 clarifying follow-ups.
I’m not alone in finding this at odds with the product’s claims, right?
Claude is meant to be so clever it can replace all white-collar work in the next n years, but also “you’re not using it right”? Which is it?
I feel like the authors introduce a logical inconsistency. They present the drop in "identify missing context" behavior in artifact conversations as potentially concerning, as if people are thinking less critically. But their own data suggests a simpler explanation: artifact conversations show higher rates of upfront specification (clarifying goals +14.7pp, specifying format +14.5pp, providing examples +13.4pp). When you provide more context upfront, there is naturally less missing context to identify later. I'd be more sceptical about such research.
You could arrive at the essence of this by just having read and internalized Carl Sagan's The Demon-Haunted World. Especially the Baloney Detection Kit.
In my experience good prompting is mostly just good thinking.
To the extent that this should be a thing at all, there are very few people I would less want doing it than a company that has repeatedly been caught lying about its product's achievements. Anthropic should not be taken seriously given their track record.
Honestly, to use LLMs properly all you need to know is that it's a next-word (or action) prediction model, and, like all such models, increased entropy hurts it. Try to reduce entropy to get better results. The rest is just sugarcoated nonsense dressed up as if you need a physics class to use LLMs properly.
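A back-of-the-envelope sketch of what "reduce entropy" means here. The function and the toy distributions are mine, purely for illustration and not from any real LLM API: a vague prompt leaves the model's next-token probability mass spread over many plausible continuations (high entropy), while a specific prompt concentrates it on a few (low entropy).

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions for illustration only.
vague    = [0.25, 0.25, 0.25, 0.25]   # four equally likely continuations
specific = [0.85, 0.05, 0.05, 0.05]   # one dominant continuation

print(shannon_entropy(vague))      # 2.0 bits
print(shannon_entropy(specific))   # ~0.85 bits
```

The intuition being argued: the more your prompt constrains the space of plausible continuations, the lower the entropy of each prediction step and the less room the model has to drift.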
> But we know that any person who uses AI is likely to improve at what they do.
Do we?