This is an already apparent problem in academia, though not for the reasons the article suggests.
It is not so much that the "tells" of poor-quality work are vanishing, but that even careful scrutiny of work done with AI is becoming too costly to be done only by humans. One only has so much time to read when, say, in economics journals, the appendices run to hundreds of pages.
Would love to hear whether other fields' journals are experiencing similar pressure, not only at the extensive margin (number of new submissions) but also at the intensive margin (effort needed to check each work).
With AI, we're cargo-culting understanding. We're reproducing the surface of having understood something, but we're robbing ourselves of the time and effort to truly do it.
Ultimately, to understand a thing is to do the thing. And to not understand (which is ok!) is to trust others to do so, proxy measures or not. Agreed that the future of work is in a precarious place: doing less and trusting more only works up to a point.
`simulacrum` is a great word, gotta add that to my vocabulary.
I think this is why middle managers seemed to be the first acolytes to the church of llm supremacy.
It's a weird space in middle management where all of the incentives other than true competency in the role push you to abstract the knowledge work you're managing, and that abstraction seems to be well describable in embedding space.
A corollary of this could be that people interested in Serious Work will never use LLMs. Could be the new "tell".
> The training doesn't evaluate "is the answer true" or "is the answer useful." It's either "is the answer likely to appear in the training corpus" or "is the RLHF judge happy with the answer." We are optimising LLMs to produce output which looks like high quality output.
It's not quite as dire as this. One of the main reasons why LLMs are getting better over time is that they are themselves used to bootstrap the next generation, by sifting through the training set and doing 'various things' to it.
People often forget that the training corpus contains everything humanity ever produced, and anything new humanity produces will likely come from it as well. Torturing it with current-generation models is among the most productive things you can do to improve the next generation of systems.
Everybody's output is someone else's input. When you generate quantity by using an LLM, the other person uses an LLM to parse it and generate their own output from their input. When the very last consumer of the product complains, no one can figure out which part went wrong.
I think this is pretty obvious for many of us in the industry. Unfortunately, there is so much money on the table that the big players will shove whatever they want down our throats
It's a funny thing to write, like an article in an old newspaper that aged quickly. I suspect that this will be wildly out of date within 2-3 years.
"They sound very confident," was a warning a gave a lot on a project a year ago, before I gave up trying to get developers to stop blindly trusting the output and submitting things that were just wrong. The documentation of that team went to absolute shit because the developers thought LLMs magically knew everything.
"The simulacrum is never what hides the truth - it is truth that hides the fact that there is none. The simulacrum is true." - Jean Baudrillard
Aligned with the theory of Bullshit Jobs - LLMs expose the fact that the white-collar work most of us have been doing was actually bullshit all along. When LLMs "fake" work, it actually hides the reality that there was no meaningful work here in the first place.
If you have a test that fails 50% of the time - is that test valuable or not? A 50% failure rate alone looks like a coin toss, but by itself it does not tell us whether the test is noise or whether it is separating bad states from good ones. For a test to be useful it needs a positive Youden's statistic (https://en.wikipedia.org/wiki/Youden%27s_J_statistic): sensitivity + specificity - 1. A 50% failure rate alone does not let us calculate sensitivity and specificity.
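A quick sketch with made-up counts (hypothetical numbers, not from any real test suite) of how the same 50% failure rate can be either informative or pure noise once sensitivity and specificity enter the picture:

```python
def youdens_j(tp, fn, tn, fp):
    """Youden's J = sensitivity + specificity - 1.
    J = 0 means the test is no better than a coin toss; J = 1 means it
    separates bad states from good ones perfectly."""
    sensitivity = tp / (tp + fn)  # fraction of genuinely bad states the test catches
    specificity = tn / (tn + fp)  # fraction of good states the test passes
    return sensitivity + specificity - 1

# Hypothetical counts, both with a 50% overall failure rate (50 failures in 100 runs):
print(youdens_j(tp=45, fn=5, tn=45, fp=5))    # ~0.80: failures track real defects
print(youdens_j(tp=25, fn=25, tn=25, fp=25))  # 0.0: failures unrelated to defects
```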
I can see a similar problem with this article - the author notices that LLMs produce a lot of errors, then concludes that they are useless and produce only a simulacrum of work. The author has an interesting observation about how LLMs disrupt the way we judge knowledge work. But when he concludes that LLMs do only a simulacrum of work - this is where his argument fails.
"How do you know the output is good without redoing the work yourself?"
Verifying the correctness of solutions is often much easier than finding correct solutions yourself. Examples: Sudoku and most practical problems in just about any field.
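To make that asymmetry concrete, here's a toy sketch of my own (not from the thread): verifying a completed 9x9 Sudoku grid is a few lines and a single pass over the board, while producing the grid in the first place means searching a huge space of candidate fills.

```python
def is_valid_sudoku(grid):
    """Check a completed 9x9 grid: every row, column, and 3x3 box must
    contain the digits 1..9 exactly once. Verification is trivial;
    solving is a search problem."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in (0, 3, 6) for c in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)
```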
-
"The training doesn't evaluate 'is the answer true' or "is the answer useful.'"
Let's pretend RLVR does not exist, to give this argument a chance. Then, while the training loop does not validate accuracy directly I guess, the meta-training loop still does. When someone prompts a model, the resulting execution trace shows whether the generated answer is correct or not, and this trace is kept for subsequent training runs. The way coding agents are used productively is not: a) generate code with AI and b) run it yourself; it's a) ask the AI to do something, including generating the code and running it too, no step b. This naturally creates large training sets of correct and incorrect solutions.
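A rough sketch of what that meta-loop could look like - all names here (collect_trace, generate_code, traces.jsonl) are placeholders of mine, not any vendor's API: the model writes the code, a harness runs it, and the pass/fail outcome is kept as a labelled example for a later training run.

```python
# Hypothetical harness: names and structure are illustrative only.
import json, subprocess, tempfile

def collect_trace(task_prompt, generate_code):
    """Ask a model for code, execute it, and keep the outcome as a
    labelled (prompt, code, passed) record for later training."""
    code = generate_code(task_prompt)  # placeholder for the actual model call
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=60)
    trace = {
        "prompt": task_prompt,
        "code": code,
        "passed": result.returncode == 0,          # crude verifiable signal
        "stderr": result.stderr.decode()[:2000],   # keep some failure context
    }
    with open("traces.jsonl", "a") as log:         # accumulate labelled examples
        log.write(json.dumps(trace) + "\n")
    return trace
```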
-
"We spent billions to create systems used to perform a simulacrum of work."
Have you even tried using these systems to produce valuable work? How could this possibly be your conclusion after having tried them?
Why is it not more of a scandal that all these anti-AI articles are written using large language models?
Why is that not an embarrassment for everyone who moans and carps and complains about the craft?
“/reliable-resources-skill Claude, using the list of approved resources, evaluate the report I’m attaching”
I don't really agree with the premise of the article. Sure, proxy measures are everywhere. But for knowledge work specifically you can usually check real quality. Of course it's not as easy as "oh, this report contains a few spelling errors", but it is doable. If you accepted work purely based on superficial proxy measures, you were not fairly evaluating work at all.
>"is the RLHF judge happy with the answer."
Reinforcement Learning with Verifiable Rewards (RLVR) to improve math and coding success rates seems like an exception.
>We've automated ourselves into Goodhart's law.
Yes.
This does not however mean that progress is not being made.
It just means the progress is happening along dimensions that are completely illegible in terms of the culture of the early 21st-century Internet, which is to say in terms of the values of the society which produced it.
The FUD about LLMs will never get old. The way I know and trust LLMs is the same way a manager would trust their direct reports to do good work.
For most tasks, the complexity/time required to verify a task is << the time required to do the task itself. Sure there can be hallucinations on the graph that the LLM made. But LLMs are hallucinating much less than before. And the time to verify is much lower than the time required for a human to do the task.
I wrote a post detailing this argument https://simianwords.bearblog.dev/the-generation-vs-verificat...
The article asserts that the quality of human knowledge work was easier to judge based on proxy measures such as typos and errors, and that the lack of such "tells" in AI poses a problem.
I don't know if I agree with either assertion… I've seen plenty of human-generated knowledge work that was factually correct, well-formatted, and extremely low quality on a conceptual level.
And AI signatures are now easy for people to recognize. In fact, these turns of phrase aren't just recognizable—they're unmistakable. <-- See what I did there?
Having worked with corporate clients for 10 years, I don't view the pre-LLM era as a golden age of high-quality knowledge work. There was a lot of junk that I would also classify as a "working simulacrum of knowledge work."