This experiment needs to be put in perspective. Let me explain. If you ran this same experiment with a human, having them read an entire document and then reproduce it from memory with edits, the document would degrade even more.
The way this experiment is conducted is not in line with how current agentic AI is used, or even with how humans edit documents.
Here's how agentic AI tools typically do edits:

1. They read the whole document.
2. They come up with a patch: a diff of the section they want to edit.
3. They change that section only.
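The targeted-edit workflow above can be sketched in a few lines. This is a hypothetical helper, not Claude Code's actual implementation: the model names the exact text to replace, and the harness swaps only that span, leaving the rest of the document byte-for-byte untouched.

```python
def apply_edit(document: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`.

    Fails loudly if the target is missing or ambiguous, so a
    hallucinated or stale `old` string raises an error instead of
    silently corrupting the file.
    """
    count = document.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return document.replace(old, new, 1)


doc = "line one\nline twoo\nline three\n"
doc = apply_edit(doc, "line twoo", "line two")
```

The key property is that degradation can only occur inside the span the model explicitly targeted; it never has to regurgitate the other 99% of the document.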
This is NOT what that experiment was testing. A 25% degradation rate would render the whole industry dead; no one would be using Claude Code if edits were that lossy. The reality is that everyone is using Claude Code.
AI is alien to the human brain, but in many ways it is remarkably similar. This is one such similarity: neither can edit a whole document holistically to produce a single change. It has to be targeted, surgical edits rather than a regurgitation of the entire document with said edit.
>If you ran this same experiment with a human, having them read an entire document and then reproduce it from memory with edits, the document would degrade even more.
Except that isn't how humans edit documents, and it isn't how LLMs work either.
When a human edits a document, they don't typically "reproduce said document with edits", by which I assume you mean read the document and then reproduce it from memory. They have the document in front of them, either physically printed out or open in a word processor. To make edits they either cross out text and write in the change, or, in a word processor, delete the text and replace it with something better. There's no need to hold the entire document in memory and reproduce it from scratch.
The same goes for the LLM: it has access to the original document at all times. It can remove sections and replace them.
But the LLM hallucinates.
And if you give a document to a human high on LSD to edit, you might get some weird edits back.