I'm suspicious of their results with regard to tool usage.
It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.
They claim that tool use didn't help, which surprised me... but they also said:
> To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.
And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!
The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
The str_replace and insert commands are essential for avoiding round-trip risky edits of the whole file.
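A rough sketch of what commands like these might look like, to show why they avoid the round trip: the model only emits the small snippet being changed, never the whole file. This is not the actual implementation of any real harness. The function names, the exactly-one-match rule and the line-number convention are assumptions made for illustration.

```python
def str_replace(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`.

    Hypothetical helper: refusing missing or ambiguous matches is an
    assumed safety rule, so a vague snippet cannot silently edit the
    wrong place.
    """
    if not old:
        raise ValueError("old must be a non-empty string")
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly 1 match for {old!r}, found {count}")
    return text.replace(old, new, 1)


def insert_after_line(text: str, line_no: int, new_text: str) -> str:
    """Insert `new_text` after 1-indexed line `line_no` (0 means top of file).

    Hypothetical helper: every other line is passed through untouched.
    """
    lines = text.splitlines(keepends=True)
    if not 0 <= line_no <= len(lines):
        raise ValueError(f"line_no must be between 0 and {len(lines)}")
    # If the preceding line has no trailing newline, add one so the
    # inserted text does not get glued onto it.
    if line_no > 0 and not lines[line_no - 1].endswith("\n"):
        lines[line_no - 1] += "\n"
    if not new_text.endswith("\n"):
        new_text += "\n"
    lines.insert(line_no, new_text)
    return "".join(lines)


doc = "alpha\nbeta\ngamma\n"
edited = str_replace(doc, "beta", "BETA")
edited = insert_after_line(edited, 1, "inserted")
print(edited)
```

Everything outside the matched snippet or insertion point is copied by the tool rather than regenerated by the model, which is the property a plain read_file() / write_file() pair lacks.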
They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and if it encouraged Python-based manipulation over reading and then writing the file.
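For illustration, this is the kind of short script a model could submit to a code execution tool instead of rewriting the file by hand. It is a sketch under assumptions: the helper name `replace_in_file` and the demo file are made up, and nothing here is taken from the paper's harness.

```python
import os
import tempfile
from pathlib import Path


def replace_in_file(path: str, old: str, new: str) -> int:
    """Replace every occurrence of `old` in the file and return the count.

    Hypothetical helper. It works on raw bytes so that line endings and
    any bytes outside the match are written back exactly as they were.
    """
    p = Path(path)
    data = p.read_bytes()
    old_b = old.encode("utf-8")
    new_b = new.encode("utf-8")
    count = data.count(old_b)
    if count == 0:
        raise ValueError(f"{old!r} not found in {path}")
    p.write_bytes(data.replace(old_b, new_b))
    return count


# Demo on a throwaway file (the file name and contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "notes.txt")
    Path(demo_path).write_bytes(b"total: 10\r\nstatus: draft\r\n")
    replaced = replace_in_file(demo_path, "draft", "final")
    result_bytes = Path(demo_path).read_bytes()
    print(replaced, result_bytes)
```

Only the script passes through the model. The document itself is read and written by the interpreter, so its length does not matter.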
Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...
The relevant prompt fragment is:
> You can approach the task in whatever way you find most effective: programmatically or directly by writing files
As with so many papers like this, the results reflect more on the design of the harness that the paper's authors used than on the models themselves. I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.
People love to interpret the results in the most negative way possible because it's a threat to their occupation and identity. I refer to HN specifically.
The fact of the matter is, if you want to edit a document by reading the document and then regurgitating the entire document with said edits... a human will do worse than a 25% degradation. It's possible for a human to achieve 0% degradation, but the human will have to ingest the document hundreds of times to achieve a state called "memorization". The equivalent in an LLM is called training. If you train a document into an LLM you can get parity with the memorized human edit in this case.
But the above is irrelevant. The point is LLMs have certain similarities with humans. You need to design a harness such that an LLM edits a document the same way a human would: Search and surgical edits. All coding agents edit this way, so this paper isn't relevant.
Unfortunately, the incomprehensible methodology, whether due to resource constraints or adopted purely for simplicity's sake, makes these papers worthless.