In my harness I implemented apply_patch so that it just takes unified diffs for patch -p1. I was shocked to see how bad models are at generating them. I started logging diff failures to analyse them, and found:
- All models are terrible at generating line numbers for a proper diff, so give up on relying on them
- Some models (Owl-alpha) must have been post-trained on Codex transcripts, because they occasionally push its V4A patch format into any diff tool available
- Codex puts a lot of info in its system prompt about the desired patch style (preferring larger hunks over granular ones, etc.)
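To illustrate the line-number point: one way to tolerate bad hunk headers is to ignore the numbers in the @@ line and locate the hunk purely by its context and removed lines. This is a minimal sketch, not what my harness or patch(1) actually does. The name apply_hunk is made up for the example, it takes a single hunk without the ---/+++ file headers, and it refuses to guess when the context matches zero or several places.

```python
def apply_hunk(source: str, hunk: str) -> str:
    """Apply one unified-diff hunk by content, ignoring the @@ line numbers.

    Hypothetical helper for illustration only.
    """
    old, new = [], []
    for line in hunk.splitlines():
        if line.startswith("@@") or line.startswith("\\"):
            continue  # skip the hunk header and "\ No newline at end of file"
        tag, text = line[:1], line[1:]
        if tag in (" ", ""):  # context line (a bare empty line counts as context)
            old.append(text)
            new.append(text)
        elif tag == "-":      # line that must exist in the source
            old.append(text)
        elif tag == "+":      # line that exists only in the result
            new.append(text)
        else:
            raise ValueError(f"unexpected hunk line: {line!r}")

    lines = source.splitlines()
    # Find every position where the "old" side of the hunk matches exactly.
    matches = [
        i for i in range(len(lines) - len(old) + 1)
        if lines[i:i + len(old)] == old
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")

    i = matches[0]
    lines[i:i + len(old)] = new
    trailing = "\n" if source.endswith("\n") else ""
    return "\n".join(lines) + trailing
```

With this, a hunk whose header says line 99 still applies to a three-line file as long as its context is unique. The ambiguity error is the useful part: it can be sent back to the model as a request for more context.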
In my harness, I implemented tool_edit as a subset of Rob Pike’s Sam editor syntax [0].
It only needs ~650 tokens of system prompt to work. It’s pretty stellar.
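For a rough idea of what a Sam-style edit looks like, here is a toy sketch. It is not my actual tool_edit and covers far less than Sam: it only handles a /regexp/ address followed by one of a, i, c or d. The function name sam_edit is made up, it uses Python's re syntax rather than Sam's own regexps, and it does not support slashes inside the pattern or the text.

```python
import re

# /pattern/ followed by a command letter and, except for d, a /text/ argument.
_COMMAND = re.compile(r"/([^/]*)/([acid])(?:/([^/]*)/)?")


def sam_edit(text: str, command: str) -> str:
    """Apply one Sam-like command to the first match of its address.

    Hypothetical helper for illustration only.
    """
    m = _COMMAND.fullmatch(command)
    if m is None:
        raise ValueError(f"cannot parse command: {command!r}")
    pattern, op, arg = m.group(1), m.group(2), m.group(3)
    # d takes no text argument; a, i and c require one.
    if (op == "d") != (arg is None):
        raise ValueError(f"wrong arguments for {op!r}")

    hit = re.search(pattern, text)
    if hit is None:
        raise ValueError(f"address /{pattern}/ matched nothing")
    start, end = hit.span()

    if op == "a":   # append after the match
        return text[:end] + arg + text[end:]
    if op == "i":   # insert before the match
        return text[:start] + arg + text[start:]
    if op == "c":   # change the match
        return text[:start] + arg + text[end:]
    return text[:start] + text[end:]  # d: delete the match
```

For example, sam_edit("hello world", "/world/c/there/") returns "hello there". The edit is addressed by content, so there are no line numbers for the model to get wrong.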
[0] https://9p.io/sys/doc/sam/sam.html