
hrmtst93837, yesterday at 9:14 PM

Focusing on flashy breakthroughs hides the fact that bigger models and better merge-benchmark scores rarely translate to reliability in real codebases. For routine merges, subtle regressions and context quirks matter more than headline progress. Unless evals stress nasty scenarios like multi-file renames with tricky conflicts, the numbers are mostly for show. Progress will plateau until someone tunes for the boring, messy cases that waste dev time.