logoalt Hacker News

epistasisyesterday at 8:35 PM0 repliesview on HN

IMHO the benchmarks aren't useful, and ranking among the frontier models is mostly noise. The extra features around the coding agent have a much bigger impact on productivity than having to provide slightly more specification and guidance to the models; a 90% success rate versus a 92% success rate on the tasks I ask it to do is far more influenced by what I say than what the model is capable of.