epistasis • yesterday at 8:35 PM

IMHO the benchmarks aren't useful, and rankings among the frontier models are mostly noise. The extra features around the coding agent have a much bigger impact on productivity than whether I have to give one model slightly more specification and guidance than another. Whether I get a 90% or a 92% success rate on the tasks I ask it to do depends far more on what I say than on what the model is capable of.
