
NitpickLawyer • 01/03/2026 • 0 replies

swe-rebench is a pretty good indicator. They collect "new" tasks every month and test the models on those. For open models it's a good indicator of task performance, since the tasks are collected after the models are released. It's a bit trickier for evaluating API-based models, but it's the best concept yet.
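
To make the idea concrete, here's a rough sketch of the date filter being described. This is not swe-rebench's actual code; the `Task` type, the `uncontaminated` helper, and all the dates are made up for illustration. The only assumption is that each task has a collection date and each model has a release date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    task_id: str   # hypothetical identifier
    created: date  # when the task was collected (assumed field)


def uncontaminated(tasks, model_release):
    """Keep only tasks collected strictly after the model's release date."""
    return [t for t in tasks if t.created > model_release]


# Made-up example data, one task per month.
tasks = [
    Task("a", date(2025, 10, 5)),
    Task("b", date(2025, 12, 1)),
    Task("c", date(2026, 1, 2)),
]

fresh = uncontaminated(tasks, model_release=date(2025, 11, 15))
print([t.task_id for t in fresh])  # ['b', 'c']
```

For an open model the release date gives a clear value for `model_release`. For an API-based model it's less obvious what date to pass in, which is where it gets tricky.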
