jaen • today at 5:07 PM

For similarly sized models, not looking very good on the slightly-less-benchmaxxed Terminal-Bench 2.0:

  Laguna XS.2  33B-A3B params: 30.6
  Qwen 3.6     35B-A3B       : 51.5
  Devstral 2   123B          : 31.2

Quite a huge lead for Qwen... well, at least it's catching up to other smaller Western labs.

megavon • today at 5:16 PM

Need to look at SWE-Bench Pro, it's super competitive. Suspect they'll catch up given the longer tail on TB scores.
