I’ve tested this model on four of my benchmarks:

zone411 • yesterday at 5:10 AM • 2 replies

baxtr • yesterday at 10:24 PM

Good stuff!

Is there a reason you change the leaderboard graphs for the third and fourth ones?

Also: would be great to have an overview page with a summary across all tests, like a total score or similar.

CamperBob2 • yesterday at 9:44 PM

Would be interesting to see the 27B dense Qwen 3.6 model thrown into the mix.
