The Rapid MLX team has done some interesting benchmarking that suggests Qwopus 27B is pretty solid. Their tool includes benchmarking features so you can evaluate your own setup.
They have a metric called the Model-Harness Index (MHI):
MHI = 0.50 × ToolCalling + 0.30 × HumanEval + 0.20 × MMLU (scale 0-100)
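If you want to sanity-check the arithmetic on your own numbers, here is a minimal sketch of that weighted sum. The function name `model_harness_index` and the sample scores are made up for illustration, and I'm assuming each component score is already on a 0-100 scale; this is not Rapid MLX's actual code or API.

```python
def model_harness_index(tool_calling: float, human_eval: float, mmlu: float) -> float:
    """Weighted sum from the formula above (hypothetical helper, not Rapid MLX's API).

    Assumes each input score is already on a 0-100 scale.
    """
    for name, score in (("tool_calling", tool_calling),
                        ("human_eval", human_eval),
                        ("mmlu", mmlu)):
        if not 0 <= score <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {score}")
    # Weights taken from the formula: 0.50 / 0.30 / 0.20
    return 0.50 * tool_calling + 0.30 * human_eval + 0.20 * mmlu


if __name__ == "__main__":
    # Made-up scores, purely to show the calculation
    print(round(model_harness_index(80, 70, 60), 1))
```

Since tool calling carries half the weight, a model that is strong on MMLU but flaky at tool calls will score noticeably lower than its raw benchmark numbers might suggest.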
Pardon the silly question, but why do I need this tool versus running the model directly (and SSH’ing in when I’m away from home)?