Awesome project! I recently ran a (semi-)crowdsourced quality benchmark for models ≤20B.
How do you benchmark them? This would be awesome to implement on the page as well. I will link to this project at https://mlemarena.top/