What I do is i ask claude or codex to run models on ollama and test them sequentially on a bunch of tasks and rate the outputs. 30 minutes later I have a fit. It even tested the abliterated models.