Yeah, so I actually didn't overthink that part - I knew I needed strong reasoning, so I just grabbed Opus, which is my personal go-to for such tasks, and stuck with it because I wanted to avoid too many moving parts.
It would be interesting to compare both the benchmark results and the way other models would approach the whole refactoring process!