I think it’s a great idea for a benchmark.
One key difference from ARC in its current iteration is that the game physics here is defined and learnable.
ARC requires generalization from a few examples on problems that are not well defined per se.
Hence ARC currently requires the models that work on it to possess biases comparable to those that humans possess.