yeahhhh why isn't there a training structure where you play 5000 games and the reward function is based on doing well across all of them?
I guess it's a totally different level of control: instead of immediately choosing a certain button to press, you need to set longer-term goals, like "press whatever sequence I need to over this time to end up closer to this result"
There is some kind of nested, multidimensional thing to train on here (pick a goal, then pick the button presses that get you there) instead of just immediate, limited choices
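
Here's a rough Python sketch of what that nested setup could look like. Everything in it is made up for illustration: the 1-D line "game" and the names `high_level_subgoal`, `low_level_action`, `play_game`, and `mean_score` are all hypothetical, and nothing is actually learned here, since both levels are hand-written rules. It only shows the shape of the idea: one layer sets a longer-term goal every few steps, another layer picks the immediate button press, and the score is computed over the whole batch of games rather than per move.

```python
# Toy illustration only: the game, names, and rules are assumptions, not a real library.

def low_level_action(position, subgoal):
    """Pick the immediate 'button press' (-1, 0, or +1) that moves toward the subgoal."""
    if position < subgoal:
        return 1
    if position > subgoal:
        return -1
    return 0


def high_level_subgoal(position, target, max_jump=3):
    """Set a longer-term goal: a point at most max_jump away, in the target's direction."""
    delta = max(-max_jump, min(max_jump, target - position))
    return position + delta


def play_game(target, horizon=12, replan_every=3):
    """Play one game on a 1-D line; return a score (0 is best, more negative is worse)."""
    position = 0
    subgoal = position
    for step in range(horizon):
        # The high level only acts every few steps; the low level acts every step.
        if step % replan_every == 0:
            subgoal = high_level_subgoal(position, target)
        position += low_level_action(position, subgoal)
    return -abs(target - position)


def mean_score(targets):
    """Reward defined over the whole batch of games, not over a single move."""
    scores = [play_game(t) for t in targets]
    return sum(scores) / len(scores)


# 5000 games, each with a target somewhere in [-10, 10].
targets = [(i % 21) - 10 for i in range(5000)]
print(mean_score(targets))
```

In a real training setup, the hand-written rules in `high_level_subgoal` and `low_level_action` would presumably be replaced by learned policies, with something like `mean_score` as the signal they are tuned against.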