Thanks for the reply! That is a really difficult problem.
As you said, if Bebop refuses to go home, the model has to remember the previous state, which makes the problem much harder. Usually this kind of thing is modeled as a Markov reward process, with states and transition probabilities.
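To make that a bit more concrete, here is a minimal sketch, assuming the memory of a refusal is folded into the state itself (a hypothetical "refused" state), so that the next step again depends only on the current state. The state names, transition probabilities, rewards, and the discount `gamma` are all made up for illustration. The value of each state is found by iterating V(s) = R(s) + gamma * sum over s' of P(s'|s) V(s').

```python
# Hypothetical states: "refused" encodes the memory that Bebop refused last time,
# so the process is Markov again once that memory is part of the state.
STATES = ["outside", "refused", "home"]

# Illustrative transition probabilities P[s][s']; each row sums to 1.
P = {
    "outside": {"outside": 0.2, "refused": 0.3, "home": 0.5},
    "refused": {"outside": 0.1, "refused": 0.6, "home": 0.3},
    "home": {"home": 1.0},  # absorbing state: once home, stay home
}

# Illustrative per-step reward for being in each state.
R = {"outside": -1.0, "refused": -2.0, "home": 0.0}


def evaluate(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Iteratively solve V(s) = R(s) + gamma * sum_{s'} P(s'|s) * V(s')."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iter):
        new_V = {
            s: R[s] + gamma * sum(p * V[t] for t, p in P[s].items())
            for s in P
        }
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:  # stop once values no longer change meaningfully
            break
    return V


if __name__ == "__main__":
    V = evaluate(P, R)
    for s in STATES:
        print(f"{s}: {V[s]:.3f}")
```

With these made-up numbers, the "refused" state ends up with a lower value than "outside", which is one way to express that a refusal makes getting home costlier.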
It is a fun problem, though. I really enjoy discussions like this because they always give me something worth thinking about.