Good question! We use exponential binning (mapping the mouse movements onto a plane with exponentially increasing tick marks: https://si.inc/fdm1/exponential_binning.webp), but we tried a bunch of other methods too. Linear binning creates too many tokens for the model to learn well. Polar coordinates seem like they should be a better solution, but empirically they didn't work well because the tokens got too coarse too fast.
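For anyone who wants to play with the idea, here is a rough sketch of what per-axis exponential binning could look like. This is not the actual implementation: the base of 2, the 8 ticks per side, and the helper names (`make_edges`, `tokenize_delta`, `detokenize`, `encode_move`) are all assumptions made up for illustration. It snaps each axis delta to the nearest tick, so small movements keep fine resolution while large ones share a handful of coarse tokens.

```python
import bisect

def make_edges(ticks_per_side=8, base=2.0):
    """Hypothetical tick marks: 0 plus +/- base**i, so spacing grows with distance."""
    pos = [base ** i for i in range(ticks_per_side)]  # 1, 2, 4, ..., 128
    return [-e for e in reversed(pos)] + [0.0] + pos

def tokenize_delta(delta, edges):
    """Map one axis delta (in pixels) to the index of the nearest tick."""
    i = bisect.bisect_left(edges, delta)
    if i == 0:
        return 0                      # clamp below the smallest tick
    if i == len(edges):
        return len(edges) - 1         # clamp above the largest tick
    before, after = edges[i - 1], edges[i]
    # Ties go to the lower tick.
    return i if (after - delta) < (delta - before) else i - 1

def detokenize(token, edges):
    """Recover the representative delta for a token."""
    return edges[token]

def encode_move(dx, dy, edges):
    """Encode a 2D mouse movement as one token per axis."""
    return tokenize_delta(dx, edges), tokenize_delta(dy, edges)

edges = make_edges()
tx, ty = encode_move(3.4, -100, edges)
print(detokenize(tx, edges), detokenize(ty, edges))
```

With these assumed settings there are only 17 tokens per axis covering roughly ±128 px, whereas a linear scheme at 1 px resolution would need 257 for the same range.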
It’s interesting that you invest in mouse movements vs just targeting a click at X in Y milliseconds. CAD and video games are of course a great reason for this, but I wonder how much typical tool use can be modeled by just next click events.
I’d love to see this sort of thing paired with eye tracking and turned into a general purpose precog predictive tool for computer use … but you probably have many better use cases for your world model!