In theory, machine vision (OCR, for example) could extract the coordinates overlaid on the frames of the video.
You're unlikely to get a clean reading from every frame automatically, because the background behind the overlay keeps changing, but I'd have thought enough frames would read correctly to make it work.
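
As a rough sketch of the "keep only the good frames" part: suppose some OCR step (not shown here) has already turned each frame's overlay into a text string. The overlay format below (decimal degrees written as `lat, lon`) is purely an assumption, as are the names `parse_overlay` and `keep_good_frames`; adjust the pattern to whatever the video actually shows. Frames whose text doesn't parse, or parses to an impossible position, are simply dropped.

```python
import re

# ASSUMPTION: the overlay shows decimal degrees as "lat, lon", e.g. "51.5074, -0.1278".
# The lookbehind stops a match from starting in the middle of a longer number.
COORD_RE = re.compile(r"(?<![\d.])(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def parse_overlay(text):
    """Return (lat, lon) parsed from one frame's OCR text, or None if unusable."""
    match = COORD_RE.search(text)
    if match is None:
        return None  # OCR output was too garbled to contain a coordinate pair
    lat, lon = float(match.group(1)), float(match.group(2))
    # Reject readings that cannot be real positions (typical of misread digits).
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon


def keep_good_frames(ocr_by_frame):
    """Map {frame_index: ocr_text} to a sorted list of (frame_index, lat, lon)."""
    good = []
    for frame_index, text in sorted(ocr_by_frame.items()):
        parsed = parse_overlay(text)
        if parsed is not None:
            good.append((frame_index, parsed[0], parsed[1]))
    return good
```

The gaps left by the rejected frames could then be filled by interpolating between neighbouring good readings, provided the good frames are spaced closely enough.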