Amazing work! I have been working on robotifying operational tasks for my company: a robot hand and a vision system that can complete a task on a monitor just like humans do. I have been toying with the OpenAI vision model to get the mouse coordinates, but it’s slow and does not always return the correct coordinates (probably because the LLM does not understand geometry).
Anyhow, looking forward to trying your approach with MediaPipe. Thanks for the write-up and demo, inspirational.