Thinking along the lines of speed, I wonder if a model that could reason and use tools at 60fps would be able to control a robot directly and perform skilled physical work that's currently out of reach for text-only LLM output. It also helps that the Gemini series is really good at multimodal processing of images and audio. Maybe they could encode other sensory inputs in a similar way.
Pipe dream right now, but 50 years from now? Maybe.
Much sooner. Hardware, power, software, AI model design, inference hardware, caching: everything is improving, and the progress is exponential.
Believe it or not, there's Gemini Robotics, which seems to be exactly what you're talking about:
https://deepmind.google/models/gemini-robotics/
Previous discussion: https://news.ycombinator.com/item?id=43344082