This is a limitation of LLM i/o which historically is a bit slow due to these sequential user v...

sigmoid10 • today at 7:00 AM • 0 replies • view on HN

This is a limitation of LLM i/o which historically is a bit slow due to these sequential user vs assistant chat prompt formats they still train on. But in principle nothing stops you from feeding/retrieving realtime full duplex input/output from a transformer architecture. It will just get slower as you scale to billions or even trillions of parameters, to the point where running it in the cloud might offer faster end-to-end actions than running it locally. What I could imagine is a small local model running everyday tasks and a big remote model tuning in for messy situations where a remote human might have to take over otherwise.

alt Hacker News