Vision and audio are already in use in multimodal LLMs. So it has already been possible for a while.

bonoboTP • today at 9:05 AM