But I've read somewhere that the KV cache for speech-to-speech models explodes in size with each turn, which could make on-device full-duplex S2S unusable for anything but quick chats.
Gemini Nano is supposedly doing it on-device. It looks like something similar should work on the Apple GPU and ANE.
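For a rough sense of the KV cache concern, here is a back-of-the-envelope sketch. Every number in it is an assumption picked for illustration (layer count, KV heads, head dimension, fp16 values, and a combined 25 tokens per second for the two audio streams), not the config of Gemini Nano or any real model. It also assumes plain attention with no cache eviction or sliding window, in which case the cache grows linearly with the total number of tokens kept in context.

```python
# Rough KV cache estimate for a hypothetical full-duplex S2S model.
# All constants below are illustrative assumptions, not real model specs.

LAYERS = 24             # assumed number of transformer layers
KV_HEADS = 8            # assumed number of key/value heads (GQA-style)
HEAD_DIM = 128          # assumed dimension per head
BYTES_PER_VALUE = 2     # assumed fp16/bf16 cache entries
TOKENS_PER_SECOND = 25  # assumed combined token rate of both audio streams


def kv_cache_bytes(tokens, layers=LAYERS, kv_heads=KV_HEADS,
                   head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor of 2: one tensor for keys and one for values per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def cache_mib_after(seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Cache size in MiB after `seconds` of continuous conversation."""
    tokens = int(seconds * tokens_per_second)
    return kv_cache_bytes(tokens) / (1024 ** 2)


if __name__ == "__main__":
    for minutes in (1, 5, 10, 30):
        print(f"{minutes:>2} min: {cache_mib_after(minutes * 60):8.1f} MiB")
```

Under these assumptions the cache costs 96 KiB per token, so the total depends mostly on how long the session runs and on how many tokens per second the model consumes. Quantizing the cache or capping the context window would change the picture considerably.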