
whimsicalism, last Thursday at 2:23 AM

Makes sense, I think streaming audio->audio inference is a relatively big lift.


Replies

red2awn, last Thursday at 9:21 AM

Correct, it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for a low-latency response you have to do streaming KV cache prefill behind a websocket server.
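
A rough sketch of that pattern, not any particular framework's API: audio chunks stream in over a websocket and are prefilled into the session's KV cache as they arrive, so decoding can start the moment the turn ends. Everything here is a placeholder (`encode_audio_chunk`, `StreamingSession`, the port); a real server would run actual model forward passes instead of these stand-ins.

```python
import asyncio
import json

import websockets  # pip install websockets


def encode_audio_chunk(chunk: bytes) -> list[int]:
    """Stand-in for an audio tokenizer: map raw bytes to discrete audio tokens."""
    return list(chunk[:16])  # placeholder, not a real codec


class StreamingSession:
    """Holds the growing KV cache for one connection (one 'prompt in progress')."""

    def __init__(self):
        self.kv_cache = []  # stand-in for the model's past key/value tensors

    def prefill(self, tokens: list[int]) -> None:
        # A real server would run a forward pass over just the new tokens,
        # appending their keys/values to the cache instead of re-encoding
        # the whole prompt on every chunk.
        self.kv_cache.extend(tokens)

    def generate(self) -> bytes:
        # Decode from the already-prefilled cache; here we just echo a summary.
        return f"generated from {len(self.kv_cache)} cached tokens".encode()


async def handler(websocket):  # single-arg handler (recent websockets versions)
    session = StreamingSession()
    async for message in websocket:
        if isinstance(message, bytes):
            # Audio arrives in small binary frames; prefill the KV cache as
            # they stream in so generation starts as soon as the turn ends.
            session.prefill(encode_audio_chunk(message))
        elif json.loads(message).get("event") == "end_of_turn":
            await websocket.send(session.generate())


async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())
```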
