Hacker News

red2awn · last Thursday at 9:21 AM · 1 reply

Correct: it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for low-latency responses you have to do streaming KV cache prefill behind a websocket server.
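A minimal sketch of the idea, with toy stand-ins for the model and the socket (`streaming_prefill`, `fake_socket`, and the string-based "cache" are all illustrative, not a real inference API): each chunk of prompt tokens is run through prefill as soon as it arrives, so only the final chunk's forward pass sits on the critical path before decoding can start.

```python
import asyncio

async def streaming_prefill(token_stream):
    """Incrementally extend the KV cache as prompt chunks stream in."""
    kv_cache = []  # stand-in for per-layer key/value tensors
    async for chunk in token_stream:
        # Forward-pass only the newly arrived tokens while the rest of
        # the prompt is still in flight over the socket.
        kv_cache.extend(f"kv({tok})" for tok in chunk)
    return kv_cache

async def fake_socket():
    """Stand-in for websocket messages carrying partial prompts."""
    for chunk in (["Hello", ","], ["world"], ["!"]):
        yield chunk
        await asyncio.sleep(0)  # simulate gaps between network messages

cache = asyncio.run(streaming_prefill(fake_socket()))
print(len(cache))  # one cache entry per prompt token
```

With a batch (non-streaming) framework, all four tokens would be prefilled only after the full prompt arrived, adding the whole prefill latency to time-to-first-token.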


Replies

whimsicalism · last Thursday at 5:37 PM

I imagine you have to start decoding many speculative completions in parallel to get truly low latency.
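One common shape of this idea is draft/verify speculative decoding: a cheap draft model proposes several tokens ahead, and the expensive model checks them in a single batched step, keeping the longest accepted prefix. A toy sketch, where `draft_propose` and `target_next` are hypothetical stand-in models (deliberately disagreeing after two tokens), not a real API:

```python
def draft_propose(prefix, k):
    # Hypothetical cheap draft model: guesses k tokens ahead, but its
    # rule diverges from the target after two steps.
    good = [(prefix + 1) % 7, (prefix + 2) % 7]
    return (good + [0] * k)[:k]

def target_next(prefix):
    # Hypothetical expensive target model (one "real" forward pass).
    return (prefix + 1) % 7

def speculative_step(prefix, k=4):
    """Verify k drafted tokens; accept the matching prefix plus one fix-up."""
    accepted, cur = [], prefix
    for tok in draft_propose(prefix, k):
        if target_next(cur) == tok:
            accepted.append(tok)   # draft agreed with the target
            cur = tok
        else:
            accepted.append(target_next(cur))  # first mismatch: take the
            break                              # target's token and stop
    return accepted

print(speculative_step(0))  # several tokens emitted per verify step
```

The latency win is that one verify step can emit several tokens, so wall-clock time per token drops whenever the draft's guesses are accepted.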