Hacker News

a_e_k · yesterday at 6:06 PM

From the linked post, it didn't read like a separate KV cache was needed:

> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
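A toy sketch of why sharing one KV cache saves work: once the target model has computed the key/value entries for the context, a draft model reading the same cache pays nothing to revisit those positions. Everything here is hypothetical (placeholder "models" and a dict-backed cache), not Google's actual implementation.

```python
class KVCache:
    """Per-position key/value entries, computed at most once."""
    def __init__(self):
        self.entries = {}   # position -> cached "activation" placeholder
        self.computes = 0   # how many positions were actually computed

    def get(self, pos, compute):
        if pos not in self.entries:
            self.entries[pos] = compute(pos)
            self.computes += 1
        return self.entries[pos]

def target_forward(tokens, cache):
    # Target model fills the shared cache for every context position.
    for i in range(len(tokens)):
        cache.get(i, lambda p: ("kv", tokens[p]))
    return tokens[-1] + 1   # toy "next token"

def draft_forward(tokens, cache):
    # Draft model reads the SAME cache: positions the target already
    # filled are cache hits, so no context is recalculated.
    for i in range(len(tokens)):
        cache.get(i, lambda p: ("kv", tokens[p]))
    return tokens[-1] + 1

cache = KVCache()
prompt = [1, 2, 3]
target_forward(prompt, cache)
before = cache.computes   # target computed the 3 context positions once
draft_forward(prompt, cache)
after = cache.computes    # unchanged: the draft reused the shared cache
```

With a separate cache, the draft would have paid `computes` again for the whole context; here the second pass is all cache hits.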


Replies

coder543 · yesterday at 6:09 PM

That's great news. That has not been the case with other MTP (multi-token prediction) implementations like Qwen3.5, but I see the section in the article saying Google introduced some architectural optimizations to make this possible.