coder543 • yesterday at 5:43 PM

MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.

Replies

a_e_k • yesterday at 6:06 PM

From the linked post, it didn't read like a separate KV cache was needed:

> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
