Hacker News

coder543, yesterday at 5:43 PM

MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.
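For a rough sense of scale, here is a minimal sketch of how KV cache size is commonly estimated for a transformer: one K and one V tensor per layer, each of shape (KV heads × head dim × sequence length). The function name `kv_cache_bytes` and every shape below are hypothetical illustration values, not figures from the linked post or from any specific model, and the sketch assumes the draft head would keep its own cache for a single extra layer.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for a standard attention stack.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical shapes, chosen only for illustration.
target = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128, seq_len=32768)
draft = kv_cache_bytes(num_layers=1, num_kv_heads=8, head_dim=128, seq_len=32768)

print(f"target cache: {target / 2**30:.2f} GiB")
print(f"draft cache:  {draft / 2**20:.2f} MiB")
```

Under these assumed shapes the extra cache scales with the number of draft layers, so a single-layer draft head adds about 1/48 of the target model's cache.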


Replies

a_e_k, yesterday at 6:06 PM

From the linked post, it didn't read like a separate KV cache was needed:

> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
