You need the regular gemma model as well. You can think of this as a really small distillation of the original. Useless by its own because it often is wrong, but it is fifth more than not. And because verifying a transformer model can be done faster than running it. We can effectively speed up by using this draft model and only doing the compute where it was wrong.
This is a oversimplification, but tldr you need both yes.
Thank you!
I already played with Gemma4 on oMLX a while ago. When I have some time I'll check if it supports running MTP models and play a bit more