>publish these incredible papers explaining how they achieved their gains - something the American labs no longer do unfortunately.
Google is still releasing a lot of llm architecture research. They introduced speculative decoding of LLMs in 2022[1], then released the code to perform sceculative decoding for their Gemma 4 model this year[2]
[1] https://arxiv.org/abs/2211.17192
[2] https://github.com/google-gemma/cookbook/blob/main/docs/mtp/...
They weren't the first to do MTP like this, and arguably did it wrong: the MTP heads are kept in a separate file and have to be welded in by the inference engine.
Qwen 3.6 shipped with working MTP first, and had working MTP in llama.cpp first.
They also shipped Gemma models with their new Matformer architecture which allows for dynamic computation.
[dead]
Thanks for the clarification - Google does publish more than others - and I actually really appreciate the work they are doing with the Gemma models, which are truly competitive open models. I do wish they’d publish more in depth papers on their Gemma models but appreciate that they are open weights.