They are comparing 1B Gemma to 1+1B T5Gemma 2. Obviously a model with twice as many parameters can do better. That says nothing about the benefits of the architecture itself.
> 128k context.
Don't care. Prove effective context length or gtfo.
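For what "prove it" would even mean: a minimal needle-in-a-haystack probe, sketched below. `generate_fn` is a hypothetical stand-in for whatever inference call you actually have; none of this comes from the T5Gemma 2 release.

```python
# Minimal needle-in-a-haystack probe: plant one fact at a chosen depth inside
# filler text, ask for it back, and see at what length retrieval falls apart.
NEEDLE = "The secret access code is 7412."
QUESTION = "What is the secret access code?"
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_prompt(total_words: int, depth: float) -> str:
    """Bury NEEDLE at `depth` (0.0 = start, 1.0 = end) of ~total_words of filler."""
    words = (FILLER * (total_words // 9 + 1)).split()[:total_words]
    words.insert(int(len(words) * depth), NEEDLE)
    return " ".join(words) + f"\n\nQuestion: {QUESTION}\nAnswer:"

def run_probe(generate_fn, lengths=(1_000, 8_000, 64_000), depths=(0.1, 0.5, 0.9)):
    # generate_fn: prompt -> completion string (hypothetical; plug in your own model).
    for n in lengths:
        hits = sum("7412" in generate_fn(build_prompt(n, d)) for d in depths)
        print(f"{n:>7} words: {hits}/{len(depths)} needles retrieved")

# Stub so this runs end to end: a fake "model" that only sees the last 500 characters.
run_probe(lambda prompt: "7412" if "7412" in prompt[-500:] else "no idea")
```

Sweep lengths and depths against the real model and the point where hits drop off is your effective context length, whatever the advertised number says.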
What is an encoder-decoder model? Is it some kind of LLM, or a subcomponent of an LLM?
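For anyone else wondering: it's a full LLM, just wired differently from a decoder-only model. A minimal sketch with the Hugging Face transformers library, using t5-small and gpt2 as stand-ins (not assuming the actual T5Gemma 2 checkpoint IDs):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM

# Encoder-decoder (T5-style): the encoder reads the whole input bidirectionally,
# then a separate decoder generates the output while cross-attending to it.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
inputs = t5_tok("translate English to German: The house is small.", return_tensors="pt")
out = t5.generate(**inputs, max_new_tokens=20)
print(t5_tok.decode(out[0], skip_special_tokens=True))

# Decoder-only (Gemma-style): a single stack that both reads the prompt and
# continues it left-to-right; there is no separate encoder.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = gpt_tok("The house is", return_tensors="pt")
out = gpt.generate(**inputs, max_new_tokens=10)
print(gpt_tok.decode(out[0], skip_special_tokens=True))
```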
What is the "X" in the pentagonal performance comparison? Is it multilingual performance, or something else?
What's the use case of models like T5 compared to decoder-only models like Gemma? More traditional ML/NLP tasks?
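One concrete example of the "traditional NLP" angle: the encoder half doubles as a bidirectional text encoder, so you get embeddings for classification or retrieval without the generation machinery. Sketch below, again with t5-small standing in for any T5-style checkpoint:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")  # encoder stack only

texts = ["the movie was great", "the movie was terrible"]
batch = tok(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state        # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one embedding per text.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                                 # torch.Size([2, 512]) for t5-small
```

A decoder-only model can be coaxed into producing embeddings too, but the bidirectional encoder gives them to you for free, which is a big part of why T5-style models stuck around for classic NLP pipelines.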
> Note: we are not releasing any post-trained / IT checkpoints.
I get not wanting to cannibalize Gemma, but that's weird. A 540M multimodal model that performs well on queries would be useful, and "just post-train it yourself" is not always an option.
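For a sense of what "post-train it yourself" entails at minimum, here is a toy supervised fine-tuning loop on an encoder-decoder checkpoint (t5-small as a stand-in, toy data). A real instruction-tune on top of this needs curated data, preference tuning, safety work, and evals, which is exactly why it isn't always an option:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A single toy instruction-style pair; a real run needs thousands of examples.
prompt = "Answer the question: What is the capital of France?"
target = "Paris"
batch = tok(prompt, return_tensors="pt")
labels = tok(target, return_tensors="pt").input_ids

model.train()
for step in range(3):                        # toy loop, not a real schedule
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {loss.item():.3f}")
```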