It’ll be a major pain in the ass to replicate exactly what they did to make it long-context and multimodal. Sucks too because the smol Gemma 3s with the same parameter count were neither.
> https://huggingface.co/google/t5gemma-2-1b-1b
From here it looks like it's still long-context and multimodal, though?
> Inputs and outputs
>
> Input:
> - Text string, such as a question, a prompt, or a document to be summarized
> - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
> - Total input context of 128K tokens
>
> Output:
> - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
> - Total output context up to 32K tokens
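As a rough sanity check on those numbers: at 256 tokens per image, the 128K input window leaves room for a lot of images. A quick back-of-the-envelope sketch (the `max_images` helper is just mine for illustration, not anything from the model card or transformers):

```python
# Token-budget math from the t5gemma-2-1b-1b model card:
# 128K-token input context, each 896x896 image encoded to 256 tokens.
INPUT_CONTEXT = 128 * 1024   # 131,072 tokens, assuming 128K means 128 * 1024
TOKENS_PER_IMAGE = 256

def max_images(prompt_tokens: int) -> int:
    """How many images fit alongside a text prompt of the given token count."""
    return (INPUT_CONTEXT - prompt_tokens) // TOKENS_PER_IMAGE

print(max_images(0))     # 512 images with no text at all
print(max_images(4096))  # 496 images alongside a 4K-token prompt
```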