Isn't finetuning the point of T5-style models, since they perform better at smaller parameter counts?
It’ll be a major pain to replicate exactly what they did to make it long-context and multimodal. Sucks too, because the smol Gemma 3s with the same parameter count were neither.