Very cool! I was wondering, is a separate model performing speech recognition for the voice demos such as the game? The FunctionGemma model card only seems to show text input/output.
Yes, a separate model is performing ASR in this case. The Gemma270m models (base, function, and others) are not multimodal out of the box.
That being said, if someone in the community wanted to take other encoders like SigLIP and plug them into Gemma270m to make it multimodal, that'd be a great way to have fun over break and build up an AI Engineer resume :)
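For anyone curious what "plugging in an encoder" could look like, here is a rough sketch of one common approach: project the encoder's patch features into the language model's embedding width and prepend them to the text token embeddings. This is only an illustration under stated assumptions, not how Gemma or SigLIP are actually wired. The dimensions are toy values, the projector is random rather than trained, and `make_projector`, `project`, and `build_input_sequence` are hypothetical helper names.

```python
import random

# Toy sizes, chosen for readability. Real encoder and LM widths are much larger.
ENCODER_DIM = 8   # assumed width of the vision encoder's patch features
LM_DIM = 4        # assumed width of the language model's token embeddings


def make_projector(in_dim, out_dim, seed=0):
    """Random linear projector of shape [in_dim][out_dim]. In practice this is trained."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weights):
    """Matrix product: [n][in_dim] times [in_dim][out_dim] gives [n][out_dim]."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in features]


def build_input_sequence(image_features, text_embeddings, weights):
    """Prepend projected image features to the text embeddings."""
    return project(image_features, weights) + text_embeddings


rng = random.Random(1)
# Stand-ins for encoder output (3 patches) and embedded text (5 tokens).
image_features = [[rng.random() for _ in range(ENCODER_DIM)] for _ in range(3)]
text_embeddings = [[rng.random() for _ in range(LM_DIM)] for _ in range(5)]

weights = make_projector(ENCODER_DIM, LM_DIM)
sequence = build_input_sequence(image_features, text_embeddings, weights)
print(len(sequence), len(sequence[0]))  # 8 4
```

The combined sequence would then go to the language model as input embeddings instead of token IDs, and the projector (and optionally the model) would need training on paired image and text data before any of this is useful.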