Yes, a separate model is performing ASR in this case. The Gemma270m variants (base, function, and others) are not multimodal out of the box.
That being said, if someone in the community wanted to take another encoder like SigLIP and plug it into Gemma270m to make it multimodal, that'd be a great way to have fun over break and build up an AI Engineer resume :)
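
If you're wondering what "plugging in" an encoder could look like, here's a rough sketch of one common approach: project the encoder's patch embeddings into the language model's embedding size and prepend them to the text token embeddings. Everything here is an assumption for illustration. The function names (`make_projector`, `project`, `build_multimodal_inputs`) are made up, the dimensions are toy values rather than the real SigLIP or Gemma sizes, and the weights are random instead of trained. No actual model is loaded.

```python
import random

def make_projector(enc_dim, lm_dim, seed=0):
    """Hypothetical linear projector: encoder width -> LM embedding width."""
    rng = random.Random(seed)
    # Small random weights; in a real setup these would be trained.
    weight = [[rng.gauss(0.0, 0.02) for _ in range(enc_dim)] for _ in range(lm_dim)]
    bias = [0.0] * lm_dim
    return weight, bias

def project(patch_embeddings, projector):
    """Apply y = W x + b to every patch embedding."""
    weight, bias = projector
    return [
        [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weight, bias)]
        for patch in patch_embeddings
    ]

def build_multimodal_inputs(patch_embeddings, text_embeddings, projector):
    """Prepend projected image patches to the text token embeddings."""
    return project(patch_embeddings, projector) + text_embeddings

# Toy sizes, NOT the real model dimensions.
ENC_DIM, LM_DIM = 8, 4
projector = make_projector(ENC_DIM, LM_DIM)

patches = [[0.1 * i] * ENC_DIM for i in range(3)]   # stand-in for encoder output
text = [[0.0] * LM_DIM for _ in range(5)]           # stand-in for token embeddings

inputs = build_multimodal_inputs(patches, text, projector)
print(len(inputs), len(inputs[0]))  # 8 positions, each of width 4
```

The combined sequence is what you'd feed to the language model in place of plain token embeddings. The actual work is training the projector (and maybe fine-tuning the LM) so the projected patches mean something to it.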