This t5 is multimodal.
Also a hint: you can create a finetuning dataset from a frontier LLM pretty easily to finetune those t5 and effectively distill them pretty fast these days