I have actually worked on encoder-decoder models. The issue is, finetuning itself is becoming historic. At least for text processing. If you spend a ton of effort today to finetune on a particular task, chances are you would have reached the same performance using a frontier LLM with the right context in the prompt. And if a big model can do it today, in 12 months there will be a super cheap and efficient model that can do it as well. For vision you can still beat them, but only with huge effort the gap is shortening constantly. And T5 is not even multimodal. I don't think these will change the landscape in any meaningful way.
I have actually worked on encoder-decoder models. The issue is, finetuning itself is becoming historic. At least for text processing. If you spend a ton of effort today to finetune on a particular task, chances are you would have reached the same performance using a frontier LLM with the right context in the prompt. And if a big model can do it today, in 12 months there will be a super cheap and efficient model that can do it as well. For vision you can still beat them, but only with huge effort the gap is shortening constantly. And T5 is not even multimodal. I don't think these will change the landscape in any meaningful way.