Transformers operate on images and a variety of sensor data, and they can work entirely on non-textual inputs and outputs. I don't know what the ceiling on their capabilities is, but the complaint that they only operate on text seems plainly wrong. There are numerous examples; one is meteorological forecasting, where the model takes in a variety of time-series sensor inputs and outputs, e.g., time series of temperature maps: https://www.nature.com/articles/s41598-025-07897-4
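
To make the point concrete, here is a minimal sketch of why nothing in the core mechanism is tied to text: attention only ever sees a sequence of numeric vectors. This is not the method from the linked paper. The grid values, the `patchify` and `self_attention` helpers, and the use of identity Q/K/V projections (a real model learns these) are all assumptions made for illustration.

```python
import math

def patchify(grid, patch):
    """Split a 2D grid (list of rows) into flattened patch x patch vectors.

    Assumes both grid dimensions are divisible by `patch`.
    """
    rows, cols = len(grid), len(grid[0])
    tokens = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            tokens.append(
                [grid[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return tokens

def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Simplification: Q, K and V are the tokens themselves (identity
    projections), so there are no learned weights here.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

# Hypothetical 4x4 grid of normalized temperature readings (made-up values).
grid = [
    [0.1, 0.2, 0.3, 0.4],
    [0.2, 0.3, 0.4, 0.5],
    [0.3, 0.4, 0.5, 0.6],
    [0.4, 0.5, 0.6, 0.7],
]
tokens = patchify(grid, 2)      # four "tokens", each a 4-number patch
mixed = self_attention(tokens)  # same shape out: numbers in, numbers out
```

The "tokens" here are patches of a sensor grid, and the output is again a set of numeric vectors that could be reassembled into a map. No vocabulary or text is involved at any step.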