A text based interface is perfect for interacting with a large language model, and it seems unsurprising to me that it's the most popular way to work with them.
Frankly, the idea of having to decipher what a picture is supposed to represent to use a skill fills me with horror.