DTMF was designed to coexist with human voice, and the tones were deliberately chosen to be unlikely or impossible for a human voice to trigger. If there is no voice on the line, you don't need DTMF; you could use any number of tones. I wonder if you could use base64 or base58, with 64 or 58 unique tones, and send text at a reasonable rate?
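For a rough feel of what "reasonable rate" might mean, here is a back-of-the-envelope sketch. The tone plan (64 tones starting at 600 Hz, 40 Hz apart) and the 50 ms symbol time are made-up assumptions for illustration, not any existing standard; `tone_for` and `bit_rate` are hypothetical helper names.

```python
import math
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Hypothetical tone plan (an assumption, not a standard):
# 64 tones, 40 Hz apart, starting at 600 Hz, so the top tone is 3120 Hz.
BASE_HZ = 600.0
STEP_HZ = 40.0


def tone_for(symbol: str) -> float:
    """Return the frequency assigned to one base64 character."""
    return BASE_HZ + STEP_HZ * ALPHABET.index(symbol)


def bit_rate(symbol_seconds: float, tones: int = 64) -> float:
    """Raw bits per second when one of `tones` tones is sent per symbol."""
    return math.log2(tones) / symbol_seconds


if __name__ == "__main__":
    print([tone_for(c) for c in "aGk="[:3]])  # padding '=' is not in the tone plan
    print(bit_rate(0.05))                     # 6 bits per tone at an assumed 50 ms per tone
```

Each tone carries log2(64) = 6 bits, so at the assumed 50 ms per tone that is roughly 120 bit/s before any framing or error handling. Whether a real decoder could tell tones 40 Hz apart in that time on a noisy channel is a separate question this sketch does not answer.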
They could have used modem standards. The Bell 103 standard is 300 bit/s using frequency-shift keying (FSK).
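To make the FSK idea concrete, here is a minimal modulator sketch using the Bell 103 originate-side frequencies (1270 Hz for mark/1, 1070 Hz for space/0). The 8000 Hz sample rate and the function name `fsk_modulate` are my own assumptions, and start/stop-bit framing and the answer-side channel are left out.

```python
import math

SAMPLE_RATE = 8000   # assumed sample rate (telephone-quality audio)
BAUD = 300           # Bell 103 signalling rate, one bit per symbol
MARK_HZ = 1270.0     # originate side, binary 1
SPACE_HZ = 1070.0    # originate side, binary 0


def fsk_modulate(bits):
    """Return sine samples in [-1, 1] encoding `bits` as continuous-phase FSK."""
    samples = []
    phase = 0.0
    for i, bit in enumerate(bits):
        freq = MARK_HZ if bit else SPACE_HZ
        # 8000 / 300 is not an integer, so bit boundaries are computed from
        # the bit index to keep the overall timing from drifting.
        start = round(i * SAMPLE_RATE / BAUD)
        end = round((i + 1) * SAMPLE_RATE / BAUD)
        for _ in range(end - start):
            samples.append(math.sin(phase))
            # Carrying the phase across bits avoids clicks at frequency changes.
            phase += 2 * math.pi * freq / SAMPLE_RATE
    return samples


if __name__ == "__main__":
    audio = fsk_modulate([1, 0, 1, 1, 0, 0, 1, 0])
    print(len(audio), "samples for 8 bits")
```

Only two tones are needed here, one bit per symbol, which is why the rate tops out at 300 bit/s.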