Transformers do this for any stream of tokens, those tokens can map to anything you want, and you will get lossy compression. Text produced by humans just happens to be dense, available, and a useful prior, but it is not intrinsically required. See 3D vision transformers for example.