Agreed completely. There is a ton of research into how to represent text, and these simple tokenizers consistently perform at SOTA levels. The bitter lesson is that you should not worry about it that much.