Nice work! I tried something similar a while back: https://github.com/kevmo314/tokie
My takeaway was the same: the running cost is dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried measuring the performance of swapping out only the regex engine and leaving the actual BPE to tiktoken? I wonder if that change would be upstreamable.
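For anyone who wants to check the split on their own input, here is a rough sketch of timing the two stages separately. Everything in it is an assumption for illustration: the pattern is a simplified ASCII-only stand-in for a GPT-2-style pretokenizer (not tiktoken's actual pattern), `toy_bpe` is a naive merge loop rather than tiktoken's implementation, and the function names are made up.

```python
import re
import time

# Simplified, ASCII-only stand-in for a GPT-2-style pretokenization pattern.
# Assumption: the real patterns use Unicode classes and are more involved.
PRETOKENIZE = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)


def pretokenize(text):
    """Split text into pieces with the regex; BPE then runs per piece."""
    return PRETOKENIZE.findall(text)


def toy_bpe(piece, ranks):
    """Naive BPE: repeatedly merge the adjacent pair with the lowest rank."""
    parts = list(piece)
    while len(parts) > 1:
        best = None  # (rank, index) of the best mergeable pair so far
        for i in range(len(parts) - 1):
            rank = ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no known merges left
        i = best[1]
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


def time_stages(text, ranks):
    """Return wall-clock seconds spent in each stage for one pass over text."""
    t0 = time.perf_counter()
    pieces = pretokenize(text)
    t1 = time.perf_counter()
    tokens = [tok for piece in pieces for tok in toy_bpe(piece, ranks)]
    t2 = time.perf_counter()
    return {
        "pretokenize_s": t1 - t0,
        "bpe_s": t2 - t1,
        "pieces": len(pieces),
        "tokens": len(tokens),
    }
```

Calling `time_stages(open("corpus.txt").read(), ranks)` with a hypothetical corpus file and merge table gives a per-stage breakdown. The numbers from a toy like this say nothing about tiktoken itself; the point is only that the two stages can be measured independently, so a regex-engine swap can be evaluated without touching the BPE side.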
Cool!
I've reached out to the guy who maintains tiktoken to talk about this.