I'm working on TokenDagger [0], a high-performance implementation of OpenAI's Tiktoken. My benchmarks show 2-3x higher throughput than Tiktoken, and ~4x faster tokenization of code samples on a single thread.
[0] https://github.com/M4THYOU/TokenDagger
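
For anyone who wants to sanity-check numbers like these on their own data, here is a minimal sketch of a single-thread throughput comparison. It is not the benchmark harness from the repo. The helper names (`throughput_mb_per_s`, `speedup`) and the `toy_encode` stand-in are hypothetical and exist only for illustration. In practice you would pass in the real encode callables, for example the `encode` method of a tiktoken `Encoding` as the baseline.

```python
import time
from typing import Callable, Sequence


def throughput_mb_per_s(
    encode: Callable[[str], Sequence[int]],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Best-of-N single-thread throughput, in MB of UTF-8 input per second."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for t in texts:
            encode(t)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very small inputs.
    return total_bytes / max(best, 1e-9) / 1e6


def speedup(
    candidate: Callable[[str], Sequence[int]],
    baseline: Callable[[str], Sequence[int]],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Ratio of candidate throughput to baseline throughput on the same texts."""
    return (
        throughput_mb_per_s(candidate, texts, repeats)
        / throughput_mb_per_s(baseline, texts, repeats)
    )


def toy_encode(text: str) -> list[int]:
    """Hypothetical stand-in tokenizer: one 'token' per byte. Not a real BPE."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    samples = ["def add(a, b):\n    return a + b\n"] * 1000
    print(f"{throughput_mb_per_s(toy_encode, samples):.1f} MB/s")
    # Replace both arguments with real encoders to get a meaningful ratio.
    print(f"speedup: {speedup(toy_encode, toy_encode, samples):.2f}x")
```

Taking the best of several runs reduces noise from the OS scheduler. Measuring bytes per second instead of tokens per second keeps the comparison fair when two tokenizers are expected to produce identical output anyway.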