Can you also compare the performance with https://github.com/huggingface/tokenizers/? Would be helpful, since the benchmark in the tiktoken readme seems to be very outdated.
Anecdotally I've always found tiktoken to be far slower than huggingface tokenizers. I'm not sure why, as I haven't dug into tiktoken, but I'm a heavy user of HF's rust tokenizers
Anecdotally I've always found tiktoken to be far slower than huggingface tokenizers. I'm not sure why, as I haven't dug into tiktoken, but I'm a heavy user of HF's rust tokenizers