Hacker News

kevmo314 yesterday at 2:56 PM

Nice work! I tried something similar a while back: https://github.com/kevmo314/tokie

My takeaway was the same: the running cost is dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried measuring the performance of just swapping out the regex engine and leaving the actual BPE to tiktoken? I wonder if that change would be upstreamable.
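
For anyone unfamiliar with the two stages being compared, here is a minimal sketch of how one might time them separately. Everything in it is an assumption for illustration: the pattern is a simplified ASCII stand-in (real tokenizer patterns use Unicode classes such as `\p{L}`, which the stdlib `re` module does not support), `bpe_merge` is a toy greedy merge rather than tiktoken's implementation, and the names `pretokenize`, `bpe_merge`, and `time_stages` are hypothetical. It shows where the split between the stages falls; it does not reproduce any benchmark result.

```python
import re
import time

# Simplified stand-in for a GPT-2-style pretokenization pattern (assumption:
# real patterns use Unicode classes like \p{L}, unavailable in stdlib `re`).
PRETOKENIZE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"
)


def pretokenize(text):
    """Stage 1: split text into pieces; BPE merges never cross a piece boundary."""
    return PRETOKENIZE.findall(text)


def bpe_merge(piece, ranks):
    """Stage 2 (toy): repeatedly merge the lowest-ranked adjacent pair in one piece."""
    parts = list(piece)
    while len(parts) > 1:
        # Rank every adjacent pair; pairs missing from the table rank as infinity.
        rank, i = min(
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(parts, parts[1:]))
        )
        if rank == float("inf"):
            break  # no mergeable pair left
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


def time_stages(text, ranks):
    """Return (tokens, seconds in pretokenization, seconds in BPE merging)."""
    t0 = time.perf_counter()
    pieces = pretokenize(text)
    t1 = time.perf_counter()
    tokens = [tok for piece in pieces for tok in bpe_merge(piece, ranks)]
    t2 = time.perf_counter()
    return tokens, t1 - t0, t2 - t1


# Hypothetical merge table: lower rank means the pair is merged earlier.
ranks = {("l", "l"): 0, ("h", "e"): 1, ("he", "ll"): 2}
```

Calling `time_stages(text, ranks)` on a large input gives the two durations side by side. Swapping only the regex engine would change the first number and leave the second untouched, which is the comparison asked about above.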


Replies

matthewolfe yesterday at 3:12 PM

Cool!

I've reached out to the guy who maintains tiktoken to talk about this.