kevmo314 • yesterday at 2:56 PM • 1 reply

Nice work! I tried something similar a while back: https://github.com/kevmo314/tokie

My takeaway was the same: the running cost is really dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried comparing the performance of just swapping out the regex engine and leaving the actual BPE to tiktoken? I wonder if that would be upstreamable.

Replies

matthewolfe • yesterday at 3:12 PM

Cool!

I've reached out to the guy who maintains Tiktoken to talk about this.
