Hacker News

Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken

254 points by matthewolfe yesterday at 12:33 PM | 70 comments

TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.
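For illustration, a drop-in swap could look like the sketch below; the `tokendagger` module name and the tiktoken-style `get_encoding` interface are assumptions here, not details taken from the repo.

```python
# Hypothetical usage sketch: swap the import, keep the tiktoken-style API.
# import tiktoken                  # original
import tokendagger as tiktoken     # assumed drop-in module name

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("TokenDagger keeps the same BPE vocab and special-token rules.")
print(len(tokens), tokens[:8])
```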

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine, and b) simplifying the algorithm to forgo regex matching of special tokens entirely.
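As a rough illustration of point b), special tokens can be found with a plain longest-prefix scan instead of a regex alternation; the sketch below shows the general idea only and is not TokenDagger's actual algorithm.

```python
# Illustrative only: split text on special tokens by direct string scanning,
# taking the longest match at each position, with no regex involved.
SPECIAL = ["<|endoftext|>", "<|end|>"]  # toy set, not a real vocab
BY_LENGTH = sorted(SPECIAL, key=len, reverse=True)

def split_on_special(text: str):
    """Return a list of (is_special, chunk) pieces."""
    out, i, start = [], 0, 0
    while i < len(text):
        hit = next((s for s in BY_LENGTH if text.startswith(s, i)), None)
        if hit is None:
            i += 1
            continue
        if start < i:
            out.append((False, text[start:i]))
        out.append((True, hit))
        i += len(hit)
        start = i
    if start < len(text):
        out.append((False, text[start:]))
    return out

print(split_on_special("hello <|endoftext|> world"))
# [(False, 'hello '), (True, '<|endoftext|>'), (False, ' world')]
```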

Benchmarking code is included. Notable results:

- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.


Comments

npalli yesterday at 1:13 PM

Kudos, I think (in the short term at least) there is a large amount of perf. optimization to be found by coding parts of the whole AI/ML infrastructure in C++ like this one, not as a rewrite (god no!) but drop in and fix key bottlenecks. Anytime I see someone (seems Chinese engineers are good at this) put something out in C++, good chance some solid engineering tradeoffs have been made and dramatic improvement will be seen.

chrismustcode yesterday at 1:06 PM

There’s something beautiful about creating a drop in replacement for something that improves performance substantially.

ScyllaDB comes to mind

pama yesterday at 2:08 PM

Cool. Would it be possible to eliminate that little vocab format conversion requirement for the vocab I see in the test against tiktoken? It would be nice to have a fully compatible drop in replacement without having to think about details. It also would be nice to have examples that work the other way around: initialize tiktoken as you normally would, including any specialized extension of standard tokenizers, and then use that initialized tokenizer to initialize a new tokendagger and test identity of results.
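Something like the sketch below is presumably what is meant; the `tokendagger.from_tiktoken` constructor is hypothetical, and only the tiktoken side is real API.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Hypothetical: build a TokenDagger tokenizer directly from the initialized
# tiktoken Encoding (constructor name and signature are assumptions):
# import tokendagger
# dagger = tokendagger.from_tiktoken(enc)

samples = ["hello world", "naïve café", "def f(x):\n    return x * 2", "<|endoftext|>"]
for s in samples:
    expected = enc.encode(s, allowed_special="all")
    # actual = dagger.encode(s, allowed_special="all")
    # assert actual == expected, (s, actual, expected)
    print(repr(s), expected[:8])
```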

superlopuh yesterday at 5:27 PM

Can someone familiar with performance of LLMs please tell me how important this is to the overall perf? I'm interested in looking into optimizing tokenizers, and have not yet run the measurements. I would have assumed that the cost is generally dominated by matmuls but am encouraged by the reception of this post in the comments.

p0 yesterday at 1:59 PM

How does this compare to the BPE crate [1]? Its main selling point is support for incrementally re-tokenising text, but it's also faster than tiktoken.

[1] https://crates.io/crates/bpe

frabcus yesterday at 2:04 PM

Is there any way we can get local tokenizers for other LLMs? E.g. Gemini only offers a remote API for its tokenizer. Is it proprietary? Could we infer the token mapping somehow efficiently by making lots of calls?

kevmo314 yesterday at 2:56 PM

Nice work! I tried something similar a while back: https://github.com/kevmo314/tokie

The takeaway I also found was that the running cost was really dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried comparing the performance of just swapping out the regex engine and leaving the actual BPE to tiktoken? I wonder if that is upstreamable?
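One way to see how much the pretokenization regex alone costs, without touching tiktoken internals, is to time a comparable split pattern directly. The sketch below uses the well-known GPT-2 pretokenization pattern as a stand-in and assumes a local sample.txt.

```python
import time
import regex  # third-party `regex` module, needed for \p{...} classes

# GPT-2 style pretokenization pattern, used here only as a stand-in.
PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

text = open("sample.txt", encoding="utf-8").read()  # assumed test file

t0 = time.perf_counter()
pieces = PAT.findall(text)
dt = time.perf_counter() - t0
print(f"pretokenization only: {len(pieces)} pieces in {dt:.3f}s")
# Compare against the time for a full encode() on the same text to estimate
# what fraction of the budget the regex split accounts for.
```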

Tiberium yesterday at 5:11 PM

Can you also compare the performance with https://github.com/huggingface/tokenizers/? Would be helpful, since the benchmark in the tiktoken readme seems to be very outdated.
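A rough sketch of what such a comparison could look like with the `tokenizers` Python bindings; the model name and input file are placeholders, not a claim about which encoding is equivalent.

```python
import time
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")  # placeholder model
text = open("sample.txt", encoding="utf-8").read()  # assumed test file

t0 = time.perf_counter()
ids = tok.encode(text).ids
dt = time.perf_counter() - t0
print(f"huggingface/tokenizers: {len(ids)} tokens in {dt:.3f}s")
```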

fkyoureadthedoc yesterday at 1:56 PM

Would be cool to see WASM bindings for this here: https://github.com/dqbd/tiktoken

Or maybe even your speedups from "b" in the pure JS implementation.

pamelafox yesterday at 2:41 PM

Just curious whether it's possible to push any of your performance improvements to tiktoken itself?

isjustintime yesterday at 8:45 PM

Very cool. We use Tiktoken and I'd love to see the performance impact. Pretty great decision to make it drop-in compatible.

b0a04gl yesterday at 2:44 PM

If dagger builds a byte-level DFA for special tokens and resolves overlaps via longest match, how does it handle inputs with partial matches at chunk boundaries? Say a stream ends mid-token, like <|endo : does it buffer forward or require lookahead?
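Not from the repo, but the usual streaming trick is to hold back the longest suffix of a chunk that is still a prefix of some special token and prepend it to the next chunk; a minimal sketch:

```python
SPECIAL = ["<|endoftext|>", "<|end|>"]  # toy set
MAX_LEN = max(map(len, SPECIAL))

def dangling_prefix_len(data: str) -> int:
    """Length of the longest suffix of `data` that could still grow into
    a special token (e.g. '<|endo' for '<|endoftext|>')."""
    for k in range(min(len(data), MAX_LEN - 1), 0, -1):
        tail = data[-k:]
        if any(s.startswith(tail) for s in SPECIAL):
            return k
    return 0

buffer = ""
for chunk in ["hello <|endo", "ftext|> world"]:
    data = buffer + chunk
    hold = dangling_prefix_len(data)
    ready, buffer = data[: len(data) - hold], data[len(data) - hold :]
    print("process:", repr(ready), "| hold:", repr(buffer))
# At end-of-stream, run the normal matcher over whatever is left in `buffer`.
```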

semiinfinitely yesterday at 10:04 PM

I'm relieved to see that it's not written in Rust.

konsalexee yesterday at 1:18 PM

> simplifying the algorithm to forgo regex matching of special tokens entirely

Does that mean there could be cases where tokenization quality suffers?

matrix2596 yesterday at 4:25 PM

Is it possible for your tokenizer to ever give a different tokenization than the OpenAI tokenizer? I am asking because there are multiple ways to tokenize the same string. Sorry if I am mistaken.

polynomial yesterday at 3:24 PM

Just to note that Tiktoken is still the tokenizer behind the GPT-4x series; it just uses a different token model. (The post only says GPT-3, implying they were using something else for subsequent iterations.)

EGreg yesterday at 2:29 PM

What about pairing this with BigBird and Mamba?

manishsharan yesterday at 1:56 PM

Is there a tokenizer someone can recommend for code? I have tried CodeBERT, but maybe I am using it wrong, as my results with it were pretty bad.

sheerun yesterday at 8:53 PM

Now that byte-patch-level embeddings have been discovered?

silentsea90 yesterday at 3:38 PM

"I’m teaching myself LLM internals by re-implementing the stack from first principles." - curious what resources you're using? Any books or courses, or just building it straight up? Great work!

janwilmake yesterday at 2:18 PM

You know what's also faster for roughly estimating the number of tokens? string.length/5
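For reference, the heuristic amounts to the sketch below; the ~5 characters per token figure is the commenter's rule of thumb and varies a lot with language and vocab.

```python
def approx_token_count(text: str) -> int:
    # Very rough estimate: assume ~5 characters per token on average.
    return len(text) // 5
```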
