
superlopuh (yesterday at 5:27 PM)

Can someone familiar with the performance of LLMs please tell me how important this is to overall perf? I'm interested in looking into optimizing tokenizers but have not yet run the measurements. I would have assumed that the cost is generally dominated by matmuls, but I'm encouraged by the reception of this post in the comments.


Replies

refibrillator (yesterday at 5:49 PM)

Tokenization is typically done on CPU and is rarely (if ever) a bottleneck for training or inference.

GPU kernels typically dominate in terms of wall clock time; the only exception might be very small models.

Thus the latency of tokenization can essentially be "hidden" by having the CPU prepare the next batch while the GPU finishes the current batch, as in the sketch below.
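A minimal sketch of that overlap, using a bounded queue with tokenization on a background thread. The `corpus` list and `run_on_gpu` stub are stand-ins I've made up for illustration; the producer/consumer pattern is the point, not the exact APIs.

```python
import queue
import threading

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
corpus = ["example document text"] * 256   # stand-in corpus
batches = queue.Queue(maxsize=4)           # bounded so the CPU can't run too far ahead

def run_on_gpu(batch):
    pass  # stand-in for the accelerator forward pass on a token batch

def producer(texts, batch_size=32):
    # Tokenize on the CPU while the main thread keeps the GPU busy.
    for i in range(0, len(texts), batch_size):
        batches.put([enc.encode(t) for t in texts[i:i + batch_size]])
    batches.put(None)                      # sentinel: no more work

threading.Thread(target=producer, args=(corpus,), daemon=True).start()
while (batch := batches.get()) is not None:
    run_on_gpu(batch)
```

As long as tokenizing one batch takes less time than the GPU spends on the previous one, tokenization contributes nothing to end-to-end latency.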

serjester (yesterday at 5:44 PM)

Tokenizing text is a ridiculously small part of the overall computation that goes into serving a request. With that said, if you're doing this on petabytes of data, it never hurts to have something faster; a quick throughput check like the one below makes the scale concrete.
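A rough way to measure where your tokenizer sits, using tiktoken as a stand-in (any encoder with an `encode` method works the same way); the sample text and its size are arbitrary choices for illustration.

```python
import time

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox jumps over the lazy dog. " * 20_000  # ~900 KB sample

start = time.perf_counter()
n_tokens = len(enc.encode(text))
elapsed = time.perf_counter() - start
print(f"{n_tokens / elapsed / 1e6:.2f} M tokens/s, {len(text) / elapsed / 1e6:.1f} MB/s")
```

Multiply the MB/s figure out to a petabyte and the CPU-hours saved by a faster tokenizer stop being negligible, even though they're invisible on a single request.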

matthewolfe (yesterday at 10:14 PM)

To echo the other replies, the tokenizer is definitely not the bottleneck. It just happens to be the first step in inference, so it's what I did first.

benreesman (yesterday at 8:04 PM)

Tokenization performance is complicated, but your guidepost is that the institutions with the resources and talent to do so choose to write extremely fast tokenizers: sentencepiece and tiktoken both pay dearly in complexity for it (particularly deployment complexity, because now you've got another axis of architecture-specific build/bundle/dylib to manage on top of whatever your accelerator burden already was: it's now aarch64 cross x86_64 cross CUDA capability...).

Sometimes it can overlap with accelerator work, but pros look at flame graphs: a CPU core running the AVX lanes hard while failing to keep the bus fed, a million things like that. People pre-tokenize big runs all the time, roughly as sketched below.
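A minimal sketch of pre-tokenizing a corpus once so training never pays the tokenization cost again. The `pretokenize` helper, the `corpus.bin` filename, and the uint32 choice are my assumptions for illustration, not anyone's actual pipeline.

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pretokenize(docs, out_path="corpus.bin"):
    ids = []
    for doc in docs:
        ids.extend(enc.encode(doc))
        ids.append(enc.eot_token)   # end-of-text token as a document separator
    # cl100k's vocabulary fits comfortably in 32 bits
    np.asarray(ids, dtype=np.uint32).tofile(out_path)

pretokenize(["first document", "second document"])

# Training later memory-maps the token stream and never touches the tokenizer:
tokens = np.memmap("corpus.bin", dtype=np.uint32, mode="r")
```

Once the corpus is a flat array of token IDs on disk, every epoch is a memory-mapped read, and tokenizer speed only matters for the one-time preprocessing pass.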

I don't know why this thread is full of "nothing to see here": this obliterates the SOTA set by the money-is-no-object status quo. I'd like to think better of the community than the obvious explanation, which is that C++ is threatening a modest mindshare comeback against a Rust narrative already under pressure from the explosion of interest in Zig. Maybe there's a better reason.