Hacker News

frabcus yesterday at 2:04 PM (3 replies)

Is there any way we can get local tokenizers for other LLMs? E.g. Gemini only offers a remote API for its tokenizer. Is it proprietary? Could we efficiently infer the token mapping by making lots of calls?
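As a rough illustration of the brute-force idea, here's a sketch in Python (assuming the google-generativeai package and its public count_tokens endpoint; is_single_token is a hypothetical helper, not part of any SDK): a string is a single token exactly when the remote tokenizer counts it as one. Recovering a ~256k-entry vocabulary this way would still need roughly one call per candidate, so "efficiently" is doing a lot of work here.

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    def is_single_token(s: str) -> bool:
        # count_tokens is the remote tokenizer: one API round-trip per probe.
        return model.count_tokens(s).total_tokens == 1

    # Probe a few candidates; leading spaces matter to most subword vocabularies.
    for candidate in ["hello", " hello", "tokenization"]:
        print(repr(candidate), is_single_token(candidate))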


Replies

Deathmax yesterday at 2:56 PM

Gemini uses SentencePiece [1], and the proprietary Gemini models share the same tokenizer vocabulary as Gemma [2, 3, 4].

Of the large proprietary Western AI labs (OpenAI, Anthropic, Google), only Anthropic lacks local tokenizers for Claude 3 and newer. (Since Gemini shares Gemma's vocabulary, a sketch of tokenizing locally with Gemma's tokenizer follows the references.)

[1] https://github.com/google/sentencepiece

[2] https://github.com/googleapis/python-aiplatform/blob/main/ve...

[3] https://storage.googleapis.com/deepmind-media/gemma/gemma-2-...: "We inherit from the large Gemini vocabulary (256k entries)."

[4] https://storage.googleapis.com/deepmind-media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini 2.0."
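Because the vocabulary is shared, you can tokenize locally with the SentencePiece library and Gemma's model file. A minimal sketch, assuming you've downloaded tokenizer.model from a Gemma checkpoint (the path is an assumption):

    import sentencepiece as spm

    # tokenizer.model ships with Gemma checkpoints; adjust the path as needed.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

    ids = sp.encode("The quick brown fox", out_type=int)
    pieces = sp.encode("The quick brown fox", out_type=str)
    print(ids)     # token ids
    print(pieces)  # subword pieces, e.g. ['▁The', '▁quick', '▁brown', '▁fox']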

matthewolfe yesterday at 3:02 PM

A lot of model-specific tokenizers have reference implementations ([0], [1]). Underlying them is a core algorithm like SentencePiece or byte-pair encoding (BPE); Tiktoken and TokenDagger are BPE implementations. The wrapping "tokenizer" mostly deals with the quirks of the vocabulary and with handling special tokens.
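For a concrete sense of the bare BPE layer, a quick sketch with tiktoken (cl100k_base is just one of tiktoken's published encodings):

    import tiktoken

    # cl100k_base is the BPE vocabulary used by several OpenAI models.
    enc = tiktoken.get_encoding("cl100k_base")

    ids = enc.encode("Tokenizers are mostly BPE underneath.")
    print(ids)
    print(enc.decode(ids))  # round-trips back to the original string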

For this project, I think there is value in building some of these model-specific quirks into the library. It could yield minor performance gains and generally make the library easier to integrate with. Keeping up with newer models probably isn't much work, since tokenizers change far less often than the models themselves. (A special-token example follows the links below.)

[0] https://github.com/meta-llama/llama-models/blob/01dc8ce46fec...

[1] https://github.com/mistralai/mistral-common/tree/main/src/mi...
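As an example of the special-token quirks the wrapper layer has to handle, tiktoken refuses special tokens in ordinary text unless you opt in explicitly per call. A sketch, using tiktoken's documented encode options:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "<|endoftext|>Hello"

    # enc.encode(text) would raise ValueError here: special tokens are
    # disallowed by default and must be whitelisted explicitly.
    ids = enc.encode(text, allowed_special={"<|endoftext|>"})
    print(ids)  # the first id is the special token's id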

weberer yesterday at 2:45 PM

I thought Gemini used SentencePiece

https://github.com/google/sentencepiece