logoalt Hacker News

hrimfaxiyesterday at 1:00 PM2 repliesview on HN

Why does this G character appear to prefix most of the output? ("Ġlike")


Replies

frwickstyesterday at 1:31 PM

It is a tokenizer artifact most likely (https://github.com/huggingface/transformers/issues/4786). So the output is not properly decoded in this case, it should just be a space.

kgeistyesterday at 1:30 PM

The original tokens have Ġ instead of space. I had this issue too when writing an inference engine for Qwen. You have to "normalize" those special characters.