Very Interesting... I have similar idea to train LLM in Serbian, create even new encoding

topce • yesterday at 10:35 PM • 1 reply • view on HN

Very Interesting...

I have similar idea to train LLM in Serbian, create even new encoding https://github.com/topce/YUTF-8 inspired by YUSCII. Did not have time and money ;-) Great that you succeed. Idea if train in Serbian text encoded in YUTF-8 (not UTF-8) it will have less token when prompt in Serbian then English, also Serbian Cyrillic characters are 1 byte in YUTF-8 instead of 2 in UTF.Serbian language is phonetic we never ask how you spell it.Have Latin and Cyrillic letters.

Replies

xodn348 • yesterday at 10:43 PM

Really interesting approach — attacking token efficiency at the encoding level is more fundamental than what I did.

Even without retraining BPE from scratch, starting with YUTF-8 and measuring how existing tokenizers handle it would already be a worthwhile experiment.

Hope you find the time to build it, good luck!

alt Hacker News

Replies