parineum • today at 4:10 PM • 2 replies

I've had a suspicion for a while that, since a large portion of the Internet is in English and Chinese, any other language would have a much larger share of its training material come from books.

I wouldn't be surprised if Arabic in particular had this issue, and if it also had a disproportionate amount of religious text as source material.

I bet you'd see something similar with Hebrew.

Replies

mentalgear • today at 7:55 PM

I think therein lies another fun benchmark to show that LLMs don't generalize: ask the LLM to solve the same logic riddle, only in different languages. If it can solve it in some languages but not in others, that's a strong argument for straightforward memorization and next-token prediction over true generalization capabilities.
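
A rough sketch of what such a check could look like, in Python. Everything here is made up for illustration: `ask_model` stands in for whatever LLM call you'd actually use, `fake_model` is a stub that only "understands" English, and the prompts are a trivial stand-in for a real logic riddle.

```python
from typing import Callable, Dict


def cross_lingual_results(
    ask_model: Callable[[str], str],
    prompts: Dict[str, str],
    is_correct: Callable[[str], bool],
) -> Dict[str, bool]:
    """Pose the same puzzle in several languages; return {language: solved?}."""
    return {lang: is_correct(ask_model(prompt)) for lang, prompt in prompts.items()}


def is_consistent(results: Dict[str, bool]) -> bool:
    """True if the model solved the puzzle in every language or in none."""
    return len(set(results.values())) <= 1


# Hypothetical stand-in prompts: same question, three languages.
prompts = {
    "en": "What is 2 + 3? Answer with a number only.",
    "de": "Was ist 2 + 3? Antworte nur mit einer Zahl.",
    "fr": "Combien font 2 + 3 ? Réponds uniquement par un nombre.",
}


def fake_model(prompt: str) -> str:
    """Stub model (assumption): answers correctly only for the English prompt."""
    return "5" if prompt.startswith("What") else "6"


results = cross_lingual_results(fake_model, prompts, lambda answer: answer.strip() == "5")
print(results)                 # {'en': True, 'de': False, 'fr': False}
print(is_consistent(results))  # False
```

With a real model you'd want many riddles and several samples per language before reading anything into the gap, since a single miss could just as well be a translation problem.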

eshaham78 • today at 7:07 PM

[dead]
