logoalt Hacker News

Zigurdtoday at 2:57 PM1 replyview on HN

Based on the code that it's good at, and the code that it's terrible at, you are exactly right about LLMs being shaped by their training material. If this is a fundamental limitation I really don't see general purpose LLMs progressing beyond their current status is idiot savants. They are confident in the face of not knowing what they don't know.

Your experience with Arabic in particular makes me think there's still a lot of training material to be mined in languages other than English. I suspect the reason that Arabic sounds 20 years ago is that there's a data labeling bottleneck in using foreign language material.


Replies

parineumtoday at 4:10 PM

I've had a suspicion for a bit that, since a large portion of the Internet is English and Chinese, that any other languages would have a much larger ratio of training material come from books.

I wouldn't be surprised if Arabic in particular had this issue and if Arabic also had a disproportionate amount of religious text as source material.

I bet you'd see something similar with Hebrew.

show 2 replies