logoalt Hacker News

vidarhyesterday at 10:12 PM1 replyview on HN

> English and Norwegian are not closely related language families, perhaps LoRA is not best approach?

Yes, they are. English is a West Germanic language. Norwegian is a North Germanic language. The French vocabulary in English obscures it a bit, but the two languages have similar grammar and the vocabulary has a huge number of close cognates.

E.g. day -> dag, ship -> skip, apple -> eple, cow -> ku (which makes more sense when you pronounce them correctly out loud), bairn (child; mostly Scotland and Northern England) -> barn, hop -> hopp, yule -> jul just to give a random selection of English Germanic words.

But more than that, the frontier models both a) knows Norwegian quite well, b) certainly knowns German and Dutch well, and there's a continuum of language transfer around the North sea especially when accounting for sounds rather than modern orthography, e.g. to take a couple of examples from above: ship -> schip -> Schiff -> skib -> skip; day -> dag -> Tag -> dag). The "jump" to Dutch already weeds out most of the French. A lot of modern Norwegian orthography comes from Danish, which again shares more than modern Norwegian does with German.

Knowing any of these helps a lot with learning Norwegian and vice versa. E.g. I'm Norwegian, I've never learnt Dutch, but I have learnt English and German, and I can read Dutch fairly well from that alone.


Replies

everforwardyesterday at 10:49 PM

This makes me deeply curious about how LLMs understand language. Do LLMs relate cognates more than words that are dissimilar in different languages? I wonder if that plays some role in the effectiveness of tokenization.

show 1 reply