> English and Norwegian are not closely related language families, perhaps LoRA is not best approach?
Yes, they are. English is a West Germanic language. Norwegian is a North Germanic language. The French vocabulary in English obscures it a bit, but the two languages have similar grammar and the vocabulary has a huge number of close cognates.
E.g. day -> dag, ship -> skip, apple -> eple, cow -> ku (which makes more sense when you pronounce them correctly out loud), bairn (child; mostly Scotland and Northern England) -> barn, hop -> hopp, yule -> jul just to give a random selection of English Germanic words.
But more than that, the frontier models both a) knows Norwegian quite well, b) certainly knowns German and Dutch well, and there's a continuum of language transfer around the North sea especially when accounting for sounds rather than modern orthography, e.g. to take a couple of examples from above: ship -> schip -> Schiff -> skib -> skip; day -> dag -> Tag -> dag). The "jump" to Dutch already weeds out most of the French. A lot of modern Norwegian orthography comes from Danish, which again shares more than modern Norwegian does with German.
Knowing any of these helps a lot with learning Norwegian and vice versa. E.g. I'm Norwegian, I've never learnt Dutch, but I have learnt English and German, and I can read Dutch fairly well from that alone.
This makes me deeply curious about how LLMs understand language. Do LLMs relate cognates more than words that are dissimilar in different languages? I wonder if that plays some role in the effectiveness of tokenization.