> It's less pronounced with diacritics, but enter Unicode normal forms: you can represent š either as š, or s followed by a diacritic. When you want to compare two strings, you have to normalize them to ensure you are comparing apples to apples. I can guarantee most software is broken in that regard. For Cyrillic, it just works.
It's the same with the Unicode encoding of Cyrillic letters - й (U+0439) can also be written as и (U+0438) followed by the combining breve ◌̆ (U+0306), so Cyrillic text needs normalizing before comparison too.
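To make that concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `same_text` is just something I made up for illustration, and picking NFC rather than NFD is an arbitrary choice; either works as long as both sides use the same form.

```python
import unicodedata

precomposed = "\u0439"       # й as a single code point
decomposed = "\u0438\u0306"  # и followed by the combining breve


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain comparison sees different code point sequences.
raw_equal = precomposed == decomposed

# After normalization the two spellings compare equal.
normalized_equal = same_text(precomposed, decomposed)

print(raw_equal, normalized_equal)
```

So the "compare apples to apples" step is needed for Cyrillic just as much as for š.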
> Interestingly — and not many know this — Unicode includes separate codepoints for all of the digraphs too. While well-intentioned, it only makes the problem worse.
Based on your description, it seems that the root cause of the issues is using two letters to represent the digraph - for example N (U+004E) J (U+004A) instead of Ǌ (U+01CA) - and the sorting issues would be identical if people typed Н (U+041D) Ь (U+042C) instead of Њ (U+040A).
What's the reason for the digraph being substituted by 2 letters in the first case more often than in the second case?
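For what it's worth, normalization doesn't bridge the gap in either script. A small Python sketch, again with the standard `unicodedata` module (the variable names are mine, just for illustration): the Latin digraph Ǌ (U+01CA) has a compatibility decomposition to the two letters N and J, so NFKC turns it into the pair, but no normalization form turns the pair back into the digraph. The Cyrillic Њ (U+040A) has no decomposition at all, so it is never related to Н + Ь.

```python
import unicodedata

latin_digraph = "\u01CA"        # Ǌ, single code point
latin_pair = "\u004E\u004A"     # N followed by J
cyrillic_letter = "\u040A"      # Њ, single code point
cyrillic_pair = "\u041D\u042C"  # Н followed by Ь

# Canonical normalization leaves the Latin digraph alone...
nfc_digraph = unicodedata.normalize("NFC", latin_digraph)

# ...but compatibility normalization splits it into two letters.
nfkc_digraph = unicodedata.normalize("NFKC", latin_digraph)

# The two-letter spelling is never merged into the digraph.
nfkc_pair = unicodedata.normalize("NFKC", latin_pair)

# The Cyrillic letter and the two-letter sequence both pass through unchanged.
nfkc_cyrillic_letter = unicodedata.normalize("NFKC", cyrillic_letter)
nfkc_cyrillic_pair = unicodedata.normalize("NFKC", cyrillic_pair)

print(nfkc_digraph == latin_pair)                 # digraph collapses to the pair
print(nfkc_cyrillic_letter == nfkc_cyrillic_pair) # Cyrillic forms stay distinct
```

That would suggest the encoding situation is symmetric and the difference is in what people actually type, which is why I'm asking.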