> It's less pronounced with diacritics, but enter Unicode normal forms: you can represent š ...

tkot • yesterday at 9:34 PM • 0 replies • view on HN

> It's less pronounced with diacritics, but enter Unicode normal forms: you can represent š either as š, or s followed by a diacritic. When you want to compare two strings, you have to normalize them to ensure you are comparing apples to apples. I can guarantee most software is broken in that regard. For Cyrillic, it just works.

It's the same with Unicode encoding of Cyrillic letters - й (U+0439) can be written as й (и U+0438 + ◌̆ U+0306)

> Interestingly — and not many know this — Unicode includes separate codepoints for all of the digraphs too. While well-intentioned, it only makes the problem worse.

Based on your description it seems that the root cause of the issues is using two letters to represent the digraph - for example N (U+004E) J (U+004A) instead of Ǌ (U+01CA) - and the sorting issues would be identical if people typed Н (U+041D) Ь (U+042C)instead of Њ (U+040A).

What's the reason for the digraph being substituted by 2 letters in the first case more often than in the second case?

alt Hacker News