Hacker News

Sesse__ last Tuesday at 3:25 PM

It's why the Unicode Collation Algorithm exists.

If you look in allkeys.txt (the base UCA data, used if you don't have language-specific stuff in your comparisons) for the two code points in question, you'll find:

  004B  ; [.2514.0020.0008] # LATIN CAPITAL LETTER K
  212A  ; [.2514.0020.0008] # KELVIN SIGN
The numbers in the brackets are the weights on level 1 (base), level 2 (typically used for accents), and level 3 (typically used for case). So they compare as identical under the UCA in almost every case, unless you really need a tiebreaker.

Compare, e.g.:

  1D424 ; [.2514.0020.0005] # MATHEMATICAL BOLD SMALL K
which would compare equal to those under a case-insensitive, accent-sensitive collation, but _not_ a case-sensitive one (case-sensitive collations are always accent-sensitive, too).
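
To make the level idea concrete, here is a rough Python sketch that parses just the three allkeys.txt lines quoted above and compares their weights up to a chosen level. The names parse_weights and equal_at_level are made up for illustration, and this is not the real UCA (no sort key construction, no variable weighting, no contractions, no tiebreaker); it only handles single characters whose entry is listed in the sample.

```python
import re

# The three allkeys.txt lines quoted above, copied verbatim.
ALLKEYS_SAMPLE = """\
004B  ; [.2514.0020.0008] # LATIN CAPITAL LETTER K
212A  ; [.2514.0020.0008] # KELVIN SIGN
1D424 ; [.2514.0020.0005] # MATHEMATICAL BOLD SMALL K
"""

# One collation element: [.primary.secondary.tertiary] (or [*...] for variables).
ELEMENT = re.compile(r"\[[.*]([0-9A-F]{4,})\.([0-9A-F]{4,})\.([0-9A-F]{4,})\]")


def parse_weights(text):
    """Hypothetical helper: map each character to its list of (L1, L2, L3) weights."""
    table = {}
    for line in text.splitlines():
        entry = line.split("#", 1)[0]  # drop the trailing comment
        if ";" not in entry:
            continue
        code_points, elements = entry.split(";", 1)
        char = "".join(chr(int(cp, 16)) for cp in code_points.split())
        table[char] = [
            tuple(int(w, 16) for w in m.groups())
            for m in ELEMENT.finditer(elements)
        ]
    return table


def equal_at_level(table, a, b, level):
    """Hypothetical helper: True if a and b agree on levels 1..level.

    level=1 ignores accents and case, level=2 is accent-sensitive but
    case-insensitive, level=3 is also case-sensitive.
    """
    key_a = [ce[:level] for ce in table[a]]
    key_b = [ce[:level] for ce in table[b]]
    return key_a == key_b


table = parse_weights(ALLKEYS_SAMPLE)
K, KELVIN, BOLD_SMALL_K = "\u004B", "\u212A", "\U0001D424"

print(equal_at_level(table, K, KELVIN, 3))        # True: identical on all three levels
print(equal_at_level(table, K, BOLD_SMALL_K, 2))  # True: same base and accent weights
print(equal_at_level(table, K, BOLD_SMALL_K, 3))  # False: 0008 vs 0005 on level 3
```

A real collator (ICU, for instance) exposes the same idea as a "strength" setting rather than making you truncate weights by hand.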

Replies

happytoexplain last Tuesday at 4:46 PM

Are the meanings for the levels for each code point defined somewhere (accent, casing, etc)?
