Sesse__ • last Tuesday at 3:25 PM • 1 reply • view on HN

It's why the Unicode Collation Algorithm exists.

If you look in allkeys.txt (the base UCA data, used if you don't have language-specific stuff in your comparisons) for the two code points in question, you'll find:

  004B  ; [.2514.0020.0008] # LATIN CAPITAL LETTER K
  212A  ; [.2514.0020.0008] # KELVIN SIGN

The numbers in the brackets are the weights on level 1 (base), level 2 (typically used for accents), and level 3 (typically used for case). The two entries are identical on all three levels, so the characters compare as equal under the UCA in almost every case; the only exception is when you really need a tiebreaker.

Compare that with, e.g.:

  1D424 ; [.2514.0020.0005] # MATHEMATICAL BOLD SMALL K

which would compare equal to those under a case-insensitive, accent-sensitive collation, but _not_ under a case-sensitive one (case-sensitive collations are always accent-sensitive, too).
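
To make the level comparison concrete, here is a rough Python sketch that parses just the three allkeys.txt lines quoted above and compares their weights at a chosen number of levels. It is not a UCA implementation: the names `parse_weights` and `equal_at` are made up for illustration, and it assumes one collation element per character, ignoring variable weighting, contractions, and normalization.

```python
import re

# The three allkeys.txt entries quoted above, copied verbatim.
ALLKEYS_SAMPLE = """\
004B  ; [.2514.0020.0008] # LATIN CAPITAL LETTER K
212A  ; [.2514.0020.0008] # KELVIN SIGN
1D424 ; [.2514.0020.0005] # MATHEMATICAL BOLD SMALL K
"""

# Matches "<code point> ; [.<primary>.<secondary>.<tertiary>]".
# Assumption: exactly one collation element per entry.
ENTRY = re.compile(
    r"^\s*([0-9A-F]+)\s*;\s*\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]"
)


def parse_weights(text):
    """Map each character to its (primary, secondary, tertiary) weights."""
    table = {}
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            cp, *levels = m.groups()
            table[chr(int(cp, 16))] = tuple(int(w, 16) for w in levels)
    return table


def equal_at(table, a, b, levels):
    """Compare two single characters using only the first `levels` levels."""
    return table[a][:levels] == table[b][:levels]


weights = parse_weights(ALLKEYS_SAMPLE)

LATIN_K = "\u004B"
KELVIN = "\u212A"
BOLD_SMALL_K = "\U0001D424"

# K and KELVIN SIGN agree on all three levels.
print(equal_at(weights, LATIN_K, KELVIN, 3))
# K and MATHEMATICAL BOLD SMALL K agree through level 2 only.
print(equal_at(weights, LATIN_K, BOLD_SMALL_K, 2))
print(equal_at(weights, LATIN_K, BOLD_SMALL_K, 3))
```

Here `levels=2` stands in for a case-insensitive, accent-sensitive comparison and `levels=3` for a case-sensitive one. For real work you would use a proper collation library rather than truncating weight tuples like this.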

Replies

happytoexplain • last Tuesday at 4:46 PM

Are the meanings for the levels for each code point defined somewhere (accent, casing, etc)?
