> I find it confusing when Unicode "shadows" of normal letters exist, and those are of ...

jjmarr • last Tuesday at 2:01 PM • 2 replies • view on HN

> I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like

Isn't this why Unicode normalization exists? This would let you compare Unicode letters and determine if they are canonically equivalent.

Replies

Sesse__ • last Tuesday at 3:25 PM

It's why the Unicode Collation Algorithm exists.

If you look in allkeys.txt (the base UCA data, used if you don't have language-specific stuff in your comparisons) for the two code points in question, you'll find:

  004B  ; [.2514.0020.0008] # LATIN CAPITAL LETTER K
  212A  ; [.2514.0020.0008] # KELVIN SIGN

The numbers in the brackets are values on level 1 (base), level 2 (typically used for accents), level 3 (typically used for case). So they are to compare identical under the UCA, in almost every case except for if you really need a tiebreaker.

Compare e.g. :

  1D424 ; [.2514.0020.0005] # MATHEMATICAL BOLD SMALL K

which would compare equal to those under a case-insensitive accent-sensitive collation, but _not_a case-sensitive one (case-sensitive collations are always accent-sensitive, too).

➕ show 1 reply

ComputerGuru • last Tuesday at 2:41 PM

Normalization wouldn’t address this.

➕ show 2 replies

alt Hacker News

Replies