
Confusables.txt and NFKC disagree on 31 characters

52 points by pimterry, last Monday at 12:55 PM | 36 comments

Comments

akersten today at 2:52 PM

Unicode is both the best thing that's ever happened to text encoding and the worst. The approach I take here is to treat any text coming from the user as toxic waste. Assume it will say "Administrator" or "Official Government Employee" or be 800 pixels tall because it was built only out of decorative combining characters. Then put it in a fixed box with overflow hidden, and use some other UI element to convey things like "this is an official account."

The worst part of normalizing and remapping characters, which this article doesn't even touch on, is the risk that your login form doesn't do it but your database does. Suddenly I can re-register an existing account using a different set of code points: the login system doesn't think the name exists, but the auth system maps it to somebody else's record.
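
A minimal sketch of the mismatch described above, assuming a front end that compares raw code points and a storage layer that applies NFKC. Both functions, `login_form_key` and `database_key`, are hypothetical stand-ins and not any real system's API.

```python
import unicodedata


def login_form_key(username: str) -> str:
    # Hypothetical front end: compares raw code points, no normalization.
    return username


def database_key(username: str) -> str:
    # Hypothetical storage layer: applies NFKC before the lookup.
    return unicodedata.normalize("NFKC", username)


existing = "admin"
spoofed = "\uff41dmin"  # FULLWIDTH LATIN SMALL LETTER A followed by "dmin"

# The form sees a name it has never encountered...
looks_new_to_form = login_form_key(spoofed) != login_form_key(existing)
# ...but the storage layer folds it onto the existing record.
collides_in_db = database_key(spoofed) == database_key(existing)

print(looks_new_to_form, collides_in_db)  # True True
```

The usual way out is to normalize once, at the boundary, and have every layer compare the same normalized key.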

joshdata today at 3:37 PM

> If your application also runs NFKC normalization (which it should — ENS, GitHub, and Unicode IDNA all require it)

That's not right. Most of the web requires NFC normalization, not NFKC. NFC doesn't lose information in the original string. It reorders and combines code points into equivalent code point sequences, e.g. to simplify equality tests.

In NFKC, the K for "Compatibility" means some characters are replaced with similar, simpler code points. I've found NFKC useful for making text search indexes where you want matches to be forgiving, but it would be both obvious and wrong to use it in most of the web because it would dramatically change what the user has entered. See the examples in https://www.unicode.org/reports/tr15/.
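
The difference can be seen with Python's standard `unicodedata` module. This is only an illustrative sketch; the sample strings are chosen for the example and do not come from the article.

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
superscript = "x\u00b2"  # "x" followed by SUPERSCRIPT TWO

# NFC only recombines canonically equivalent sequences.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
nfc_ligature = unicodedata.normalize("NFC", ligature)

# NFKC additionally replaces compatibility characters with simpler ones.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_superscript = unicodedata.normalize("NFKC", superscript)

print(nfc_decomposed == "\u00e9")  # True: composed into a single é
print(nfc_ligature == ligature)    # True: NFC leaves the ligature alone
print(nfkc_ligature)               # fi (two separate letters)
print(nfkc_superscript)            # x2 (the superscript is lost)
```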

Liftyee today at 3:18 PM

Does the "removing dead code" advantage outweigh the additional complexity of having to maintain 2 different confusables lists: one for when NFKC has been applied first and one without? It didn't sound like applying one after the other caused any errors, just that some previously reachable states are unreachable.

happytoexplain today at 3:21 PM

Tangential - I'm aware of various types of, let's say, "swappability" that Unicode defines (broader than the Unicode concept of "equivalence"):

- Canonical (NF)

- Compatible (NFK)

- Composed vs decomposed

- Confusable (confusables.txt)

Does Unicode not define something like "fuzzy" equivalence? Like "confusable" but more broad, for search bar logic? The most obvious differences would be case and diacritic insensitivity (e, é). Case is easy since any string/regex API supports case insensitivity, but diacritic insensitivity is not nearly as common, and there are other categories of fuzzy equivalence too (e.g. ø, o).

I guess it makes sense for Unicode to not be interested in defining something like this, since it relates neither to true semantics nor security, but it's an incredibly common pattern, and if they offered some standard, I imagine more APIs would implement it.
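
One common do-it-yourself approximation of the case- and diacritic-insensitive matching described above is to decompose, drop combining marks, and case-fold. The helper name `fold_for_search` is hypothetical, and this is a sketch of a widespread convention, not a Unicode-defined equivalence.

```python
import unicodedata


def fold_for_search(text: str) -> str:
    # Split precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (anything with a nonzero combining class).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()


print(fold_for_search("Café") == fold_for_search("CAFE"))  # True
print(fold_for_search("ø") == fold_for_search("o"))        # False
```

The second result shows the limit of this approach: ø has no decomposition into o plus a mark, so pairs like that need an explicit mapping table on top.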

kccqzy today at 2:58 PM

If you allow users to submit arbitrary Unicode strings as text, why would you need to check confusables.txt? Whose confusion are you guarding against?

brazzy today at 2:35 PM

> The correct use is to check whether a submitted identifier contains characters that visually mimic Latin letters, and if so, reject it

That is a really bad and user-hostile thing to do. Many of those characters are perfectly valid characters in various non-Latin scripts. If you want to force everyone to use Latin script for identifiers, then own up to it and say so. But rejecting just some of them for being too similar to Latin characters makes the behaviour inconsistent and confusing for users.

csense today at 3:30 PM

My theory: The "long S" in "Congreſs" is an f. They used f instead of s because without modern dental care, a lot of people in the 1600's and 1700's were miffing teeth and fpoke with a lifp.
