> If your application also runs NFKC normalization (which it should — ENS, GitHub, and Unicode IDNA all require it)
That's not right. Most of the web requires NFC normalization, not NFKC. NFC doesn't lose information in the original string: it only reorders and composes code points into canonically equivalent sequences, e.g. to make equality tests simpler.
In NFKC, the K stands for "Compatibility", which means some characters are replaced with similar, simpler code points. I've found NFKC useful for building text search indexes where you want matches to be forgiving, but it would be obviously wrong to use it across most of the web because it would dramatically change what the user entered. See the examples in https://www.unicode.org/reports/tr15/.
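A quick sketch of the difference in Python's standard `unicodedata` module: NFC leaves the "ﬁ" ligature alone (it only recomposes canonically equivalent sequences), while NFKC folds it to plain "fi", visibly changing the user's input.

```python
import unicodedata

# NFKC's compatibility mapping replaces the "fi" ligature (U+FB01)
# with the two ASCII letters "fi"; NFC leaves it untouched.
ligature = "\ufb01le"  # "ﬁle"
print(unicodedata.normalize("NFC", ligature))   # unchanged: "ﬁle"
print(unicodedata.normalize("NFKC", ligature))  # folded: "file"

# NFC only recomposes equivalent sequences, e.g. "e" followed by
# a combining acute accent becomes the single code point U+00E9.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(composed == "\u00e9")  # True
```

Both strings above render identically as "é", which is exactly why NFC is used before equality tests.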
Thanks Josh - putting this article out there has pushed me to sharpen a lot of my thinking, which I hope comes across in my more recent work. I've updated the article to scope the NFKC recommendation to identifiers and added a note crediting your correction. Thanks for catching it.
I feel like for search, NFKD followed by removing all the combining characters would be a better bet than NFKC.
Of course there are also purpose specific algorithms for preparing text for search that would be even better.
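A minimal sketch of the NFKD-plus-strip approach described above (the function name is mine, just for illustration): decompose with NFKD so accents become separate combining marks, then drop every code point that `unicodedata.combining` flags as a mark.

```python
import unicodedata

def fold_for_search(text: str) -> str:
    """Accent-folding sketch: NFKD-decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_for_search("Crème brûlée"))   # "Creme brulee"
print(fold_for_search("\ufb01ancée"))    # "fiancee" (ligature also folded)
```

This makes "creme" match "crème" in an index, which is the forgiving behavior you want for search but would be destructive if applied to stored user text.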
I think we're expecting too much from an LLM-generated article by a user who has been spending a lot of time spamming their content across multiple platforms and websites.